My role is to manage the epics for the data products I worked on myself the year before. This means I am responsible for creating and defining epics, communicating them to the product teams that implement them, monitoring progress and helping as a subject matter expert, and communicating with the stakeholders. I also work closely with the data engineers and data analysts to ensure that the product is delivered on time and meets the stakeholders' requirements.
Our task was to combine data from various sources (the GOPACS platform and several registers from regional grid operators) and transform this data into the data warehouse for subsequent reporting in Tableau. The reports contained information on all non-GOPACS-registered connections and their sizes, used to actively target recruitment so that more flexible capacity becomes available to solve congestion problems, as well as insights into congestion announcements and their solutions. This included data modeling in Snowflake and dbt and data transformation in Python and SQL. We also set up a data pipeline in Airflow and Kafka to automate the data flow from the sources to the data warehouse.
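As a rough illustration of the orchestration layer, the sketch below shows a minimal Airflow 2 DAG with a daily extract-and-load flow. The DAG name, task names and helper functions are placeholders, not the actual pipeline.

```python
# Minimal Airflow 2 sketch of a daily ingestion DAG; names and helpers are
# illustrative placeholders, not the production pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_gopacs(**context):
    """Pull the latest congestion announcements from the GOPACS source (placeholder)."""
    ...


def load_to_snowflake(**context):
    """Load the raw extracts into the Snowflake staging schema (placeholder)."""
    ...


with DAG(
    dag_id="gopacs_ingestion",          # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",                  # Airflow 2.4+ style
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_gopacs", python_callable=extract_gopacs)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)

    extract >> load   # dbt models transform the staged data downstream
```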
The strategies were balanced positions on the imbalance market, keeping a customizable amount of energy in reserve in the batteries so that sudden price spikes could always be exploited. The strategies were backtested on historical data, and the results were proposed to the client as a basis for market entry.
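The toy function below only illustrates the reserve idea: keep at least a configurable amount of energy in the battery so a sudden imbalance-price spike can always be served. The thresholds and numbers are made up, not the client's parameters.

```python
# Toy illustration of the reserve rule; all numbers are illustrative.
def dispatch(state_of_charge_kwh: float,
             imbalance_price: float,
             reserve_kwh: float = 200.0,
             spike_threshold: float = 400.0) -> str:
    if imbalance_price >= spike_threshold and state_of_charge_kwh > 0:
        return "discharge"                      # sell into the spike
    if state_of_charge_kwh < reserve_kwh:
        return "charge"                         # rebuild the reserve first
    return "hold"                               # wait for the next opportunity


print(dispatch(state_of_charge_kwh=150, imbalance_price=80))   # -> "charge"
print(dispatch(state_of_charge_kwh=350, imbalance_price=520))  # -> "discharge"
```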
When the amount of energy consumed by the end of a year is higher than what was invoiced, the consumer has to pay the difference. But on which monthly price is this reconciliation based? With highly fluctuating prices, this can lead to significant errors.
We analyzed all invoice data and flagged suspicious invoices with unusual consumption peaks that were probably caused by the reconciliation process. We then compared the flagged invoices with the actual consumption data to determine the accuracy of the reconciliation process and recalculated a fairer price.
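The fragment below is only a miniature of the flagging and recalculation idea: an invoice whose consumption sits far above the customer's typical monthly level is marked as a likely reconciliation correction, and a consumption-weighted average price is shown as one possible fairer settlement. Column names, the 3x rule and the numbers are illustrative.

```python
import pandas as pd

# Toy example of the flagging idea; the 3x threshold and data are made up.
invoices = pd.DataFrame({
    "month": pd.period_range("2022-01", periods=6, freq="M"),
    "kwh": [300, 310, 290, 305, 1400, 295],      # the 1400 kWh month hides a year's correction
    "price_eur_per_kwh": [0.25, 0.27, 0.30, 0.55, 0.70, 0.40],
})

typical = invoices["kwh"].median()
invoices["flagged"] = invoices["kwh"] > 3 * typical

# One possible fairer settlement: spread the corrected volume over the year at the
# consumption-weighted average price instead of the single (expensive) month.
weighted_price = (invoices["kwh"] * invoices["price_eur_per_kwh"]).sum() / invoices["kwh"].sum()
print(invoices[["month", "kwh", "flagged"]])
print(f"consumption-weighted price: {weighted_price:.3f} EUR/kWh")
```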
Even though the models and methodology have already been developed and tested, the simulation is run locally and ad hoc, and data may need to be moved or copied manually. This would not be a real issue if the model were run just once a year; however, to develop new features and include new data, they need an easier way of developing and testing.
To visualize the input data on their existing Tableau server, we set up a staging database to store the raw and processed input data and push it to Tableau automatically, replacing the manual file upload used previously. In addition, we built an API connection to their ShareFile server so that the various scripts read from a centralized data source.
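A minimal sketch of the staging load is shown below, assuming a Postgres-backed staging database read by Tableau; the connection string, file name and table name are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical staging load: write processed simulation input to a table that the
# Tableau server reads directly, replacing the manual file upload.
engine = create_engine("postgresql+psycopg2://user:password@staging-host:5432/staging")  # placeholder DSN

processed = pd.read_csv("processed_input.csv")      # output of the existing scripts
processed.to_sql(
    "simulation_input",       # illustrative table name
    engine,
    schema="staging",
    if_exists="replace",      # full refresh on every run
    index=False,
)
```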
Finally, we developed a GitLab CI pipeline to automate the testing of their methodology. The pipeline runs the new scripts on a set of test data and compares the output with that of a previous version, fetched via the API, to check for discrepancies. It also checks code formatting and style to improve the readability of all scripting efforts.
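The regression check at the heart of that pipeline could look like the sketch below; here local CSV files stand in for the reference output that was actually retrieved via the API, and the paths and tolerance are examples.

```python
import pandas as pd

# Illustrative regression check: compare the new run against a stored reference
# output and fail the CI job on any discrepancy. Paths and tolerance are examples.
def compare_outputs(new_path: str, reference_path: str, tolerance: float = 1e-9) -> None:
    new = pd.read_csv(new_path).sort_index(axis=1)
    ref = pd.read_csv(reference_path).sort_index(axis=1)
    pd.testing.assert_frame_equal(new, ref, check_exact=False, atol=tolerance)


if __name__ == "__main__":
    compare_outputs("output/new_run.csv", "tests/reference_run.csv")
    print("No discrepancies with the previous version.")
```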
I built and maintained a job board web scraper that ran daily throughout 2023. It was a Docker container deployed on GCP via Kubernetes through GitLab. After a job ad's text and additional information were scraped, they were stored in our Postgres database.
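In miniature, the scrape-and-store step looks like the fragment below; the URL, CSS selector, connection string and table schema are placeholders, not the actual job board or database layout.

```python
import psycopg2
import requests
from bs4 import BeautifulSoup

# Placeholder scrape-and-store step: fetch one job ad page, pull out the ad text,
# and insert it into Postgres.
resp = requests.get("https://example-jobboard.nl/vacancy/123", timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")
ad_text = soup.select_one("div.vacancy-description").get_text(strip=True)

conn = psycopg2.connect("dbname=jobs user=scraper host=db")   # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO job_ads (url, ad_text, scraped_at) VALUES (%s, %s, NOW())",
        (resp.url, ad_text),
    )
conn.close()
```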
The next step was to classify a job ad as green or not based on the skills and tasks present in the ad text itself. This means that a green company description does not count towards the job itself being green.
Our method was to match the job ad texts against an extensive set of 570 green skills from the ESCO framework. Because this framework was created by the EU, it is available in several languages, which is useful for job ads across different countries.
The first version of our model used Elasticsearch to match word stems efficiently, at the cost of losing word order. We manually labelled a set of 200 job ads to evaluate the model. Qualitative inspection showed that job ad texts were too complex for simple word matching.
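In elasticsearch-py 8 style, that first approach boils down to a stemmed match query like the one below; the index name, field, analyzer choice and example skill phrase are assumptions for illustration.

```python
from elasticsearch import Elasticsearch

# Sketch of the first approach: a stemmed full-text match of an ESCO green-skill
# phrase against indexed job ads. Index name, field and analyzer are placeholders;
# the language analyzer handles stemming, so word order is ignored.
es = Elasticsearch("http://localhost:9200")

green_skill = "duurzame energiebronnen beheren"   # example ESCO skill phrase
hits = es.search(
    index="job_ads",
    query={"match": {"ad_text": {"query": green_skill, "analyzer": "dutch"}}},
    size=10,
)
for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```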
The second version used OpenAI's gpt-3.5-turbo in combination with a vector database of embedded green skills as a knowledge base to consult when deciding whether a job ad text contained green skills and was therefore green.
This version performed significantly better and was great for demos, as it explained why a job ad was green and linked to the top three green skills it found.
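A minimal sketch of that retrieval-plus-LLM flow is shown below, assuming the OpenAI Python client (v1+); the retrieval helper, its hard-coded example skills, and the prompt wording are placeholders rather than the actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def retrieve_similar_skills(ad_text: str, k: int = 3) -> list[str]:
    """Placeholder for the vector-store lookup of the k nearest ESCO green skills."""
    return ["monitor energy consumption", "promote sustainable energy",
            "advise on waste management"][:k]                 # example phrases only


def classify_job_ad(ad_text: str) -> str:
    """Ask the LLM whether the ad requires any of the retrieved candidate skills."""
    candidates = retrieve_similar_skills(ad_text)
    prompt = (
        "Job ad:\n" + ad_text + "\n\n"
        "Candidate green skills:\n- " + "\n- ".join(candidates) + "\n\n"
        "Does this job require any of these green skills? Answer 'green' or "
        "'not green' and name the matching skills."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```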
Our final version was a lightweight model, used to classify all ~50k job ads while saving on cost, runtime and carbon emissions. It used only the encoder part and computed the cosine similarity between the job ad text embedding and each green skill embedding from the vector DB.
If this similarity exceeded a threshold (optimized on our small test set), the job ad was considered green and the green skill with the highest similarity was added as explanation/proof.
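The lightweight classifier can be sketched as below with sentence-transformers; the model name, the two example skills and the 0.45 threshold are illustrative stand-ins, not the production choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Sketch of the lightweight classifier: embed the ad text once, compare it against
# precomputed green-skill embeddings, and apply the tuned threshold.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")   # example encoder

green_skills = ["monitor energy consumption", "promote sustainable energy"]  # 2 of the 570
skill_embeddings = model.encode(green_skills, normalize_embeddings=True)


def classify(ad_text: str, threshold: float = 0.45) -> tuple[bool, str | None]:
    ad_embedding = model.encode(ad_text, normalize_embeddings=True)
    similarities = util.cos_sim(ad_embedding, skill_embeddings).numpy().ravel()
    best = int(np.argmax(similarities))
    if similarities[best] >= threshold:
        return True, green_skills[best]     # green, with the matched skill as proof
    return False, None
```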
Finally, we clustered the embeddings of all 570 (rather specific) green skills into more general groups, to gain insight into which areas companies or sectors may be lagging behind or on track to reach climate goals.
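A grouping step of that kind could look like the k-means sketch below; the skill file, encoder and the choice of 20 clusters are assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative grouping step: embed the full green-skill list and cluster the
# embeddings into broader themes. File name and cluster count are examples.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
green_skills = [line.strip() for line in open("esco_green_skills.txt")]   # ~570 phrases
embeddings = model.encode(green_skills, normalize_embeddings=True)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

for cluster_id in range(3):                     # peek at a few of the broader groups
    members = [skill for skill, lbl in zip(green_skills, labels) if lbl == cluster_id]
    print(cluster_id, members[:5])
```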
For one of the largest third-party mortgage and consumer credit servicers in the Netherlands, we implemented a mirror reporting system to ensure the continuity of their reporting during and after a system migration for which their clients were not ready. Since the reports covered loans for large financial institutions, they had to be accurate so as not to disrupt the banks' models after migration.
First, we mapped the data from the new system to the old system's format, in direct collaboration with content experts at our client. Second, we implemented these mappings in SQL and SSIS packages so that the data is transformed in the same way as in their regular reporting, reducing the impact of our solution and the need for new infrastructure. Third, we extensively communicated and validated the results of our mapping with the stakeholders and made adjustments where necessary.
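The actual mapping was implemented in SQL and SSIS; the pandas fragment below only illustrates the idea of renaming and recoding new-system fields into the old layout. The column names, code table and file names are invented placeholders.

```python
import pandas as pd

# Illustration only: translate a new-system extract into the old system's layout.
new_system = pd.read_csv("new_system_loans.csv")          # placeholder extract

column_mapping = {                                         # new field -> old field
    "loan_identifier": "LoanNr",
    "outstanding_balance": "RestantHoofdsom",
    "arrears_status_code": "AchterstandCode",
}
status_recode = {"CUR": 0, "ARR30": 1, "ARR60": 2, "DEF": 3}   # new codes -> old codes

old_layout = (
    new_system.rename(columns=column_mapping)[list(column_mapping.values())]
    .assign(AchterstandCode=lambda df: df["AchterstandCode"].map(status_recode))
)
old_layout.to_csv("old_layout_extract.csv", index=False)
```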
Regulations required the implementation of cybersecurity measures in multiple databases used by the client's applications. In two rounds, I implemented the encryption in SSMS, first on the client's test environment and subsequently on their production environment.
The implementation was done live with the client to ensure they understood the process and held the passwords and certificates needed to manage the encrypted databases themselves. It also included validating the encryption on the failover databases.
When a boiler needs maintenance, a mechanic is dispatched to its location with information on the specific boiler and the maintenance to be performed. The company has a large database of boilers and maintenance records, but its data quality was unknown; the company wanted to know whether the data was reliable and, if not, how to improve it.
We performed a detailed analysis of the database's data quality and validated and/or imputed missing data. Since the data mostly originated from manual input by the mechanics, small issues such as typos and careless entry were the probable causes. We used a combination of error correction, statistical methods and business rules to perform this validation and imputation.
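The fragment below sketches the flavour of those checks: correcting common typos, enforcing a simple business rule, and imputing a missing value from comparable records. The column names, typo map and rules are invented examples, not the client's actual data.

```python
import pandas as pd

# Illustrative validation/imputation step on a placeholder boiler table.
boilers = pd.read_csv("boiler_records.csv")

# 1. Correct common typos in the brand field.
typo_map = {"Vailant": "Vaillant", "Remha": "Remeha"}
boilers["brand"] = boilers["brand"].replace(typo_map)

# 2. Business rule: installation year cannot be in the future or before 1970.
current_year = pd.Timestamp.now().year
boilers["install_year"] = boilers["install_year"].where(
    boilers["install_year"].between(1970, current_year)
)

# 3. Impute a missing power rating with the median for the same brand and model.
boilers["power_kw"] = boilers.groupby(["brand", "model"])["power_kw"] \
                             .transform(lambda s: s.fillna(s.median()))
```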
Besides these ad hoc improvements, we also advised the company on how to implement a more structured way of entering data to prevent these issues altogether.
In commercial research and development, new chemical entities are often first disclosed in patent applications, which primarily provide information on drug compounds and their targets. However, only a small percentage of these interactions appear in the scientific literature within a year, with an average delay of one to three years before publication. Extracting information about drug targets from patents is challenging due to their obscure language and mentions of unrelated genes and proteins. Manual annotation is currently required for this process, making it time-consuming and expensive.
To address the growing volume of biomedical and chemistry-related publications, automatic approaches such as NLP and text mining have become popular. Relevancy scoring aims to predict the importance of an entity to a document based on signals from its context and from domain knowledge.
The automatic extraction of information from chemical patents faces several challenges, including the absence of a suitable dataset for training deep learning models. The textual complexity and obfuscating language of patents contribute to the limited interest in this field compared to scientific publications. This work focused on predicting the relevancy of drug targets in patents, defining relevant entities as genes or proteins targeted by at least one drug or chemical compound mentioned in the patent.
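Purely as a generic framing, relevancy scoring can be cast as supervised classification over the local context of each gene or protein mention, as in the toy sketch below; the features, model and the two hand-written training examples are placeholders and not the method developed in this work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Generic sketch: score how likely a mention's surrounding sentence indicates a
# relevant drug target. Training data and model choice are illustrative only.
contexts = [
    "compound of formula (I) inhibits JAK2 kinase activity",   # toy relevant mention
    "beta-actin was used as a loading control",                # toy incidental mention
]
labels = [1, 0]

scorer = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
scorer.fit(contexts, labels)
print(scorer.predict_proba(["selective inhibitor of BRAF V600E"])[:, 1])
```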