Congestion Management Epic Owner
Management, Energy
With the increasing share of renewable energy sources, the energy market is becoming more volatile. This means that grid operators suffer more and more from energy shortages and surpluses. To prevent these shortages and surpluses, grid operators use congestion management. For our client, I manage epics on the data products I worked on myself the year before.


My role is to manage the epics on the data products I worked on myself the year before. This means that I am responsible for creating and defining epics, communicating them to the product teams for implementation, monitoring their status, helping out as a subject matter expert, and communicating with the stakeholders. I also work closely with the data engineers and data analysts to ensure that the products are delivered on time and meet the stakeholders' requirements.



congestion electricity
Congestion Management Insights
Data Engineering, Energy
With the increasing share of renewable energy sources, the energy market is becoming more volatile. This means that grid operators suffer more and more from energy shortages and surpluses. To prevent these shortages and surpluses, grid operators use congestion management. For our client we developed several dashboards that give insight into the congestion management on the GOPACS platform.


Our task was to combine data from various sources (the GOPACS platform and several registers from regional grid operators) and transform it into the data warehouse for subsequent reporting in Tableau. The reports contained information on all connections not registered on GOPACS and their sizes, so that they could be actively targeted for recruitment and provide more flexible capacity to solve congestion problems, as well as insights into congestion announcements and their solutions. This included data modeling in Snowflake and dbt, and data transformation in Python and SQL. We also set up a data pipeline in Airflow and Kafka to automate the data flow from the sources to the data warehouse.
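
As a rough illustration of the kind of pipeline described above, the sketch below shows a minimal daily Airflow DAG that loads GOPACS announcements into a staging area before further transformation. The DAG id, task name, and the load_gopacs_announcements helper are hypothetical, not the actual implementation.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_gopacs_announcements(**context):
    # Placeholder extract/load step: read the staged Kafka messages for the run date
    # and write them to a staging schema in the data warehouse.
    print(f"Loading GOPACS announcements for {context['ds']}")


with DAG(
    dag_id="gopacs_congestion_insights",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_gopacs_announcements",
        python_callable=load_gopacs_announcements,
    )
    # In practice this task would be followed by dbt transformation and Tableau refresh tasks.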



congestion electricity
Energy Market Trading Strategies
Artificial Intelligence, Energy
With the increasing share of renewable energy sources, the energy market is becoming more volatile. This volatility can be exploited by trading companies to make a profit. For our client we developed and tested several trading strategies to exploit the volatility in the energy market using their pooled home batteries.


The strategies traded on the imbalance market while always keeping a configurable amount of energy in the batteries, so that sudden price spikes could be exploited at any time. The strategies were backtested on historical data, and the results were proposed to the client as a basis for market entry.
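
A minimal backtest sketch of such a reserve-keeping strategy is given below. The column name, battery capacity, reserve level, and spike threshold are illustrative assumptions, not the parameters of the client's actual strategies.

import pandas as pd


def backtest_reserve_strategy(prices: pd.Series, capacity_mwh: float = 10.0,
                              reserve_frac: float = 0.5, spike_eur: float = 200.0,
                              step_mwh: float = 1.0) -> float:
    """Keep the pooled battery around a reserve level; discharge on imbalance price spikes."""
    soc = capacity_mwh * reserve_frac  # state of charge in MWh
    profit = 0.0
    for price in prices:
        if price >= spike_eur and soc >= step_mwh:
            soc -= step_mwh            # sell during a price spike
            profit += price * step_mwh
        elif price < spike_eur and soc < capacity_mwh * reserve_frac:
            soc += step_mwh            # buy back to restore the reserve
            profit -= price * step_mwh
    return profit


# Example with a hypothetical export of quarter-hourly imbalance prices:
# prices = pd.read_csv("imbalance_prices.csv")["price_eur_mwh"]
# print(backtest_reserve_strategy(prices))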



imbalance market trading optimization
Energy Invoice Anomaly Detection
Data Quality, Energy
Reconciliation is done at the end of a year or a contract to rectify the differences between the actual and the invoiced energy consumption. With highly fluctuating energy prices, the reconciliation process is not always accurate.


When the amount of energy consumed at the end of a year is higher than what was invoiced, the consumer has to pay the difference. But on which monthly price is this reconciliation based? With highly fluctuating prices, this can result in significant errors. We analyzed all invoice data and flagged suspicious invoices with unusual consumption peaks that were probably caused by the reconciliation process. Subsequently, we compared the flagged invoices with the actual consumption data to determine the accuracy of the reconciliation process, and recalculated a fairer price.
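
The sketch below illustrates the kind of peak-flagging rule used in this analysis; the column names and the 3x-median threshold are assumptions for illustration, not the exact rules from the project.

import pandas as pd


def flag_reconciliation_peaks(invoices: pd.DataFrame, factor: float = 3.0) -> pd.DataFrame:
    """Flag invoice lines whose consumption far exceeds the connection's monthly median."""
    median = invoices.groupby("connection_id")["consumption_kwh"].transform("median")
    out = invoices.copy()
    out["flagged"] = out["consumption_kwh"] > factor * median
    return out


# invoices = pd.read_csv("invoices.csv")  # hypothetical export with connection_id, month, consumption_kwh
# suspicious = flag_reconciliation_peaks(invoices).query("flagged")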



anomaly detection peaks reconciliation flag pricing
EU Grid Adequacy Forecasting
DevOps, Data Engineering, Energy
Our client runs yearly simulations to forecast EU grid adequacy. To improve the development of new features and the robustness of the simulations, we implemented ETL best practices and developed a GitLab CI pipeline for automated testing.


Even though the models and methodology had already been developed and tested, the simulation was run locally and ad hoc, and data sometimes had to be moved or copied manually. This would not be a real issue if the model were only run once a year, but to develop new features and include new data, they needed an easier way of developing and testing. To visualize the input data on their existing Tableau server, we set up a staging database to store the raw and processed input data and move it to Tableau automatically, instead of the manual file upload used previously. Besides that, we built a connection to their ShareFile server via its API to hook various scripts up to a centralized data source. Finally, we developed a GitLab CI pipeline to automate the testing of their methodology: the pipeline runs the new scripts on a set of test data and compares the output to that of a previous version via the API to check for discrepancies. The pipeline also checks code formatting and style to improve the readability of all scripting efforts.
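
Below is a sketch of the kind of regression check the CI pipeline runs against a previous version's output; the file names, tolerance, and pytest-style setup are illustrative assumptions.

import pandas as pd
import pandas.testing as pdt


def test_simulation_output_matches_reference():
    new = pd.read_csv("output/new_run.csv")          # produced by the current branch
    reference = pd.read_csv("output/reference.csv")  # fetched from the previous release
    pdt.assert_frame_equal(
        new.sort_index(axis=1),
        reference.sort_index(axis=1),
        check_exact=False,
        rtol=1e-6,  # allow tiny numerical drift between runs
    )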



ETL, Git, CI/CD, API, Grid Adequacy Forecasting
Green Job Ads Market Analysis
Artificial Intelligence, Natural Language Processing, Data Engineering, Human Resources
In this project we gathered job ad texts from online job boards and developed a method and model to classify green jobs and quantify their number per company, country, and sector.


I built and maintained a job board web scraper that ran daily in 2023, deployed as a Docker container on GCP via K8s from GitLab. After a job ad text and its additional information were scraped, they were stored in our Postgres DB. The next step was to classify the job ad text based on the skills and tasks present in the text; this means that a green company description does not count towards the job itself being green.

Our method was to match the job ad texts against an extensive set of 570 green skills from the ESCO framework. Because this framework was made by the EU, it is available in several languages, which is useful for job ads across different countries. The first version of our model used Elasticsearch to match word stems efficiently, at the cost of losing word order. We manually labelled a set of 200 job ads to evaluate our model, and after qualitative inspection we discovered that job ad texts were too complex for simple word matching. The second version used OpenAI's gpt-3.5-turbo in combination with a vector database of embedded green skills as a knowledge base to refer to when deciding whether a job ad text contained green skills, making the job ad green. This version performed significantly better and was great for demos, as it explained why a job ad was green and linked to the specific top three green skills found.

Our final version was a lightweight one, to save on costs, runtime, and carbon emissions, used to classify all ~50k job ads. It only used the encoder part of the LLM and computed the cosine similarity between the job ad text embedding and the green skill embeddings from the vector DB. If this similarity passed a threshold (optimized on our small test set), the job ad was considered green, and the green skill with the highest similarity was added as explanation/proof. Finally, we clustered the embeddings of all 570 (rather specific) green skills into more general groups to gain insight into the areas where companies or sectors may be lagging behind or on track to reach climate goals.
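
A sketch of the final, lightweight embedding-based classifier is given below; the embedding model, threshold, and example skills are assumptions for illustration rather than the exact production setup.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # multilingual, like the ESCO skills

green_skills = ["install solar panels", "perform energy audits"]  # in reality: the 570 ESCO green skills
skill_emb = model.encode(green_skills, normalize_embeddings=True)


def classify_green(job_ad_text: str, threshold: float = 0.45) -> tuple[bool, str]:
    """Return (is_green, best matching skill) based on cosine similarity to the green skills."""
    ad_emb = model.encode([job_ad_text], normalize_embeddings=True)
    sims = ad_emb @ skill_emb.T  # cosine similarity, since embeddings are normalized
    best = int(np.argmax(sims))
    return bool(sims[0, best] >= threshold), green_skills[best]


# print(classify_green("We are looking for a technician to install rooftop solar systems."))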



green jobs job ads natural language processing human resources llm large language model embeddings
Legacy/Mirror Reporting
Data Engineering, Finance
Because one of our client's systems reached its end of life, we implemented data pipelines to recreate the old style of reporting by mapping and transforming the data from the new system to the old system's data model.


For one of the largest third-party servicers of mortgages and consumer credit in the Netherlands, we implemented a mirror reporting system to ensure the continuity of their reporting during and after a system migration their clients were not yet ready for. Since the reports covered loans held by large financial institutions, they had to be accurate in order not to disrupt the banks' models after migration. Firstly, we mapped the data from the new system to the old system's data model, in direct collaboration with content experts at our client. Secondly, these mappings were implemented in SQL and SSIS packages so the data was transformed in the same way as in their regular reporting, reducing the impact of our solution and the need for new infrastructure. Thirdly, we extensively communicated and validated the results of our mapping with the stakeholders, and made adjustments where necessary.



ETL SSMS SSIS SQL Azure DevOps
Transparent Data Encryption
Database Administration, Cybersecurity
For a governmental client we implemented Transparent Data Encryption (TDE) to secure their data at rest against physical theft.


Regulations required the implementation of cybersecurity measures on several databases used by the client's applications. In two rounds I implemented the encryption in SSMS, first on the client's test environment and subsequently on their production environment. The implementation was done live with the client to ensure they understood the process and knew the passwords and certificates they would need to manage the encrypted databases themselves. The implementation also included validating the encryption on the failover databases.



Maintenance Data Quality & Imputation
Data Quality
For a company that produces, sells, and maintains boilers, we performed a detailed analysis of their data quality and validated and/or imputed missing data.


When a boiler needs maintenance, a mechanic is dispatched to its location with information on the specific boiler and the maintenance that needs to be performed. The company has a large database of boilers and maintenance records, but the quality of this data was unknown, and the company wanted to know whether the data was reliable and, if not, how to improve it. We performed a detailed analysis of the database's data quality and validated and/or imputed missing data. As the data mostly originated from manual input by the mechanics, small issues such as typos and sloppy input were the probable causes. We used a combination of error correction, statistical methods, and business rules to perform this validation and imputation. Besides the ad hoc improvements, we also advised the company on how to implement a more structured way of inputting data to prevent these issues altogether.
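
As an illustration of the error-correction step, the sketch below maps free-text model names to a reference list using fuzzy string matching (the project keywords mention Levenshtein distance; difflib's similarity ratio is used here as a stand-in). The reference list and cutoff are hypothetical.

from difflib import get_close_matches

KNOWN_MODELS = ["CV-Ketel 24kW", "CV-Ketel 35kW", "Combiketel 28kW"]  # hypothetical reference list


def correct_model_name(raw: str, cutoff: float = 0.8) -> str | None:
    """Map a mechanic-typed model name to the closest known model, or None if no good match."""
    matches = get_close_matches(raw.strip(), KNOWN_MODELS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# correct_model_name("CV Ketel 24 kW")  # likely maps to "CV-Ketel 24kW", depending on the cutoff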



boiler data quality imputation error correction levenshtein
Classifying Relevant Drug Targets
Artificial Intelligence, Natural Language Processing, Information Extraction, Life & Health Science
In my MSc thesis I researched the use of contextual embeddings to classify drug targets in chemical patents as relevant or not with respect to the invention of the patent.


In commercial research and development projects, new chemical entities are often disclosed in patent applications, primarily providing information on drug compounds and their targets. However, only a small percentage of these interactions are found in scientific literature within a year, with an average delay of 1 to 3 years before being published. Extracting information about drug targets from patents is challenging due to the obscure language and mentions of unrelated genes and proteins. Manual annotation is currently required for this process, making it time-consuming and expensive. To address the growing volume of biomedical and chemistry-related publications, automatic approaches, including NLP and text-mining, have become popular. Relevancy scoring aims to predict the importance of an entity to a document based on signals from its context and domain knowledge. The automatic extraction of information from chemical patents faces challenges, including the absence of a suitable dataset for training deep learning models. Patents' textual complexity and obfuscating language contribute to the limited interest in this field compared to scientific publications. This work focused on predicting the relevancy of drug targets in patents, defining relevant entities as genes or proteins targeted by at least one drug or chemical compound mentioned in the patent.
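
A minimal sketch of the contextual-embedding approach is shown below: a biomedical transformer encodes the sentence around a gene/protein mention, after which a classification head (a linear layer or LSTM, as in the keywords) would predict relevancy. The model choice and mean pooling are illustrative assumptions, not the exact thesis setup.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # assumed biomedical encoder, not necessarily the one used
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)


def embed_context(sentence: str) -> torch.Tensor:
    """Mean-pool the contextual token embeddings of the sentence containing a target mention."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # (num_tokens, hidden_size)
    return hidden.mean(dim=0)


# A classification head trained on labelled patents would map this embedding to
# relevant / not relevant; the training loop is omitted here.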



drug targets information extraction chemical patents embeddings transformer lstm