Project Portfolio

Innovative solutions I helped my clients achieve

Tracking Technology Adoption Around the Globe

We created a real-time connected database (Knowledge Graph) to track the adoption of technologies by top companies. The information in this database was extracted fully automatically from unstructured text data.

Turning raw text into a living network allowed an unprecedented level of big-picture analysis on the technology landscape.

Knowledge Graphs | Information Extraction | Deep Syntax Analysis

The aim of the project was to extract relationships between companies and technologies from transcripts of earnings calls. The three major technical challenges were the high level of noise (the vast majority of technology mentions in the data are meaningless), the lack of labeled data, and a requirement for interpretability and robustness.

The solution we found was to use an OpenIE model based on dependency parsing to extract assertions and only keep those which expressed an explicit relationship between the company and the technology. Each assertion became an edge between two nodes in the Knowledge Graph.
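As a minimal illustration of this filtering step, the sketch below keeps only triples that explicitly link a known company to a known technology and turns each kept triple into a graph edge. The triples, entity lists, and names are invented for the example; the actual project extracted assertions with an OpenIE model built on dependency parsing.

```python
# Toy filtering step: keep only (subject, relation, object) assertions
# that explicitly link a known company to a known technology, then turn
# each kept assertion into a Knowledge Graph edge. All data is made up.
COMPANIES = {"Acme Corp", "Globex"}
TECHNOLOGIES = {"machine learning", "blockchain"}

def filter_assertions(triples):
    """Drop noisy mentions; keep explicit company->technology relations."""
    return [
        (subj, rel, obj)
        for subj, rel, obj in triples
        if subj in COMPANIES and obj in TECHNOLOGIES
    ]

def to_edges(triples):
    """Each kept assertion becomes a labeled edge between two nodes."""
    return [{"source": s, "target": o, "label": r} for s, r, o in triples]

triples = [
    ("Acme Corp", "invests in", "machine learning"),  # explicit: kept
    ("the weather", "was", "sunny"),                  # noise: dropped
    ("Globex", "abandoned", "blockchain"),            # explicit: kept
]
edges = to_edges(filter_assertions(triples))
```

In the real pipeline, the subjects and objects come from the parsed assertion itself rather than from fixed lists, but the keep-or-drop logic is the same idea.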

Then, a classification model (based on the TARS model, trained using transfer learning with weak supervision) classified each edge, thus allowing for various types of edges between the nodes.
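To give a flavor of the weak-supervision side, here is a toy sketch of keyword-based labeling functions that assign a noisy type label to each edge. The keywords, labels, and function names are invented for the example; in the project, this kind of weak signal served as training data for the TARS-based classifier rather than as the final classifier.

```python
# Toy weak supervision: each labeling function either votes for an
# edge type or abstains (returns None). The noisy labels produced this
# way can be used as training signal for a proper classifier.
def lf_adoption(relation):
    return "ADOPTS" if any(w in relation for w in ("invests", "deploys", "uses")) else None

def lf_abandonment(relation):
    return "ABANDONS" if any(w in relation for w in ("abandoned", "dropped")) else None

LABELING_FUNCTIONS = [lf_adoption, lf_abandonment]

def weak_label(relation):
    """Return the first non-abstaining vote, or None if all abstain."""
    for lf in LABELING_FUNCTIONS:
        label = lf(relation)
        if label is not None:
            return label
    return None

label = weak_label("invests heavily in")  # -> "ADOPTS"
```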

Tools and Libraries: Flair | spaCy | Vis.js

Detecting and Predicting Logistics Incidents

The client needed an incident detection system with a 0% false-negative rate, meaning it knew when to call upon a human for any ambiguous case. We achieved this thanks to a rigorous data processing, data quality control, and monitoring pipeline.
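The "call upon a human" behavior is a form of selective prediction: the system only commits to an answer when its confidence clears a threshold, and escalates everything else. A minimal sketch of that decision rule, with an illustrative threshold (not the project's actual value):

```python
# Selective prediction sketch: answer only when confident, otherwise
# escalate to a human reviewer so no incident can slip through as a
# silent false negative. The threshold is illustrative.
THRESHOLD = 0.9

def detect_incident(confidence_incident):
    """Return 'incident', 'no_incident', or 'escalate_to_human'."""
    if confidence_incident >= THRESHOLD:
        return "incident"
    if confidence_incident <= 1 - THRESHOLD:
        return "no_incident"
    return "escalate_to_human"  # ambiguous: never risk a false negative
```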

We also used the client's historical data to build predictive models and anticipate incidents before they happen, allowing the client to gain a competitive edge by having predictions tailored to their sector.

Data Quality Control | Database Design | Machine Learning | Data Engineering

To build a truly reliable system, we needed to ensure the quality of the incoming data by developing a robust processing pipeline for the data provided by the suppliers. This processing chain was based on:

  • A data schema developed for the project (6 new tables were added to the database).
  • A Python library providing an abstraction layer to interact with the raw data.
  • 5 atomic workers developed in Python to implement the different processing steps. This separation of tasks was designed to create loose coupling between the workers and maximize desirable properties such as involutivity and statelessness.
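To show what designing a worker around such algebraic guarantees looks like, here is a toy worker written as a pure, stateless function. The property demonstrated is idempotence (running the worker twice changes nothing), a close cousin of the involutivity mentioned above; the record fields and names are invented for the example.

```python
# Toy atomic worker: a pure, stateless function. Output depends only
# on its input (statelessness), and applying it twice gives the same
# result as applying it once (idempotence), so reruns are always safe.
def normalize_record(record):
    """Worker step: trim whitespace, lowercase supplier codes,
    and coerce quantities to integers."""
    return {
        "supplier": record["supplier"].strip().lower(),
        "quantity": int(record["quantity"]),
    }

raw = {"supplier": "  ACME ", "quantity": "42"}
once = normalize_record(raw)
twice = normalize_record(once)  # identical to `once`
```

Making every worker safe to rerun this way is what lets the workers stay loosely coupled: any step can be replayed after a failure without coordinating with the others.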

Additionally, we built robust, interpretable predictive models with automatic monitoring and retraining as new data comes in.
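The monitoring-and-retraining loop can be sketched as a simple drift check: compare the model's rolling performance on fresh labeled data against the accuracy measured at deployment, and trigger retraining when it degrades. The baseline and threshold below are illustrative, not the project's values.

```python
# Monitoring sketch: flag the model for retraining when its recent
# accuracy drops more than MAX_DROP below the deployment baseline.
BASELINE_ACCURACY = 0.92
MAX_DROP = 0.05

def should_retrain(recent_predictions, recent_labels):
    """Compare rolling accuracy on fresh data against the baseline."""
    correct = sum(p == y for p, y in zip(recent_predictions, recent_labels))
    accuracy = correct / len(recent_labels)
    return accuracy < BASELINE_ACCURACY - MAX_DROP
```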

Tools and Libraries: Python | MySQL | Docker | FastAPI | SQLAlchemy | Scikit-Learn | Pandas | Numpy

Optimizing Neural Networks for Large-scale Inference

Using state-of-the-art neural network optimization methods, optimizing data flow and minimizing idle time, we managed to divide the cost of inference by 6 for large NLP models.

We conducted an extensive study exploring a broad spectrum of techniques and providing guidance on how to optimize future models at the company.

MLOps | NLP | Deep Learning | Transformers

Techniques evaluated for this project include quantization, distillation, pruning, graph optimization, and several backends such as ONNX and TensorRT.
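To make one of these concrete, here is a toy illustration of post-training quantization: float32 weights are mapped to int8 with a scale factor, shrinking storage 4x at the cost of a small rounding error. Real deployments would use the quantization tooling in PyTorch, ONNX, or TensorRT rather than this hand-rolled version.

```python
# Toy symmetric int8 quantization: w ≈ scale * q, with q in [-127, 127].
def quantize_int8(weights):
    """Map float weights to int8 values plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.03]
q, scale = quantize_int8(weights)   # q fits in int8, 4x smaller than float32
restored = dequantize(q, scale)     # close to the original weights
```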

Tools and Libraries: PyTorch | ONNX | TensorRT | Docker

Monitoring Tree Growth From Space

We built a prototype to remotely measure the growth of trees for a forestry company based in Uruguay. Using freely available radar satellite images, the company could go from measuring tree size by hand once every few years to having an estimate every 6 days.

This project was done in close partnership with a Computer Vision researcher from the Borelli Research Center.

Remote Sensing | Computer Vision | Time Series

The main technical challenge was that, past a certain age, trees become too dense for their volume to be accurately measured using radar signals (a problem known as backscatter saturation). Our goal was to quantify at what age this saturation would be reached depending on the radar frequency used (C-band or L-band).

By combining ground-truth measurements provided by the company, Sentinel-1 satellite images, daily weather data for the area, and the scientific literature on tree biology, we developed a method comprising a normalization scheme and a three-scale analysis to assess the presence of backscatter saturation at various tree ages. We estimated the saturation age at 2.5 years for the C-band and 6.2 years for the L-band.
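A schematic version of the saturation-age idea: given (age, backscatter) pairs, find the age past which backscatter stops responding to growth, i.e. where the local slope falls below a threshold. The numbers below are synthetic and the threshold is illustrative; the project's actual method also involved normalization and a three-scale analysis.

```python
# Schematic saturation detection: scan consecutive (age, backscatter)
# points and report the last age before the curve goes flat.
def saturation_age(ages, backscatter, slope_threshold=0.1):
    """Return the age at which backscatter saturates, or None."""
    for i in range(1, len(ages)):
        slope = (backscatter[i] - backscatter[i - 1]) / (ages[i] - ages[i - 1])
        if slope < slope_threshold:
            return ages[i - 1]
    return None  # no saturation observed in the data

ages = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]        # tree age in years (synthetic)
sigma = [-12.0, -11.0, -10.2, -9.6, -9.2, -9.18]  # backscatter in dB (synthetic)
```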

Tools and Libraries: Rasterio | GDAL | Scikit-learn

Simulating a Social Network

We built a system to simulate user messages on a social network. The text content of the posts was machine-generated using a novel algorithm allowing for unprecedented control over the nature, theme, and emotion of the text.

This system powered an innovative immersive experience, which was used in a crisis management course at a top French university.

Controllable Text Generation | BERT | Deep Learning

The project required generating text in French, which was not possible with existing models. Pretrained causal language models (such as GPT-2), which allow for text generation, did not exist in French at the time, and the training budget for such a model was beyond our reach. Furthermore, the generation process needed to be highly controllable (in terms of the emotion conveyed or the subjects mentioned), produce high-quality, realistic sentences, and achieve high diversity (no two generated sentences should be alike).

To this end, I reproduced a method from the literature for generating text with masked language models instead of causal ones. The method suffered from a number of shortcomings and was not controllable. I proposed, implemented, and tested 6 major improvements to the method, which allowed for much higher-quality outputs as well as high controllability.
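The core loop of generating with a masked language model can be sketched very simply: start from an all-[MASK] sequence and iteratively replace one mask at a time with the model's prediction. Here the model is a stub lookup so the sketch stays self-contained; the real project used a French BERT-style model plus constraints on emotion and topic.

```python
# Highly simplified masked-LM generation: fill an all-[MASK] sequence
# one position at a time, in random order. stub_predict stands in for
# a real masked-language-model forward pass.
import random

def stub_predict(tokens, position):
    """Stand-in for a masked LM: pick a token for the mask at
    `position` given the current (partially filled) sequence."""
    vocabulary = ["the", "network", "is", "down", "again"]
    return vocabulary[position % len(vocabulary)]

def generate(length, rng=None):
    tokens = ["[MASK]"] * length
    positions = list(range(length))
    (rng or random).shuffle(positions)  # fill masks in random order
    for pos in positions:
        tokens[pos] = stub_predict(tokens, pos)
    return " ".join(tokens)
```

With a real model, each fill-in conditions on the tokens already placed, which is what makes the iterative (rather than one-shot) filling matter.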

Tools and Libraries: PyTorch | Transformers | spaCy