
arsenetripard/my_data_science_projects


My data science projects

Computer Vision: extract satellite images of airports and identify road defects

The objective was to identify road defects on airport runways. As our coaches at Colas explained, road defects come in several types, such as faïençage (alligator cracking) and fissures (cracks). I was in charge of the data engineering, which consisted of requesting the orthophotos and restricting them to the airport areas. This involved a fair amount of geospatial reasoning, and I became familiar with GetMap requests. We then applied one of HuggingFace's transformer models for computer vision tasks.
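As an illustration of the orthophoto requests, here is a minimal sketch of how a GetMap URL can be assembled, assuming "GetMap" refers to the OGC WMS 1.3.0 operation. The endpoint, layer name and bounding box below are hypothetical placeholders, not the ones used in the project, and `build_getmap_url` is a helper name introduced only for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_getmap_url(base_url, layer, bbox, width, height,
                     crs="CRS:84", image_format="image/png"):
    """Build a WMS 1.3.0 GetMap URL.

    bbox is (min_x, min_y, max_x, max_y). With CRS:84 the axis order is
    longitude, latitude (EPSG:4326 in WMS 1.3.0 would be latitude first).
    """
    min_x, min_y, max_x, max_y = bbox
    if not (min_x < max_x and min_y < max_y):
        raise ValueError("bbox must be (min_x, min_y, max_x, max_y)")
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value asks the server for its default style
        "CRS": crs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, layer and bounding box, for illustration only.
url = build_getmap_url(
    "https://example.com/wms",
    "ORTHOPHOTOS",
    (2.50, 48.98, 2.60, 49.03),
    width=1024,
    height=512,
)
query = parse_qs(urlsplit(url).query)
```

The resulting URL can then be fetched with any HTTP client. Keeping the width/height ratio equal to the bounding box ratio avoids stretched images.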

NLP: named-entity recognition (NER) in hospital admission notes

Starting from a dataset of approximately 500 hospital admission notes, we had to label the data and build a NER (also called token-classification) model. The goal was to augment the dataset and train a model to extract as many posology indicators as possible: the drug, the form, the dosage, the duration, and so on. We relied on HuggingFace pre-trained language models such as BertForTokenClassification and Camembert. I was in charge of building the pipeline and training the models.
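To make the labeling step concrete, here is a small sketch of how character-level annotations can be turned into per-token BIO tags, the usual input format for token classification. The sentence, the whitespace tokenization, the label names (DRUG, DOSAGE, FORM, DURATION) and the helper names are assumptions made for this example. The actual project relied on the models' own tokenizers.

```python
import re


def span_of(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def bio_tags(text, spans):
    """Tag whitespace-separated tokens with B-/I-/O labels.

    spans is a list of (start, end, label) character offsets, end exclusive.
    A token is tagged only if it lies entirely inside a span.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Made-up example sentence, not taken from the dataset.
note = "paracetamol 500 mg tablet for 5 days"
annotations = [
    span_of(note, "paracetamol", "DRUG"),
    span_of(note, "500 mg", "DOSAGE"),
    span_of(note, "tablet", "FORM"),
    span_of(note, "5 days", "DURATION"),
]
tokens, tags = bio_tags(note, annotations)
```

With a subword tokenizer, the same idea applies, except that each word-level tag has to be propagated to (or masked on) the sub-tokens of that word.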

QuantumBlack challenge

QuantumBlack's challenge focused on two computer vision tasks: image classification, then segmentation. We also built a Streamlit app that wraps our CV model and helps visualize our business recommendations. I was in charge of building the Streamlit app and coming up with the business recommendations.

Optimizing IT infrastructure

This was a time-series project. For each server's CPU and memory, we had 500 data points corresponding to 500 days. Each data point actually consists of three values: max_value, avg_value and min_value. The objective was to optimize the CPU and memory settings (number of cores, memory size) to avoid both saturation and idleness. I was in charge of data processing, in particular analysing and validating periodicity within the time series, so that we could feed only relevant signals to our Prophet time-series model.
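One common way to check periodicity is to look at the autocorrelation of a series at candidate lags. The sketch below illustrates that idea on a synthetic 500-day signal with a weekly cycle. It is an assumption about the approach, not the exact method used in the project, and the function names and the synthetic data are invented for the example.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    if variance == 0:
        return 0.0  # a constant series carries no periodic signal
    covariance = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return covariance / variance


def dominant_lag(series, min_lag=2, max_lag=30):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(series, lag))


# Synthetic avg_value-like series: 500 days with a 7-day cycle.
days = 500
cpu_avg = [50 + 20 * math.sin(2 * math.pi * t / 7) for t in range(days)]

weekly_strength = autocorrelation(cpu_avg, 7)
best_lag = dominant_lag(cpu_avg)
```

A series whose best autocorrelation stays below some chosen threshold could be treated as having no reliable seasonality and handled separately before modelling.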

Finding connected components in a graph

We implemented the CCF-iterate algorithm (https://www.cse.unr.edu/~hkardes/pdfs/ccf.pdf). We used Scala and Python, wrote programs using either RDDs (Resilient Distributed Datasets) or Spark DataFrames, and tested both approaches on increasingly large graphs. I was in charge of writing the Python code.
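The logic of the iteration can be sketched in plain Python, without Spark, by mimicking the map, reduce and dedup steps described in the paper. This is a single-machine illustration under that assumption, not the RDD or DataFrame code from the project, and the function names are chosen for the example.

```python
from collections import defaultdict


def ccf_iterate(pairs):
    """One CCF-Iterate pass followed by dedup.

    Returns the new set of (node, candidate_component) pairs and the
    number of newly propagated pairs (0 means convergence).
    """
    # Map step: emit each pair in both directions, grouped by key.
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Reduce step: attach the key and its other neighbours to the minimum.
    output, new_pairs = set(), 0
    for key, values in neighbours.items():
        smallest = min(values)
        if smallest < key:
            output.add((key, smallest))
            for value in values:
                if value != smallest:
                    new_pairs += 1
                    output.add((value, smallest))
    # Using a set plays the role of the dedup job.
    return output, new_pairs


def connected_components(edges):
    """Map every node to the smallest node id of its component."""
    pairs = set(edges)
    while True:
        pairs, new_pairs = ccf_iterate(pairs)
        if new_pairs == 0:
            break
    labels = dict(pairs)
    # The smallest node of each component never appears as a key; add it.
    for component in set(labels.values()):
        labels.setdefault(component, component)
    return labels


components = connected_components([(1, 2), (2, 3), (4, 5)])
```

In Spark, the grouping by key corresponds to a shuffle, which is why the number of iterations matters on large graphs.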
