Datacamp class for master's students - 5 days
The aim of this course is to learn data science by doing. All aspects of completing a data science pipeline will be covered, from exploratory data analysis (EDA), feature engineering, and parameter optimization to advanced learning algorithms. You will also need to set up your own challenge!
Your grade is a mix of your performance on the data challenge offered to the class and on the challenge you set up.
Each day is split evenly between lectures and work on the competitive challenge using the RAMP website.
The slides used in some of the lectures are available here.
- Alexandre Gramfort (alexandre.gramfort@inria.fr)
- Thomas Moreau (thomas.moreau@inria.fr)
- Pedro L. C. Rodrigues (pedro.rodrigues@inria.fr)
The course takes place in person during the week of Dec 18 to Dec 22.
To join the Discord channel use this URL.
On GitHub you have some of the teaching materials at: https://github.com/x-datascience-datacamp
You must have a GitHub account to complete the course.
We will be using many Python packages in this course, such as pandas, sklearn, and matplotlib, and they can all be downloaded and installed using a package-management system. We recommend that you use mamba, but you will be fine if you already have conda installed on your computer.
NB: Windows users should closely follow the instructions for installing mamba and conda, since many common problems come from not having properly set up the PATH variable for the system.
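Once the environment is installed, a quick way to confirm that the packages are visible to Python is a check like the one below. This is a minimal sketch and not part of the official course material; the helper name `check_packages` is hypothetical, and the module list simply mirrors the packages mentioned above (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it is importable."""
    return {name: find_spec(name) is not None for name in names}


# Import names of the packages mentioned above (assumed list, adjust as needed).
status = check_packages(["pandas", "sklearn", "matplotlib"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing even though it was installed, the interpreter being run is probably not the one from the mamba or conda environment, which is often a PATH issue.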
- Introduction to the workflow (VSCode, git, GitHub, tests, ...)
- Advanced course on Pandas
- GitHub assignments: numpy and pandas
- Advanced scikit-learn: Column transformer and pipelines
- Parallel processing with joblib
- Generalization and Cross Validation
- Assignment: sklearn
- Getting started on RAMP & Introduction to the challenges.
- Presentation of the different ML metrics
- Problem of the metric with imbalanced data
- ML approaches to deal with imbalanced data
- Working on data challenges
- Feature engineering and advanced encoding of categorical features
- Model inspection: Partial dependence plots, Feature importance
- Working on data challenges
- From trees to gradient boosting
- Profiling with snakeviz
- Hyperparameter optimization
- Working on data challenges
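As an illustration of the "problem of the metric with imbalanced data" item in the program above, the sketch below compares plain accuracy with balanced accuracy for a trivial predictor that always outputs the majority class. It is not taken from the course material: the toy labels are made up, and the two helper functions are written in plain Python for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        # Predictions made on the samples whose true class is `label`
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(sum(p == label for p in preds) / len(preds))
    return sum(recalls) / len(recalls)


# Toy data (made up): 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Accuracy looks high even though no positive sample is ever detected, while balanced accuracy drops to chance level. In practice you would likely use the scikit-learn implementation, `sklearn.metrics.balanced_accuracy_score`, rather than a hand-written version.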
This class is taught in the context of the Master Data Science at Institut Polytechnique de Paris.
It receives support from Hi!Paris and DataIA.