The amount of data in the world, the forms these data take, and the ways to interact with them have all grown rapidly in recent years. The extraction of useful knowledge from data has long been one of the grand challenges of computer science, and the dawn of "big data" has transformed the landscape of data storage, manipulation, and analysis. In this module, we will look at the tools used to store and interact with data.
The objective of this class is for students to gain:
- First-hand experience with and detailed knowledge of computing models, notably cloud computing
- An understanding of distributed programming models and data distribution
- Broad knowledge of many databases and their respective strengths
As part of the Data and Decision Sciences Master's program, this module specifically aims to provide the toolset students will use for data analysis and knowledge extraction, applying skills acquired in the Algorithms of Machine Learning (AML) and Digital Economy and Data Uses classes.
The class is structured in three parts:
- 20 hours on the computing platforms used in the data ecosystem. We will briefly cover cluster computing and then go in depth on cloud computing, using Google Cloud Platform as an example. Finally, a class on GPU computing will be given in coordination with the deep learning section of the AML class.
- 20 hours on the distribution of data, with a focus on distributed programming models. We will introduce functional programming and MapReduce, then use these concepts in a practical session on Spark. Finally, students will do a graded exercise with Dask.
- 10 hours on state-of-the-art databases. Students will install different databases and demonstrate their advantages to their peers as a graded project.
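To give a feel for how functional programming and MapReduce fit together in the second part, here is a minimal sketch in plain Python. It is an illustration only, not course material: the function names `map_phase`, `reduce_phase`, and `word_count` and the sample text are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def map_phase(chunk: str) -> Counter:
    """Map step: count the words inside one chunk of text."""
    return Counter(chunk.lower().split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(chunks: Iterable[str]) -> Counter:
    """Apply the map step to every chunk, then fold the partial results."""
    partial_counts = map(map_phase, chunks)
    return reduce(reduce_phase, partial_counts, Counter())


# Hypothetical input: each string stands in for a block of a distributed file.
chunks = ["big data big tools", "data tools for data"]
totals = word_count(chunks)
print(totals.most_common())
```

Because `map_phase` has no side effects and `reduce_phase` is associative, the chunks could in principle be processed on different machines and merged in any grouping, which is the property that frameworks such as Spark and Dask rely on.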
Class dates are subject to change. Please refer to Hyperplanning for detailed scheduling.

Data Computation | Duration | Date | |
---|---|---|---|
Introduction to data computation | 2h | 29/09/2020 | Global Datasphere |
Cluster Computing | 2h | 07/10/2020 | SLURM |
Cloud Computing | 2h | 07/10/2020 | Cloud computing |
Virtualisation & Containerisation | 4h | 14/10/2020 | |
Orchestration | 4h | 20/10/2020 | |
GPU computing, part 1 | 3h | 01/12/2020 | |
GPU computing, part 2 | 3h | 02/12/2020 | |

Data Distribution | Duration | Date | |
---|---|---|---|
Data distribution | 1h | 06/01/2021 | Spanner |
Functional programming | 4h | 06/01/2021 | Julia |
MapReduce and HDFS | 3h | 19/01/2021 | MapReduce |
Spark | 3h | 19/01/2021 | Spark |
PySpark | 3h | 20/01/2021 | PySpark |
Dask project | 6h | 27/01/2021 | Dask |

Databases | Duration | Date | |
---|---|---|---|
Databases overview | 2h | 03/02/2021 | |
PostgreSQL practical session | 3h | 08/02/2021 | PostgreSQL |
Project overview | 2h | 15/02/2021 | |
Project presentations | 2h | 08/03/2021 | |