Distributed Data Processing

Note: please view the notebook or HTML file rather than the PDF file. The PDF file does not include the visualisations.

Getting started

Note: Please follow the installation docs for the GDV course first.

  1. Clone this repository to your computer using git.

    git clone https://github.com/bryanvanhuyneghem/Distributed-Data-Processing.git
  2. Download your assigned datasets into the folder of the cloned repository.

  3. Add all the dataset files to the .gitignore file so that they do not get added to the git repository. For more information on gitignore files, see the git docs.

  4. Open project.code-workspace using Visual Studio Code.

    Note: If you're working on Windows, make sure that your Docker instance is running.

  5. Click on the "Remote Explorer" tab in the left sidebar.

    1. Click on the button next to CONTAINERS,
    2. choose "Open Current Folder in Container",
    3. choose "Python 3 - Anaconda". This will create a container to develop in.
  6. Wait until the container is set up. This can take a few minutes because the container image needs to be pulled and built. You can check the progress by clicking "Starting Dev Container (show log)" in the notification at the bottom right of Visual Studio Code.

  7. When the container is set up, open lab1-project.ipynb and start coding!
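
Steps 3 and 4 can be done from a terminal. The following is a minimal sketch, not part of the course instructions: it assumes you run it from the root of the cloned repository and that the datasets were saved in a folder called data/, which is a hypothetical name. Replace it with the folder or file names you actually used.

```shell
# Run from the root of the cloned repository.
# "data/" is a hypothetical folder name (an assumption); use the path
# where you actually saved your datasets.
DATASET_DIR="data/"

# Create .gitignore if it does not exist yet, then append the entry
# only if that exact line is not already present.
touch .gitignore
grep -qxF "$DATASET_DIR" .gitignore || echo "$DATASET_DIR" >> .gitignore

# Optional check for the Windows note in step 4: "docker info" fails
# when the Docker daemon cannot be reached.
if command -v docker >/dev/null 2>&1; then
    if docker info >/dev/null 2>&1; then
        echo "Docker is running"
    else
        echo "Docker is installed but not running"
    fi
fi
```

To confirm that git ignores a dataset file, you can run `git check-ignore -v` followed by the path to that file. It prints the .gitignore rule that matches, or nothing if the file is not ignored.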