This repository is a toy example created for educational purposes, to practice using Python, Docker, and SQL.
My goals for the project were:
- Set up a local Postgres database with Docker
- Clean and insert a large dataset into the database
- Query the database to solve specific tasks
- Do all of the above with clean code, reproducible steps and a command-line utility (CLI)
Below are the steps required to run this project. Prerequisites:
- Python (v3.9.6 used during testing)
- Docker
Using a virtual environment (venv) is recommended, but not necessary:

- Create the venv:

  ```
  py -m venv venv
  ```

- Activate the venv in your IDE or terminal, then run `pip list` to check that you are in the correct venv; by default it should only have `pip` and `setuptools` installed.
- Install the dependencies:

  ```
  pip install -r requirements.txt
  ```
Get a copy of the raw dataset:

- Download the data from https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/
- Extract `dataset.zip`, for example to `/Data`
- Place the `Data` folder in the project root folder, or point the environment variable to its location (see the layout sketch below)
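For orientation, the extracted dataset consists of numbered per-user folders, each containing a `Trajectory` directory of `.plt` files. The user IDs and filenames below are illustrative:

```
Data/
├── 000/
│   └── Trajectory/
│       ├── 20081023025304.plt
│       └── ...
├── 001/
│   └── Trajectory/
│       └── ...
└── ...
```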
Make a copy of the `.env-template` file and rename it `.env`:

- Supply it with credentials. These will be used both when setting up the database and when accessing it.
- Specify where you placed the dataset: `DATASET_PATH` should point to the parent directory of `/000`, `/001`, etc. (see the example `.env` below).
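As a sketch, a filled-in `.env` might look like the following. Apart from `DATASET_PATH`, the key names here are assumptions; the authoritative list is in `.env-template`:

```
# Database credentials (key names are illustrative; check .env-template)
POSTGRES_USER=postgres
POSTGRES_PASSWORD=changeme
POSTGRES_DB=geolife

# Parent directory of the per-user folders /000, /001, ...
DATASET_PATH=./Data
```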
Before running queries, start the database container with `docker-compose up`.
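For reference, a minimal `docker-compose.yml` for this kind of setup might look as follows. This is a sketch, not the project's actual file, and it assumes the credentials in `.env` use the standard Postgres variable names:

```yaml
version: "3.8"

services:
  db:
    image: postgres:14   # any recent Postgres image should work
    env_file: .env       # assumes POSTGRES_USER/PASSWORD/DB are defined there
    ports:
      - "5432:5432"      # expose Postgres on its default port
    volumes:
      - pgdata:/var/lib/postgresql/data   # persist data across container restarts

volumes:
  pgdata:
```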
The data insertion and querying can then be done by calling `main.py` from a separate terminal. Run `py main.py --help` for more detailed instructions.
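The authoritative list of commands comes from `--help`. As a minimal sketch of how such a CLI can be structured with `argparse`, with hypothetical `insert` and `query` subcommands standing in for the real ones:

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="GeoLife database utility")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical subcommand: clean the raw data and insert it into Postgres
    subparsers.add_parser("insert", help="clean and insert the dataset")

    # Hypothetical subcommand: run one of the predefined queries
    query = subparsers.add_parser("query", help="run a predefined query")
    query.add_argument("number", type=int, help="which query to run")

    args = parser.parse_args()
    if args.command == "insert":
        ...  # call the insertion routine
    elif args.command == "query":
        ...  # dispatch to the requested query

if __name__ == "__main__":
    main()
```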
The project is set up to use Black for automatic formatting. Either set up your IDE to run it automatically, or run Black manually with `black .`.