Datawrangling example with Python and Dockerized Postgres

LarsV123/datawrangling

Datawrangling example

This repository is a toy example done for educational purposes, in order to practice using Python, Docker and SQL.

My goals for the project were:

  • Set up a local Postgres database with Docker
  • Clean and insert a large dataset into the database
  • Query the database to solve specific tasks
  • Do all of the above with clean code, reproducible steps and a command-line utility (CLI)

Setup

Below are the steps required to run this project. Prerequisites:

  • Python (v3.9.6 used during testing)
  • Docker

Virtual environment

Using a virtual environment (venv) is recommended, but not necessary:

  1. py -m venv venv
  2. Activate the venv in your IDE. Run pip list to check that you are in the correct venv; it should only have pip and setuptools installed by default.
  3. pip install -r requirements.txt
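
As a complement to checking pip list, the following is a minimal sketch of how to confirm from Python itself that a virtual environment is active. The helper name in_virtualenv is made up for illustration and is not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("venv active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

If this reports that no venv is active, the pip install step would install the requirements into the base Python installation instead.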

Dataset

  1. Get a copy of the raw dataset:

  2. Extract dataset.zip, for example to /Data

Environment variables

Make a copy of the .env-template file and rename it to .env, then:

  1. Supply it with credentials. These will be used both when setting up the database and when accessing it.
  2. Specify where you placed the dataset: DATASET_PATH should point to the parent directory of /000, /001 etc.
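
To sanity-check the .env file before starting anything, a sketch along the following lines may help. It assumes the file uses plain KEY=VALUE lines; the helper names read_env_file and dataset_subdirs are hypothetical, and only DATASET_PATH is a name taken from this README. The project itself may load its settings differently.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def dataset_subdirs(dataset_path):
    """Return the sorted names of numbered subdirectories (000, 001, ...)."""
    if not dataset_path:
        return []
    root = Path(dataset_path)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir() and p.name.isdigit())


env_file = Path(".env")
if env_file.exists():
    settings = read_env_file(env_file)
    found = dataset_subdirs(settings.get("DATASET_PATH"))
    print(f"DATASET_PATH contains {len(found)} numbered directories")
```

An empty result suggests that DATASET_PATH points one level too high or too low relative to /000, /001 etc.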

Running queries

Before running queries, start the database container with docker-compose up. The data insertion and querying can then be done by calling main.py from a separate terminal. Run py main.py --help for more detailed instructions.
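
For readers unfamiliar with command-line utilities in Python, here is a rough sketch of how a CLI with separate insertion and query steps could be structured using argparse. The subcommand names insert and query and the --task option are assumptions made for illustration; the actual interface of main.py is whatever its --help output describes.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Insert the dataset into Postgres and run queries against it.",
    )
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Hypothetical subcommand: clean the raw dataset and load it into the database.
    subcommands.add_parser("insert", help="clean and insert the dataset")

    # Hypothetical subcommand: run one of the numbered query tasks.
    query = subcommands.add_parser("query", help="run a query task")
    query.add_argument("--task", type=int, required=True, help="task number to run")
    return parser


# Parse an explicit argument list here; a real script would parse sys.argv instead.
args = build_parser().parse_args(["query", "--task", "3"])
print(args.command, args.task)
```

Splitting the work into subcommands keeps the slow insertion step separate from the queries, so the data only has to be loaded once.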

Code style

The project is set up to use Black for automatic formatting. Either set up your IDE to run it automatically, or run Black manually from the command line, for example with black . from the repository root.
