
Data Engineering: COVID19 Comparison Dashboard

A COVID-19 location comparison dashboard pulling data from an automated data pipeline.

[dashboard screenshot]

Table of Contents
  1. About The Project
  2. Data Pipeline Architecture
  3. Prerequisites
  4. Running Project
  5. Inspiration

About The Project

The main purpose of this data engineering project is to create an end-to-end automated data pipeline that lets the end user easily query three COVID-19 fact metrics (confirmed, deaths, and recovered) across a number of dimensions such as location and time.

To showcase my work, I have also created a live COVID-19 dashboard that allows regional comparisons across the aforementioned metrics; you can access it here.

Notable dashboard features:

  • Allows comparisons across the same location level, e.g. Beijing, China vs. Montana, United States
  • Offers both per capita and absolute numbers (see the sketch after this list)
  • Shows the top 5 most affected locations
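
The per capita numbers are what make differently sized locations comparable. A minimal sketch of that normalization, assuming a table of confirmed counts and populations; the values and column names here are illustrative, not the project's actual schema:

import pandas as pd

# Illustrative data only -- the real values come from the BI tables in RDS.
df = pd.DataFrame({
    "location": ["Montana, United States", "Beijing, China"],
    "confirmed": [120_000, 1_100],
    "population": [1_068_778, 21_540_000],
})

# Cases per 100,000 residents, so locations of very different sizes can be compared.
df["confirmed_per_100k"] = df["confirmed"] / df["population"] * 100_000
print(df)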

Built With

Data Pipeline Architecture

[data pipeline flow chart]

Pipeline steps

  1. Johns Hopkins data -> S3 using Python pandas
    1. The Johns Hopkins repo organizes its time series files by metric (confirmed, deaths, recovered) and by location (global and US).
    2. Each of these five files is cleaned using pandas and then staged to S3 using the Python s3fs library (see the sketch after this list).
  2. SQL scripts that:
    1. extract the data from S3 to RDS.
    2. transform the data into a star schema.
      1. The fact table contains the 3 metrics (confirmed, deaths, recovered).
      2. The dimension tables are location, date, and coordinates.
    3. From the fact tables, create the final BI tables (county, state, and country) that the end application will query.
  3. The end user can now access the updated data through the Dash app, deployed using Heroku and AWS Route 53.
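
As a rough illustration of step 1, here is a minimal sketch of cleaning one Johns Hopkins global time-series file and staging it to S3. The reshaping and the bucket name are assumptions for illustration, not the project's actual code; pandas writes to s3:// paths via s3fs, provided AWS credentials are configured.

import pandas as pd

# Johns Hopkins CSSE global time-series files, one per metric.
JHU_BASE = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
)

def stage_metric(metric: str, bucket: str = "my-covid-staging") -> None:
    """Read one global time-series file, reshape it, and stage it to S3."""
    df = pd.read_csv(f"{JHU_BASE}time_series_covid19_{metric}_global.csv")

    # Wide-to-long: one row per (location, date) instead of one column per date.
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    df = df.melt(id_vars=id_cols, var_name="date", value_name=metric)
    df["date"] = pd.to_datetime(df["date"])

    # pandas uses s3fs under the hood for s3:// paths; bucket/key names are illustrative.
    df.to_csv(f"s3://{bucket}/staging/{metric}_global.csv", index=False)

for metric in ("confirmed", "deaths", "recovered"):
    stage_metric(metric)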

Visualized as a DAG

[DAG graph view]
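
The pipeline is orchestrated with Airflow (the docker-compose setup below starts the Airflow servers). A minimal sketch of how such a DAG could be wired, assuming Airflow 2.x with the Postgres provider installed; the task ids, SQL file paths, connection id, and the stage_all_metrics helper are hypothetical, not the project's actual DAG.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

from stage_covid import stage_all_metrics  # hypothetical helper that cleans and writes to S3

with DAG(
    dag_id="covid_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 1. Clean the Johns Hopkins CSVs with pandas and stage them to S3.
    stage_to_s3 = PythonOperator(
        task_id="stage_to_s3",
        python_callable=stage_all_metrics,
    )

    # 2. Copy the staged files from S3 into RDS and build the star schema.
    load_star_schema = PostgresOperator(
        task_id="load_star_schema",
        postgres_conn_id="rds_default",
        sql="sql/load_star_schema.sql",
    )

    # 3. Build the final BI tables (county, state, country) that the Dash app queries.
    build_bi_tables = PostgresOperator(
        task_id="build_bi_tables",
        postgres_conn_id="rds_default",
        sql="sql/build_bi_tables.sql",
    )

    stage_to_s3 >> load_star_schema >> build_bi_tables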

Prerequisites

Data Pipeline Prerequisites

  1. Docker
  2. docker-compose
  3. AWS account for S3 and RDS

Dash App Prerequisites

Local

  1. Connection to the RDS database set up by the data pipeline
  2. Python and all the Python libraries in requirements.txt

Server

  1. All local prereqs
  2. Heroku account
  3. The libraries in requirements.txt installed in a virtual environment (more instructions below)

Running Project

Running data pipeline

  • If running for the first time, add your AWS credentials to pipeline/dags/config/aws_config.json. An example file is provided for reference.

  • To build and start the project, run:

docker-compose up -d

  • To shut down the Airflow servers and remove the Docker containers, run:

docker-compose down

Running dash app

Running Locally

If running for the first time, copy the example file and enter your credentials:

cd covid_compared/
cp dash_app/config/dash_credentials_example.json dash_app/config/dash_credentials.json
vi dash_app/config/dash_credentials.json

To run the Dash app locally:

python app_dash.py
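
For orientation, a stripped-down sketch of what an app like this does: read the credentials file, query a BI table in RDS, and render a comparison chart. The credential keys, table name, and column names below are assumptions for illustration, not the project's actual schema.

import json

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html
from sqlalchemy import create_engine

# Credentials file created from the example above; the key names here are assumed.
with open("dash_app/config/dash_credentials.json") as f:
    creds = json.load(f)

engine = create_engine(
    f"postgresql://{creds['user']}:{creds['password']}@{creds['host']}:5432/{creds['dbname']}"
)

# Pull a small slice of a country-level BI table (table/column names are illustrative).
df = pd.read_sql(
    "SELECT country, date, confirmed FROM bi_country WHERE country IN ('US', 'China')",
    engine,
)

app = Dash(__name__)
app.layout = html.Div(
    [
        html.H1("COVID-19 comparison"),
        dcc.Graph(figure=px.line(df, x="date", y="confirmed", color="country")),
    ]
)

if __name__ == "__main__":
    app.run(debug=True)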

Deploying on Heroku

If deploying for the first time, create a new app in Heroku and add the environment variables from the example file, with their corresponding values, under Heroku Dashboard -> Settings. More detailed instructions on adding environment variables to Heroku are available here.

To actually deploy, follow these instructions. The only difference is that this Dash app is in a sub-folder, so when pushing, run this instead:

cd covid_compared/
git subtree push --prefix dash_app/ heroku master

AWS Lambda

This app was originally deployed on AWS Lambda using zappa. Lambda's one flaw was significant initial page load times (up to 15 s) due to cold starts.

Thankfully, switching to Heroku ended up being easy, but I left the 'zappa' branch up in case someone wants to refer to it, since there aren't many examples of Dash + zappa deployments.

Inspiration

I chose this specific topic because when I went on a trip in Fall 2020 from Southern California, where I live, to Dallas, I remember being curious about which location I was more likely to catch COVID in. Getting the per capita number of cases for both locations manually was a pain, and so this project was born.

From a data engineering project perspective, I was inspired by this Meetup Analytics DE project.
