A COVID location comparison dashboard pulling data from an automated data pipeline.
The main purpose of this data engineering project is to create an end-to-end automated data pipeline that lets the end user easily query three different COVID fact metrics - `confirmed`, `deaths`, and `recovered` - across a number of different dimensions such as `location` and `time`.
In order to showcase my work, I have also created a live COVID dashboard that allows for regional comparisons across the aforementioned metrics; you can access it here.
Notable dashboard features:
- Allows comparisons across the same location level, e.g. `Beijing, China` vs `Montana, United States`
- Offers *per capita* and *absolute* numbers
- Top 5 most affected locations
- Johns Hopkins data -> S3 using Python pandas
  - The Johns Hopkins repo organizes its time series files by metric (confirmed, deaths, recovered) and by location (global and US).
  - Each of these 5 files is cleaned using pandas and then staged to S3 using the Python s3fs library (a minimal sketch of this staging step follows this list).
- SQL scripts that:
  - extract the data from S3 to RDS.
  - transform the data into a star schema:
    - the fact table contains the 3 metrics (confirmed, deaths, recovered).
    - the dimension tables are location, date, and coordinates.
  - create the final BI tables (county, state, and country) from the fact table; these are what our end application queries (a sketch of this step also follows the list).
- The end user can now access the updated data through the Dash app, deployed using Heroku and AWS Route 53.
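As a rough illustration of the staging step above, here is a minimal sketch of how one of the Johns Hopkins time series CSVs could be cleaned with pandas and written to S3 via s3fs. The bucket path, column handling, and output layout are assumptions for illustration, not the project's actual pipeline code.

```python
import pandas as pd

# Hypothetical example: stage the global "confirmed" time series to S3.
# The URL points at the public Johns Hopkins CSSE repo; the bucket/key
# below are placeholders, not the ones used by this project.
CSV_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_confirmed_global.csv"
)
S3_PATH = "s3://my-covid-staging-bucket/staging/confirmed_global.csv"

def stage_confirmed_global():
    # Read the wide-format time series (one column per date).
    df = pd.read_csv(CSV_URL)

    # Melt the date columns into rows so each record is (location, date, value).
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    df = df.melt(id_vars=id_cols, var_name="date", value_name="confirmed")
    df["date"] = pd.to_datetime(df["date"])

    # pandas uses s3fs under the hood when given an s3:// path, so this
    # writes the cleaned file straight to the staging bucket (assumes AWS
    # credentials are already configured in the environment).
    df.to_csv(S3_PATH, index=False)

if __name__ == "__main__":
    stage_confirmed_global()
```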
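Similarly, here is a minimal sketch of the kind of transform the SQL scripts run, assuming a Postgres RDS instance and psycopg2. Every table and column name here (`fact_covid_metrics`, `dim_location`, `dim_date`, `bi_country`) is a placeholder for illustration rather than the project's real schema.

```python
import psycopg2

# Illustrative aggregation from the star schema into a country-level BI table.
# Table/column names are placeholders, not the project's actual schema.
BUILD_COUNTRY_BI_TABLE = """
CREATE TABLE IF NOT EXISTS bi_country AS
SELECT
    l.country,
    d.date,
    SUM(f.confirmed) AS confirmed,
    SUM(f.deaths)    AS deaths,
    SUM(f.recovered) AS recovered
FROM fact_covid_metrics f
JOIN dim_location l ON f.location_id = l.location_id
JOIN dim_date     d ON f.date_id     = d.date_id
GROUP BY l.country, d.date;
"""

def build_bi_tables(conn_params: dict) -> None:
    # Connect to the RDS Postgres instance and run the aggregation;
    # the explicit commit makes the transaction boundary obvious.
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            cur.execute(BUILD_COUNTRY_BI_TABLE)
        conn.commit()
```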
- Docker
- docker-compose
- AWS account for S3 and RDS
- A connection to the RDS database set up by the data pipeline
- Python and all the Python libraries in requirements.txt
- All local prereqs
- Heroku account
- The Python packages in requirements.txt installed in a virtual environment (more instructions below)
- If running for the first time, add your credentials to `pipeline/dags/config/aws_config.json`. An example file is provided for reference.
- To build the project, run `docker-compose up -d`
- After the Airflow servers are up, go to http://localhost:8080/
  - username: `admin`
  - pw: `password`
- To turn off the Airflow servers and delete the Docker containers, run `docker-compose down`
If running for the first time, copy the example file and enter your credentials:
cd covid_compared/
cp dash_app/config/dash_credentials_example.json dash_app/config/dash_credentials.json
vi dash_app/config/dash_credentials.json
To deploy the Dash app locally, run:
python app_dash.py
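For reference, here is a minimal sketch of how a Dash-style app could read such a credentials file and query one of the BI tables. The JSON keys, table name, and connection details are assumptions for illustration, not necessarily what `dash_credentials.json` or `app_dash.py` actually use.

```python
import json

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical credential keys; check dash_credentials_example.json
# for the real ones expected by this project.
with open("dash_app/config/dash_credentials.json") as f:
    creds = json.load(f)

# Build a Postgres connection to the RDS instance holding the BI tables.
engine = create_engine(
    f"postgresql://{creds['user']}:{creds['password']}"
    f"@{creds['host']}:{creds['port']}/{creds['dbname']}"
)

# Pull one of the BI tables built by the pipeline (table name is illustrative).
df = pd.read_sql("SELECT * FROM bi_country", engine)
print(df.head())
```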
If deploying for the first time, create a new app in Heroku and add the environment variables from the example file (and their corresponding values) in the Heroku Dashboard -> Settings tab. More detailed instructions on adding environment variables to Heroku are available here.
To actually deploy, follow these instructions. The only difference is that this Dash app is in a sub-folder, so when pushing, run this instead:
cd covid_compared/
git subtree push --prefix dash_app/ heroku master
This app was originally deployed on AWS Lambda using zappa. Lambda's one flaw was significant initial website load times (up to 15s) due to cold starts.
Thankfully, switching to Heroku ended up being straightforward, but I left the `zappa` branch up in case someone wants to refer to it, since there aren't many examples of Dash + zappa deployments.
I chose this specific topic because when I went on a trip in Fall 2020 from Southern California, where I live, to Dallas, I remember being curious about which location I was more likely to catch COVID in. Getting the per capita number of cases for both locations manually was a pain, and so this project was born.
From a data engineering project perspective, I was inspired by this Meetup Analytics DE project.