Skip to content

Latest commit

 

History

History
94 lines (90 loc) · 4.88 KB

README.md

File metadata and controls

94 lines (90 loc) · 4.88 KB

Data Engineering ND #1: Data Modeling With Postgres

Overview

Data Modeling With Postgres is the first project of Udacity Data Engineering Nanodegree. It requires to create a Postgres database for a music streaming app and write an ETL pipeline that transfers data from files in two local directories into this database using Python and SQL.

The project has the following goals:

  • Model a relation database with Postgres for a star schema
  • Define fact and dimension tables
  • Insert data into tables
  • Create ETL pipeline with Python

Repo Contents

The project includes data folder with song and log datasets, as well as the following files:

  • test.ipynb: displays the first few rows of each table to check the database.
  • create_tables.py: drops and creates tables.
  • etl.ipynb: reads and processes a single file from data/song_data and data/log_dataand loads the data into tables. This notebook contains detailed instructions on the ETL process for each of the tables.
  • etl.py: reads and processes files from data/song_data and data/log_data and loads them into tables.
  • sql_queries.py: contains all SQL queries.

Database

Database Purpose

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their music streaming app. The sparkifydb database is designed to optimize queries on song play analysis.

Data Description

The data resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.

Song Dataset data/song_data

This dataset is a subset of real data from Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. Below is an example of what a single song file data/song_data/A/A/B/TRAABJV128F1460C49.json looks like.

{
  "num_songs": 1, 
  "artist_id": "ARIK43K1187B9AE54C", 
  "artist_latitude": null, 
  "artist_longitude": null, 
  "artist_location": "Beverly Hills, CA", 
  "artist_name": "Lionel Richie", 
  "song_id": "SOBONFF12A6D4F84D8", 
  "title": "Tonight Will Be Alright", 
  "duration": 307.3824, 
  "year": 1986
}

Log Dataset data/log_data

This dataset consists of log files in JSON format generated by eventsim event simulator based on the songs in the song dataset. These simulate activity logs from a music streaming app based on specified configurations. Below is an example of what a single line of a single file data/log_data/2018/11/2018-11-09-events.json looks like.

{
  "artist":"Beastie Boys",
  "auth":"Logged In",
  "firstName":"Harper",
  "gender":"M",
  "itemInSession":2,
  "lastName":"Barrett",
  "length":161.56689,
  "level":"paid",
  "location":"New York-Newark-Jersey City, NY-NJ-PA",
  "method":"PUT",
  "page":"NextSong",
  "registration":1540685364796.0,
  "sessionId":275,
  "song":"Lighten Up",
  "status":200,
  "ts":1541722186796,
  "userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
  "userId":"42"
}

Database Design

The sparkifydb is a Postgres database filled with data from songs and log datasets. It is designed with star schema to optimize queries on song play analysis. The sparkifydb database contains the following tables:

  • Fact table
    • songplays: records in log data associated with song plays, i.e. records with page NextSong
  • Dimension tables
    • users: users in the app
    • songs: songs in music database
    • artists: artists in music database
    • time: timestamps of records in songplays broken down into specific units

The sparkifybd schema is shown below. Primary and foreign keys are marked as PK and FK, respectively. image

Getting Started

To run ETL pipeline locally,

  1. Clone this repo.
  2. cd into project directory.
  3. Run create_tables.py to create database and tables:
pyhton create_tables.py

To confirm the creation of tables with correct columns, run test.ipynb. Make sure to click Restart kernel to close the connection to the database after running this notebook.

  1. Run etl.py to fill the tables:
python etl.py

Remember to run create_tables.py before running etl.py to reset tables.

Authors

Alexandra Baturina