This repository is for the third Data Engineering Nanodegree project from Udacity. The project implements a data warehouse model and pipeline using Amazon S3 and Amazon Redshift. The main tasks are:
- Understanding the problem to solve
- Modeling the database and the pipeline
- Creating the database schema
- Developing the ETL in Python
Build and test an ETL pipeline for a database hosted on AWS Redshift using the data warehouse model. The data needs to be loaded from S3 into staging tables on Redshift, and SQL statements are then executed to create the analytics tables from those staging tables.
The project uses data from the Million Song Dataset, a freely available collection of audio features and metadata for a million contemporary popular music tracks (about 300 GB). The data is open for exploration and research; this project uses only a sample of the song data and artist information in JSON format.
- Song dataset:
JSON files are under the /data/song_data directory. The file format is:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
- Log dataset: JSON files are under the /data/log_data directory. The file format is:
{"artist":"Slipknot","auth":"Logged In","firstName":"Aiden","gender":"M","itemInSession":0,"lastName":"Ramirez","length":192.57424,"level":"paid","location":"New York-Newark-Jersey City, NY-NJ-PA","method":"PUT","page":"NextSong","registration":1540283578796.0,"sessionId":19,"song":"Opium Of The People (Album Version)","status":200,"ts":1541639510796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"20"}
The data is available in the Udacity S3 buckets:
Song data: s3://udacity-dend/song_data
Log data: s3://udacity-dend/log_data
Paths pointing to the S3 buckets are defined in the dwh.cfg file.
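For reference, the scripts can read those paths with Python's configparser; the section and option names below ([S3], LOG_DATA, SONG_DATA) are assumptions based on the project template:

```python
# Sketch: read the S3 paths from dwh.cfg (section/option names are assumptions).
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

log_data = config.get("S3", "LOG_DATA")    # e.g. s3://udacity-dend/log_data
song_data = config.get("S3", "SONG_DATA")  # e.g. s3://udacity-dend/song_data
print(log_data, song_data)
```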
The database is designed for analytics using fact and dimension tables in a star schema, plus staging tables to load the raw data from the S3 storage:
Staging Tables
staging_events - loads the raw data from the log events JSON files:
artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userAgent, userId
staging_songs - loads the raw data from the song JSON files:
num_songs, artist_id, artist_latitude, artist_longitude, artist_location, artist_name, song_id, title, duration, year
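As an illustration, the staging_songs definition in sql_queries.py could look roughly like the sketch below; the column types are assumptions, and the staging tables carry no constraints since they just mirror the raw JSON fields:

```python
# Sketch of a staging table definition, in the style of sql_queries.py.
# Column types are assumptions; the table mirrors the raw song JSON fields.
staging_songs_table_create = """
CREATE TABLE IF NOT EXISTS staging_songs (
    num_songs        INTEGER,
    artist_id        VARCHAR,
    artist_latitude  FLOAT,
    artist_longitude FLOAT,
    artist_location  VARCHAR,
    artist_name      VARCHAR,
    song_id          VARCHAR,
    title            VARCHAR,
    duration         FLOAT,
    year             INTEGER
);
"""
```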
Fact Table
songplays - records in the log data associated with song plays (i.e. records with page = NextSong):
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables
users - users in the app: user_id, first_name, last_name, gender, level
songs - songs in music database: song_id, title, artist_id, year, duration
artists - artists in music database: artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units: start_time, hour, day, week, month, year, weekday
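For illustration, the songplays fact table could be defined as in the sketch below; the IDENTITY surrogate key and the DISTKEY/SORTKEY choices are design assumptions, not project requirements:

```python
# Sketch of the fact table definition, in the style of sql_queries.py.
# The surrogate key and distribution/sort keys are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  INTEGER IDENTITY(0,1) PRIMARY KEY,
    start_time   TIMESTAMP NOT NULL SORTKEY,
    user_id      INTEGER NOT NULL,
    level        VARCHAR,
    song_id      VARCHAR DISTKEY,
    artist_id    VARCHAR,
    session_id   INTEGER,
    location     VARCHAR,
    user_agent   VARCHAR
);
"""
```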
The project structure is based on Udacity's project template:
- test.ipynb notebook to verify the ETL scripts' execution
- create_tables.py drops and creates the tables. Run this file to reset the tables before each run of the ETL script (see the sketch after this list)
- etl.py reads and processes files from song_data and log_data and loads them into the database tables
- sql_queries.py contains all the SQL queries to create and fill the tables
- README.md provides discussion on your project
- dwh.cfg configuration parameters (connection string and file paths)
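A minimal sketch of the create_tables.py pattern, assuming sql_queries.py exposes drop_table_queries and create_table_queries lists and that dwh.cfg has a [CLUSTER] section with HOST, DB_NAME, DB_USER, DB_PASSWORD and DB_PORT:

```python
# Sketch of create_tables.py: drop everything, then recreate the tables.
# Query list names and configuration keys are assumptions based on the template.
import configparser
import psycopg2
from sql_queries import drop_table_queries, create_table_queries


def run_queries(cur, conn, queries):
    """Execute each query in order and commit after each one."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    cluster = config["CLUSTER"]
    conn = psycopg2.connect(
        host=cluster["HOST"], dbname=cluster["DB_NAME"], user=cluster["DB_USER"],
        password=cluster["DB_PASSWORD"], port=cluster["DB_PORT"],
    )
    cur = conn.cursor()
    run_queries(cur, conn, drop_table_queries)    # reset the schema
    run_queries(cur, conn, create_table_queries)  # staging + analytics tables
    conn.close()


if __name__ == "__main__":
    main()
```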
The ETL process is implemented in the etl.py script. The data is first copied from the JSON files in S3 (songs and events) into the staging tables using Redshift COPY commands; SQL statements then extract the data from the staging tables into the fact and dimension tables.
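These two steps can be pictured with the query sketches below, written as strings in the style of sql_queries.py. The {placeholders} would be filled in from dwh.cfg and must resolve to single-quoted values (S3 paths, IAM role ARN); the REGION, the JSON path option and the join condition are assumptions, not the exact project code:

```python
# Sketch of the two ETL steps: COPY from S3 into staging, then INSERT ... SELECT
# into an analytics table. Placeholders and options are assumptions.
staging_events_copy = """
COPY staging_events
FROM {log_data}
IAM_ROLE {iam_role_arn}
REGION 'us-west-2'
FORMAT AS JSON {log_jsonpath};
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
       e.userId::INTEGER, e.level, s.song_id, s.artist_id,
       e.sessionId, e.location, e.userAgent
FROM staging_events e
JOIN staging_songs s
  ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
"""
```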
A. Components required
- AWS account
- IAM user with an administrative role to allow remote connections
- Jupyter Notebook environment available
- Python packages: psycopg2 and python-sql
B. Running the pipeline
- Clone the repository
- Create IAM role and user
- Create the Redshift cluster and get the connection data
- Configure the connection values in the dwh.cfg file
- Run create_tables.py (drops the tables and creates them again)
- Run etl.py (runs the ETL process)
- Run test.ipynb notebook to validate the data (executes some SELECTs and counts; see the sketch below)
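The checks in test.ipynb can be as simple as a row count per table; a self-contained sketch (the connection keys in [CLUSTER] are assumptions based on the template):

```python
# Sketch of the sanity checks: count the rows in every table.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")
cluster = config["CLUSTER"]  # HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT are assumed keys
conn = psycopg2.connect(
    host=cluster["HOST"], dbname=cluster["DB_NAME"], user=cluster["DB_USER"],
    password=cluster["DB_PASSWORD"], port=cluster["DB_PORT"],
)
cur = conn.cursor()

for table in ["staging_events", "staging_songs", "songplays",
              "users", "songs", "artists", "time"]:
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])

conn.close()
```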
Fernando Bonilla linkedin