Music Streaming App ETL and NoSQL distributed Data Modeling for Analytics using Apache Cassandra and Docker
In this project, I apply what I've learned about data modeling with Apache Cassandra and complete an ETL pipeline using Python.
For this project, I'll be working with one dataset: event_data. The directory of CSV files is partitioned by date. Here are examples of file paths to two files in the dataset:
event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv
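Collecting the date-partitioned files into one CSV could be sketched roughly as follows. This is a minimal sketch, not the project's actual etl.py: the function name `merge_event_files` is hypothetical, and it assumes every daily file shares the same header row and is UTF-8 encoded.

```python
import csv
import glob
import os


def merge_event_files(source_dir, target_path):
    """Concatenate every *-events.csv file in source_dir into one CSV.

    Hypothetical helper; assumes all files share the same header row.
    Returns the number of data rows written (header excluded).
    """
    # Date-prefixed file names sort chronologically as plain strings.
    paths = sorted(glob.glob(os.path.join(source_dir, "*-events.csv")))
    rows_written = 0
    header_written = False
    with open(target_path, "w", newline="", encoding="utf8") as target:
        writer = csv.writer(target)
        for path in paths:
            with open(path, newline="", encoding="utf8") as source:
                reader = csv.reader(source)
                header = next(reader, None)
                if header is None:
                    continue  # skip completely empty files
                if not header_written:
                    writer.writerow(header)  # keep only the first header
                    header_written = True
                for row in reader:
                    if not any(cell.strip() for cell in row):
                        continue  # skip rows with no values at all
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `merge_event_files("event_data", "event_datafile_new.csv")` would then produce a single file to load from.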
- I will process the event_datafile_new.csv dataset to create a denormalized dataset
- I will model the data tables keeping in mind the analytical queries to be run
- I will load the data into tables I create in Apache Cassandra and run my queries
- Implement the logic in Part I of the notebook to iterate through each event file in event_data, process it, and create a new CSV file in Python
- In Part II of the notebook, include Apache Cassandra CREATE and INSERT statements to load the processed records into the relevant tables in my data model
- Design tables to answer the queries outlined in the project
- Write the Apache Cassandra CREATE KEYSPACE statement and set the keyspace (USE in CQL, or session.set_keyspace in the Python driver)
- Develop a CREATE statement for each of the tables to address each question
- Load the data with an INSERT statement for each of the tables
- Include IF NOT EXISTS clauses in CREATE statements to create tables only if the tables do not already exist.
- Include a DROP TABLE statement for each table, so that I can drop and recreate tables whenever I want to reset my database and test my ETL pipeline
- Test by running the proper SELECT statements with the correct WHERE clause
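The steps above can be sketched as a set of CQL strings plus a small loader. This is only an illustration under stated assumptions: the keyspace name `music_app`, the table `songs_by_session`, and its columns are hypothetical and not taken from the project. The `session` argument is expected to behave like a cassandra-driver `Session` (it needs `execute` and `set_keyspace`).

```python
KEYSPACE = "music_app"  # hypothetical keyspace name

CREATE_KEYSPACE = f"""
CREATE KEYSPACE IF NOT EXISTS {KEYSPACE}
WITH REPLICATION = {{'class': 'SimpleStrategy', 'replication_factor': 1}}
"""

DROP_TABLE = "DROP TABLE IF EXISTS songs_by_session"

# Hypothetical table: partitioned by session, clustered by item position,
# so the SELECT below can filter on the full primary key.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS songs_by_session (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    length float,
    PRIMARY KEY (session_id, item_in_session)
)
"""

INSERT_ROW = """
INSERT INTO songs_by_session (session_id, item_in_session, artist, song, length)
VALUES (%s, %s, %s, %s, %s)
"""

SELECT_SONG = """
SELECT artist, song, length FROM songs_by_session
WHERE session_id = %s AND item_in_session = %s
"""


def reset_and_load(session, rows):
    """Recreate the table from scratch and insert the given row tuples."""
    session.execute(CREATE_KEYSPACE)
    session.set_keyspace(KEYSPACE)
    session.execute(DROP_TABLE)    # reset step: safe to rerun the pipeline
    session.execute(CREATE_TABLE)
    for row in rows:
        session.execute(INSERT_ROW, row)
```

After loading, a query would look like `session.execute(SELECT_SONG, (session_id, item_in_session))`. The WHERE clause only uses primary key columns, which is the reason each table is designed around the query it has to answer.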
Docker is used to:
- Develop the data servers using the Cassandra image
- Set up the Jupyter notebook used throughout the project
image: cassandra:4.1.0, with environment variables set up to mimic the Apache Cassandra database server
image: jupyter/datascience-notebook:x86_64-ubuntu-22.04, which provides the Jupyter notebook used throughout the project
The two containers share a network, so the Jupyter notebook and Python scripts can load data into and read data from the database.
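Connecting from the notebook container to the database container might look like the sketch below. It assumes the Cassandra service is reachable under the hostname `cassandra` on the shared network and that the `cassandra-driver` package is installed; the environment variable `CASSANDRA_HOSTS` and both function names are hypothetical.

```python
import os


def cassandra_contact_points(default="cassandra"):
    """Read a comma-separated host list from CASSANDRA_HOSTS (hypothetical)."""
    hosts = os.environ.get("CASSANDRA_HOSTS", default)
    return [host.strip() for host in hosts.split(",") if host.strip()]


def connect(keyspace=None):
    """Open a session against the Cassandra container.

    The driver is imported lazily so this module loads without it.
    """
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(cassandra_contact_points(), port=9042)
    session = cluster.connect()
    if keyspace:
        session.set_keyspace(keyspace)
    return cluster, session
```

Inside a compose network, containers can usually address each other by service name, which is why the default host here is a name rather than an IP address.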
- docker-compose.yml: set up docker and configurations
- etl.py: ETL script
- sql_queries: contains the SQL queries used in the project
- requirement.txt: required packages for the project
- NOSQL Data Modeling and ETL pipeline With Apache Cassandra: Project notebook
- datasets dir: contains the compressed datasets used in the project; unzip them before use
- data dir: mounted as a container volume so that the stored data persists whenever I run the container
- images: contains the image files used in the Jupyter notebook
- clone the repository
- run
docker compose up
- go to the Jupyter server
- open the terminal and run
pip install -r requirement.txt
- run the notebook
- run
python etl.py