Music Streaming App ETL and NoSQL distributed Data Modeling for Analytics using Apache Cassandra and Docker

Modeling with Cassandra

In this project, I apply what I've learned about data modeling with Apache Cassandra and complete an ETL pipeline using Python.

Datasets

For this project, I'll be working with one dataset: event_data. The directory of CSV files is partitioned by date. Here are examples of file paths to two files in the dataset:

event_data/2018-11-08-events.csv
event_data/2018-11-09-events.csv
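
Because each file name starts with the event date, the date can be recovered directly from the path. A minimal sketch, assuming every file follows the YYYY-MM-DD-events.csv pattern of the two example paths; the helper name parse_event_date is hypothetical and not part of the project code:

```python
import os
from datetime import date, datetime


def parse_event_date(path: str) -> date:
    """Return the date encoded in a file name like '2018-11-08-events.csv'."""
    file_name = os.path.basename(path)
    # Assumption: the first 10 characters hold the ISO date, e.g. '2018-11-08'.
    return datetime.strptime(file_name[:10], "%Y-%m-%d").date()


paths = [
    "event_data/2018-11-08-events.csv",
    "event_data/2018-11-09-events.csv",
]
print([parse_event_date(p).isoformat() for p in paths])
# prints ['2018-11-08', '2018-11-09']
```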

Note: the data is compressed and stored in the datasets directory.

The project template includes one Jupyter Notebook file, in which:

I will process the event_datafile_new.csv dataset to create a denormalized dataset
I will model the data tables keeping in mind the analytical queries to be run
I will load the data into tables I create in Apache Cassandra and run my queries

Project Steps

Step 1: Build ETL Pipeline

Implement the logic in Part I of the notebook to iterate through each event file in event_data, process it, and create a new CSV file in Python
Write the Apache Cassandra CREATE and INSERT statements in Part II of the notebook to load the processed records into the relevant tables in my data model
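
The Part I step can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function name combine_event_files is hypothetical, every input file is assumed to share the same header row, and rows with no values at all are assumed to carry no event.

```python
import csv
import glob
import os


def combine_event_files(src_dir: str, out_path: str) -> int:
    """Merge every CSV in src_dir into one file; return the data rows written."""
    rows_written = 0
    header_written = False
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        # Sorting keeps the date-partitioned files in chronological order.
        for path in sorted(glob.glob(os.path.join(src_dir, "*.csv"))):
            if os.path.abspath(path) == out_abs:
                continue  # never read the output file back in
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # completely empty file
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    # Assumption: a row with only blank cells is not an event.
                    if any(cell.strip() for cell in row):
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

A call such as combine_event_files("event_data", "event_datafile_new.csv") would then produce the single CSV file that Part II loads into Cassandra.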

Step 2: Model the NoSQL Database for Apache Cassandra

Design tables to answer the queries outlined in the project
Write the Apache Cassandra CREATE KEYSPACE statement and set the keyspace (USE in CQL, or session.set_keyspace() in the Python driver)
Develop a CREATE TABLE statement for each table to address each question
Load the data with an INSERT statement for each of the tables
Include IF NOT EXISTS clauses in the CREATE statements so that tables are created only if they do not already exist
Include a DROP TABLE statement for each table, so that I can drop and recreate the tables whenever I want to reset my database and test my ETL pipeline
Test by running the proper SELECT statements with the correct WHERE clause
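
The statements listed above might look roughly like this for a single table. All names here (the music_app keyspace, the songs_by_session table, its columns, and the reset_and_load helper) are hypothetical placeholders, since the real ones depend on the queries outlined in the project. The partition key is chosen to match the WHERE clause of the example SELECT, which is the usual Cassandra approach of one table per query.

```python
# Hypothetical keyspace, table, and column names, for illustration only.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS music_app
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

DROP_TABLE = "DROP TABLE IF EXISTS songs_by_session"

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS songs_by_session (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    length float,
    PRIMARY KEY (session_id, item_in_session)
)
"""

# The Python driver uses %s placeholders for simple (non-prepared) statements.
INSERT_ROW = """
INSERT INTO songs_by_session (session_id, item_in_session, artist, song, length)
VALUES (%s, %s, %s, %s, %s)
"""

SELECT_ROW = """
SELECT artist, song, length FROM songs_by_session
WHERE session_id = %s AND item_in_session = %s
"""


def reset_and_load(session, rows):
    """Recreate the table from scratch and insert the given rows."""
    session.execute(CREATE_KEYSPACE)
    session.set_keyspace("music_app")
    session.execute(DROP_TABLE)
    session.execute(CREATE_TABLE)
    for row in rows:
        session.execute(INSERT_ROW, row)
```

Here session is expected to be a cassandra-driver Session object. After loading, session.execute(SELECT_ROW, (session_id, item_in_session)) would run the test query against the table.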

Docker

Docker is used to:

Run the database server using the Cassandra image
Set up the Jupyter notebook used throughout the project

Image Details

image: cassandra:4.1.0, set up with environment variables to mimic the Apache Cassandra database server

image: jupyter/datascience-notebook:x86_64-ubuntu-22.04, which provides the Jupyter notebook used throughout the project

The two containers share a network, so the Jupyter notebook and Python can load data to and from the database
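
From inside the Jupyter container, the Cassandra container is normally reachable by its Compose service name rather than by localhost. A minimal sketch, assuming the service is named cassandra and listens on the default port 9042; the CASSANDRA_HOSTS environment variable and both function names are hypothetical and would need to match the actual docker-compose.yml:

```python
import os


def cassandra_contact_points(default_host: str = "cassandra") -> list:
    """Return the host names the notebook container should connect to."""
    # CASSANDRA_HOSTS is a hypothetical variable: comma-separated host names.
    raw = os.environ.get("CASSANDRA_HOSTS", default_host)
    return [host.strip() for host in raw.split(",") if host.strip()]


def connect():
    """Open a session; needs the cassandra-driver package and a running node."""
    # Imported lazily so the helper above works without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(cassandra_contact_points(), port=9042)
    return cluster, cluster.connect()
```

Calling cluster.shutdown() on the returned cluster at the end of the notebook or script closes the connection cleanly.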

Files:

docker-compose.yml: sets up the Docker services and their configuration
etl.py: the ETL script
sql_queries.py: contains the CQL queries used in the project
requirements.txt: required packages for the project
NOSQL Data Modeling and ETL pipeline With Apache Cassandra: the project notebook
datasets dir: contains the compressed datasets used in the project; unzip them before use
data dir: mounted as a container volume so that stored data persists across container runs
images: contains the image files used in the Jupyter notebook

How to run the project:

clone the repository
run docker compose up
go to the Jupyter server
open the terminal and run pip install -r requirements.txt
run the notebook
run python etl.py

mmosad19419/Music-Streaming-App-ETL-and-NoSQL-distributed-Data-Modeling-using-Apache-Cassandra-and-Docker
