
Realtime Data Streaming | End-to-End Data Engineering Project

Table of Contents

  • Overview
  • System Architecture
  • Technologies
  • Getting Started
  • License
  • My Links

Overview

In today's fast-paced, data-driven world, real-time data streaming is crucial for handling large volumes of data efficiently and making time-sensitive decisions. Whether it's delivering live updates, monitoring system events, or analyzing clickstreams, businesses rely on the ability to collect, process, and store data as it flows in real time.

To explore this further, I developed an end-to-end data engineering pipeline that automates the ingestion, processing, and storage lifecycle using a scalable, modern tech stack. The project combines Apache Airflow, Kafka, Spark, Cassandra, and Docker to streamline the workflow from source to storage, making it suitable for both real-time and batch processing use cases.

System Architecture

The project is designed with the following components:

  • Data Source: The randomuser.me API generates random user data for the pipeline.
  • Apache Airflow: Orchestrates the pipeline and stores the fetched data in a PostgreSQL database (see the DAG sketch after this list).
  • Apache Kafka and Zookeeper: Stream the data from PostgreSQL to the processing engine.
  • Control Center and Schema Registry: Provide monitoring and schema management for the Kafka streams.
  • Apache Spark: Processes the data with its master and worker nodes.
  • Cassandra: Stores the processed data.
  • Docker: Containerizes the entire pipeline.
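
To make the orchestration layer concrete, here is a minimal sketch of an Airflow DAG that pulls one record from the randomuser.me API. It assumes the Airflow 2.x API; the DAG id, schedule, and printed fields are illustrative placeholders, not the project's actual code:

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def fetch_user():
        # Pull one random user from the randomuser.me API
        res = requests.get("https://randomuser.me/api/").json()["results"][0]
        print(res["name"]["first"], res["name"]["last"], res["email"])

    # DAG id and schedule are illustrative placeholders
    with DAG(
        dag_id="user_automation",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="fetch_user", python_callable=fetch_user)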

We can monitor the messages being sent to the Kafka topic using Control Center.

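As a rough sketch of the producing side, the snippet below fetches one user and publishes it to Kafka with kafka-python. The broker address and the users_created topic name are assumptions; adapt them to your docker-compose setup:

    import json

    import requests
    from kafka import KafkaProducer

    # Fetch one random user from the randomuser.me API
    res = requests.get("https://randomuser.me/api/").json()["results"][0]
    user = {
        "first_name": res["name"]["first"],
        "last_name": res["name"]["last"],
        "email": res["email"],
    }

    # Publish the record to a Kafka topic (broker and topic
    # name are assumptions for illustration)
    producer = KafkaProducer(
        bootstrap_servers=["localhost:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("users_created", user)
    producer.flush()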

Technologies

  • Apache Airflow
  • Python
  • Apache Kafka
  • Apache Zookeeper
  • Apache Spark
  • Cassandra
  • PostgreSQL
  • Docker

Getting Started

  1. Clone the repository:

    git clone https://github.com/SidiahmedHABIB/e2e-data-engineering.git
  2. Navigate to the project directory:

    cd e2e-data-engineering
  3. Install the Python packages (note the PyPI names):

    pip install apache-airflow
    pip install kafka-python
    pip install pyspark
    pip install cassandra-driver
  4. Run Docker Compose to spin up the services:

    docker-compose up
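
Once the services are running, a Spark Structured Streaming job along these lines can consume the Kafka topic and write into Cassandra. This is a sketch, not the project's exact job: the topic, keyspace, and table names are assumptions, and the Kafka source and Cassandra connector packages must be supplied to Spark (e.g. via spark-submit --packages):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = (SparkSession.builder
             .appName("user-stream")
             .config("spark.cassandra.connection.host", "localhost")
             .getOrCreate())

    # Shape of the JSON messages on the topic (assumed fields)
    schema = StructType([
        StructField("first_name", StringType()),
        StructField("last_name", StringType()),
        StructField("email", StringType()),
    ])

    # Read the raw Kafka stream and parse the JSON payload
    users = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "users_created")
             .load()
             .selectExpr("CAST(value AS STRING) AS value")
             .select(from_json(col("value"), schema).alias("data"))
             .select("data.*"))

    # Stream the parsed rows into a Cassandra table
    (users.writeStream
     .format("org.apache.spark.sql.cassandra")
     .option("checkpointLocation", "/tmp/checkpoint")
     .option("keyspace", "spark_streams")
     .option("table", "created_users")
     .start()
     .awaitTermination())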

License

This project is licensed under the MIT License - see the LICENSE file for details.

My Links

Facebook | LinkedIn