This project builds an end-to-end, real-time data engineering pipeline that processes stock market data with Apache Kafka. It combines Kafka with several AWS services to ingest, store, catalog, and query streaming data.
- Programming Language: Python
- Amazon Web Services (AWS):
  - S3 (Simple Storage Service)
  - Athena
  - Glue Crawler
  - Glue Catalog
  - EC2
- Apache Kafka for real-time data streaming
- SQL for querying data and analysis
This project architecture leverages Kafka for real-time data ingestion and various AWS services for data storage, cataloging, and querying. It is designed to illustrate a typical data engineering workflow for managing large-scale, streaming data.
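
The ingestion and storage steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the `kafka-python` and `boto3` packages. The topic name, bucket name, key layout, and CSV column names are all hypothetical. The Kafka and S3 clients are imported lazily, so the helper functions can be tried without a broker or AWS credentials.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator

TOPIC = "stock-market-demo"  # hypothetical topic name


def read_rows(csv_text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(csv_text))


def encode(row: Dict[str, str]) -> bytes:
    """Serialize a row as UTF-8 JSON; this is the payload sent to Kafka."""
    return json.dumps(row, sort_keys=True).encode("utf-8")


def s3_key(prefix: str, partition: int, offset: int) -> str:
    """Build a unique object key from the message's partition and offset."""
    return f"{prefix}/stock_market_p{partition}_o{offset}.json"


def publish(rows: Iterable[Dict[str, str]],
            send: Callable[[str, bytes], object]) -> int:
    """Send every row through `send(topic, payload)`; return the row count."""
    count = 0
    for row in rows:
        send(TOPIC, encode(row))
        count += 1
    return count


def run_producer(csv_text: str, bootstrap_servers: str) -> None:
    """Stream CSV rows into Kafka (requires a reachable broker)."""
    from kafka import KafkaProducer  # kafka-python, imported lazily

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    publish(read_rows(csv_text),
            lambda topic, payload: producer.send(topic, value=payload))
    producer.flush()  # block until buffered messages are delivered


def run_consumer(bootstrap_servers: str, bucket: str, prefix: str = "raw") -> None:
    """Copy each Kafka message into S3 as its own JSON object."""
    import boto3  # imported lazily; needs AWS credentials at call time
    from kafka import KafkaConsumer

    s3 = boto3.client("s3")
    consumer = KafkaConsumer(TOPIC, bootstrap_servers=bootstrap_servers)
    for message in consumer:  # loops until the process is stopped
        s3.put_object(
            Bucket=bucket,
            Key=s3_key(prefix, message.partition, message.offset),
            Body=message.value,
        )
```

In this sketch, a Glue crawler pointed at the `raw/` prefix would infer a table schema from the JSON objects, and Athena could then query that table with SQL. Writing one object per message keeps the example simple; a production pipeline would more likely batch messages into larger files.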
The project is adaptable to different datasets; its emphasis is on the operational aspects of building and managing the data pipeline. The dataset is available in the files section.