This project performs sentiment analysis on Reddit posts related to the war in Palestine. It uses a streaming data pipeline that covers data ingestion, message brokering, stream processing, machine learning for sentiment prediction, and dashboard visualization. All services are containerized with Docker and managed with Docker Compose.
- Reddit API: Source of data, fetching posts related to the war in Palestine.
- Ingestion Script: Python script to fetch data from Reddit API.
- Kafka: Message broker to handle streaming data.
- Apache Spark: For stream processing.
- Fine-tuned BERT Model: Machine learning model to analyze sentiment.
- Cassandra: Database to store processed data.
- Grafana: Dashboard to visualize data.
- Kafdrop: Kafka monitoring tool.
- Docker: Containerization of services.
- FastAPI: API service to expose the prediction model.
- Docker
- Docker Compose
- Clone the Repository:
    git clone https://github.com/yourusername/reddit-sentiment-analysis.git
    cd reddit-sentiment-analysis
- Build and Start Services:
    docker-compose up --build
- Accessing Services:
  - Kafka Monitoring: http://localhost:9000
  - Spark Master: http://localhost:8080
  - Model Service: http://localhost:8081
  - Grafana Dashboard: http://localhost:3000
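Once the stack is up, it can be handy to confirm that each service is listening before opening the dashboards. The following is a minimal sketch, not part of the repository: it assumes the services are published on `localhost` at the ports listed above, and the helper names `is_port_open` and `report` are made up for illustration.

```python
import socket

# Ports taken from the service list above; the host is assumed to be localhost.
SERVICES = {
    "Kafdrop (Kafka monitoring)": 9000,
    "Spark Master": 8080,
    "Model Service": 8081,
    "Grafana": 3000,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def report(host: str = "localhost") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

An open port only shows that something is listening; it does not confirm that the service behind it is healthy.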
- Purpose: Fetches data from Reddit and sends it to Kafka.
- Technology: Python, Kafka
- Key Files:
  - reddit-producer.py: Main script to fetch and send data.
  - config.yaml: Configuration file for Reddit API and Kafka.
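The core of an ingestion step like this is turning each Reddit post into a serialized message and handing it to a Kafka producer. The sketch below illustrates that idea only and is not the contents of `reddit-producer.py`: the topic name `reddit-posts`, the selected fields, and the function names are assumptions, and the producer is passed in as a plain callable so the logic can run without a broker.

```python
import json
from typing import Callable, Iterable


def to_kafka_message(post: dict) -> bytes:
    """Serialize the fields of interest as UTF-8 encoded JSON."""
    record = {
        "id": post["id"],
        "title": post.get("title", ""),
        "text": post.get("selftext", ""),  # Reddit's field for the post body
        "created_utc": post.get("created_utc"),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def publish_posts(
    posts: Iterable[dict],
    send: Callable[[str, bytes], None],
    topic: str = "reddit-posts",  # hypothetical topic name
) -> int:
    """Send every post to the topic and return how many were sent."""
    count = 0
    for post in posts:
        send(topic, to_kafka_message(post))
        count += 1
    return count


if __name__ == "__main__":
    sample = [
        {"id": "abc123", "title": "Example", "selftext": "Body", "created_utc": 1700000000}
    ]
    # Print instead of sending, so the sketch runs without Kafka.
    publish_posts(sample, lambda topic, value: print(topic, value))
```

With the `kafka-python` client, for example, `send` could be `lambda t, v: producer.send(t, value=v)` for a `KafkaProducer` instance; whether the project uses that client is not stated here.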
- Purpose: Processes streaming data from Kafka.
- Technology: Apache Spark
- Key Files:
  - spark-streaming.py: Spark job to process and analyze data.
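For orientation, a Structured Streaming job that reads JSON messages from Kafka typically looks something like the sketch below. It is an assumption-heavy illustration rather than the actual `spark-streaming.py`: the broker address `kafka:9092`, the topic `reddit-posts`, and the message schema are all hypothetical, and the Kafka source also requires the `spark-sql-kafka` connector package on the Spark classpath.

```python
import json


def parse_post(raw: bytes) -> dict:
    """Decode one Kafka message value from UTF-8 JSON (useful for unit tests)."""
    return json.loads(raw.decode("utf-8"))


def build_stream(spark, bootstrap_servers="kafka:9092", topic="reddit-posts"):
    """Return a streaming DataFrame of parsed posts read from Kafka.

    `spark` is an existing SparkSession; the address, topic, and schema
    below are assumptions made for this sketch.
    """
    # Imported lazily so this module loads even without PySpark installed.
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([
        StructField("id", StringType()),
        StructField("title", StringType()),
        StructField("text", StringType()),
    ])

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )

    # Kafka delivers the value as binary: cast to string, then parse the JSON.
    return (
        raw.select(from_json(col("value").cast("string"), schema).alias("post"))
        .select("post.*")
    )
```

The returned DataFrame would still need a sink, for example a `writeStream` that calls the model service and stores the results in Cassandra.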
- Purpose: Provides an API to predict sentiment using a fine-tuned BERT model.
- Technology: FastAPI, PyTorch
- Key Files:
  - app.py: FastAPI application.
  - Model files in the model/ directory.
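A client call to the model service could look like the sketch below. The route `/predict` and the `{"text": ...}` request body are assumptions, since the endpoint exposed by `app.py` is not described here; only the port 8081 comes from the service list. Check the FastAPI application (or its auto-generated `/docs` page) for the real route and schema.

```python
import json
import urllib.request

# Port 8081 is from the service list; the "/predict" route is an assumption.
MODEL_URL = "http://localhost:8081/predict"


def build_request(text: str, url: str = MODEL_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying the text to classify."""
    body = json.dumps({"text": text}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, url: str = MODEL_URL, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the model service to be running.
    print(predict("An example Reddit post title"))
```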
- Purpose: Visualizes the processed data.
- Technology: Grafana
- Key Files:
  - grafana.ini: Configuration file for Grafana.
  - cassandra.yaml: Datasource configuration for Cassandra.
- Kafdrop: Monitor Kafka topics and brokers at http://localhost:9000
- Grafana: Visualize data and monitor metrics at http://localhost:3000
Feel free to contribute by opening issues or submitting pull requests. For major changes, please open an issue first to discuss what you would like to change.