For a Big Data course, we had to process large datasets while also tackling an NLP problem (sentiment analysis).
Our goals were:
- Check how PySpark scales with large datasets
- Observe the benefits of data distribution on our processes
- Ensure satisfactory sentiment prediction results
We used two datasets:
- Sentiment140
- A custom dataset fetched from the Twitter public API
Technologies used: PySpark, Spark Streaming, Kafka
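
As a rough illustration of the Sentiment140 data mentioned above, here is a minimal sketch of how its rows might be turned into (label, text) pairs before any model training. It uses only the Python standard library to show the file format and is not the project's actual PySpark code. The column names, the `parse_sentiment140` helper, and the sample rows are assumptions made for this example. The polarity coding (0 = negative, 2 = neutral, 4 = positive) and the six-column, header-less layout follow the public Sentiment140 release.

```python
import csv
import io

# Column order of the public Sentiment140 CSV files (they ship without a header row).
# The names themselves are our own choice for this sketch.
SENTIMENT140_COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# Sentiment140 encodes polarity as 0 (negative), 2 (neutral), 4 (positive).
POLARITY_TO_LABEL = {"0": "negative", "2": "neutral", "4": "positive"}


def parse_sentiment140(lines):
    """Yield (label, text) pairs from an iterable of Sentiment140 CSV lines."""
    for row in csv.reader(lines):
        if len(row) != len(SENTIMENT140_COLUMNS):
            continue  # skip malformed rows instead of failing the whole run
        record = dict(zip(SENTIMENT140_COLUMNS, row))
        label = POLARITY_TO_LABEL.get(record["polarity"])
        if label is not None:  # ignore rows with an unknown polarity code
            yield label, record["text"]


# Hypothetical sample rows, made up for illustration only.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired, bad day"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","loving this weather"\n'
)
pairs = list(parse_sentiment140(sample))
print(pairs)
```

The same parsing and label mapping could presumably be expressed as a distributed transformation once the file is loaded into Spark. The point here is only the row layout and the polarity codes.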