GitHub - ledinhduy281/MapReduce-Implementation-in-PySpark: In this work, we will perform basic data processing tasks using PySpark.

MapReduce Implementation using PySpark in Google Colab

In this work, we will perform basic data processing tasks using PySpark.

The Amazon_Responded_Oct05.csv file contains information about 400K tweets. This implementation uses three of its columns: user_id_str, user_followers_count, and text_.

Tasks:

Specifically, we need to perform the following steps:

1. Read and load the data.
2. Extract the columns user_id_str, user_followers_count, and text_.
3. Remove duplicated user IDs: some users have different follower counts in different rows. In that case, keep only the maximum follower count for each user.
4. Find popular users: filter the (user ID, follower count) pairs produced in step 3 to keep users with more than 5000 followers.
5. Count word frequencies: count how often each word appears in the tweets posted by the popular users from step 4, and report the top 10 most frequent words with their counts.
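
The five steps above could be sketched with the PySpark RDD API roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the helper names (`parse_followers`, `tokenize`, `top_words`, `main`), the tokenization rule, and the CSV reader options are all hypothetical choices made for illustration. The column names are taken from step 2 as written.

```python
import re
from operator import add

FOLLOWER_THRESHOLD = 5000  # step 4: "popular" means strictly more than this
TOP_N = 10                 # step 5: number of words to report


def parse_followers(value):
    """Return the follower count as an int, or None for empty/malformed cells."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", (text or "").lower())


def top_words(rows):
    """rows: RDD of (user_id_str, user_followers_count, text_) tuples."""
    # Drop rows whose user id or follower count cannot be used.
    rows = rows.filter(
        lambda r: r[0] is not None and parse_followers(r[1]) is not None
    )

    # Step 3: map to (user, followers) pairs, reduce by key with max.
    max_followers = rows.map(
        lambda r: (r[0], parse_followers(r[1]))
    ).reduceByKey(max)

    # Step 4: keep users above the follower threshold.
    popular = max_followers.filter(lambda kv: kv[1] > FOLLOWER_THRESHOLD)

    # Step 5: join tweets with popular users, then run a classic word count.
    tweets = rows.map(lambda r: (r[0], r[2]))
    popular_tweets = tweets.join(popular).map(lambda kv: kv[1][0])
    counts = (
        popular_tweets.flatMap(tokenize)
        .map(lambda word: (word, 1))
        .reduceByKey(add)
    )
    return counts.takeOrdered(TOP_N, key=lambda kv: -kv[1])


def main(path="Amazon_Responded_Oct05.csv"):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MapReduceSketch").getOrCreate()
    # Steps 1-2: load the CSV and keep the three columns of interest.
    # multiLine/escape are assumptions to cope with quoted tweet text.
    df = spark.read.csv(path, header=True, multiLine=True, escape='"')
    rows = df.select("user_id_str", "user_followers_count", "text_").rdd.map(tuple)
    for word, count in top_words(rows):
        print(word, count)
    spark.stop()
```

In a Colab cell, calling `main()` after installing PySpark and uploading the CSV would print the ten most frequent words with their counts. The join in step 5 is one possible design. For a small set of popular users, broadcasting their IDs and filtering the tweets would avoid the shuffle.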

Files in this repository:

.gitattributes
Amazon_Responded_Oct05.csv
MapReduce_PySpark.ipynb
Output.txt
README.md