
MapReduce Implementation using PySpark in Google Colab

In this work, we will perform a basic data processing task using PySpark.

The file Amazon_Responded_Oct05.csv contains information on 400K tweets. The following three columns will be used in this implementation (a loading sketch follows the list):

  1. user_id_str: user ID

  2. user_followers_count: the number of followers

  3. text_: the text of tweets
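
Below is a minimal loading sketch, assuming a local SparkSession in Colab and that the CSV has a header row; the app name `AmazonTweets` is just a placeholder.

```python
from pyspark.sql import SparkSession

# In Colab, install PySpark first: !pip install pyspark
spark = SparkSession.builder.appName("AmazonTweets").getOrCreate()

# Load the CSV; the header row supplies the column names
df = spark.read.csv("Amazon_Responded_Oct05.csv", header=True, inferSchema=True)

# Keep only the three columns used in this implementation
df = df.select("user_id_str", "user_followers_count", "text_")
df.show(5)
```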

Tasks:

  1. Find popular users who have more than 5,000 followers, and

  2. Get the top 10 most popular words from the tweets posted by these popular users

Specifically, we need to perform the following steps (a PySpark sketch follows the list):

  1. Read/load data

  2. Extract the columns (user_id_str, user_followers_count, and text_)

  3. Remove duplicate user IDs: some users have a different number of followers in different rows; in that case, we keep only the maximum follower count for each user.

  4. Find popular users: filter the (user, followers) pairs from step 3 to keep users with more than 5,000 followers.

  5. Count word frequency: count the frequency of words in the tweets posted by the popular users from step 4, and return the top 10 most popular words with their frequencies.
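
As referenced above, here is a minimal MapReduce-style sketch of steps 3-5 using the RDD API. It assumes the three-column DataFrame `df` from the loading sketch above, that follower counts parse as integers, and that simple whitespace splitting is enough tokenization (a real run would likely also strip punctuation and stop words).

```python
# Step 3: deduplicate user IDs, keeping the maximum follower count per user
followers = (
    df.rdd
      .map(lambda row: (row["user_id_str"], int(row["user_followers_count"] or 0)))
      .reduceByKey(max)                     # reduce by user ID, keep the max
)

# Step 4: filter popular users with more than 5,000 followers,
# then collect their IDs to the driver for the lookup below
popular_ids = set(
    followers.filter(lambda kv: kv[1] > 5000).keys().collect()
)

# Step 5: word frequency over tweets posted by popular users
top10 = (
    df.rdd
      .filter(lambda row: row["user_id_str"] in popular_ids)
      .flatMap(lambda row: (row["text_"] or "").lower().split())  # map: emit words
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)      # reduce: sum counts per word
      .takeOrdered(10, key=lambda kv: -kv[1])
)
print(top10)
```

Collecting `popular_ids` to the driver works here because the set of popular users is small relative to 400K tweets; for a larger key set, a broadcast variable or a join would be the usual choice.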