In this work, we perform a basic data processing task using PySpark.
The file Amazon_Responded_Oct05.csv contains information on 400K tweets. The following three columns are used in this implementation:
- user_id_str: user ID
- user_followers_count: the number of followers
- text_: the text of tweets
Tasks:
- Find the popular users, i.e. those with more than 5000 followers, and
- Get the top 10 most popular words from the tweets posted by these popular users.
Specifically, we need to do the following steps:
1. Read/load the data.
2. Extract the columns user_id_str, user_followers_count, and text_.
3. Remove duplicated user IDs: some users have different numbers of followers in different rows. In this case, we keep only the maximum number of followers for each user, giving one (user ID, follower count) pair per user.
4. Find popular users: filter the pairs from step 3 to keep the users who have more than 5000 followers.
5. Count word frequencies: count the frequency of each word in the tweets posted by the popular users from step 4, and get the top 10 most popular words together with their frequencies.
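The steps above can be sketched with the RDD API roughly as follows. This is a minimal sketch under several assumptions that the task description does not state: the CSV has a header row with the three column names, fields are double-quoted and may span lines, a "word" is a lowercased whitespace-separated token, and every tweet of a popular user counts, including rows where that user's follower count was lower. The helper names (parse_followers, tokenize, top_words, main) are hypothetical.

```python
import re
from operator import add

MIN_FOLLOWERS = 5000  # "popular" means strictly more than this many followers
TOP_N = 10


def parse_followers(value):
    """Return the follower count as an int, or None if missing or malformed."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def tokenize(text):
    """Assumed word definition: lowercase, then split on whitespace."""
    return (text or "").lower().split()


def top_words(rows, min_followers=MIN_FOLLOWERS, n=TOP_N):
    """rows: RDD of (user_id, followers_raw, text) tuples."""
    # Step 3: one (user_id, followers) pair per user, keeping the maximum.
    followers = (
        rows.map(lambda r: (r[0], parse_followers(r[1])))
        .filter(lambda kv: kv[0] is not None and kv[1] is not None)
        .reduceByKey(max)
    )
    # Step 4: keep only users with more than min_followers followers.
    popular = followers.filter(lambda kv: kv[1] > min_followers)
    # Step 5: join tweets with popular users, then count word frequencies.
    tweets = rows.map(lambda r: (r[0], r[2]))
    words = tweets.join(popular).flatMap(lambda kv: tokenize(kv[1][0]))
    counts = words.map(lambda w: (w, 1)).reduceByKey(add)
    # Largest counts first.
    return counts.takeOrdered(n, key=lambda kv: -kv[1])


def main(path="Amazon_Responded_Oct05.csv"):
    # Imported here so the helpers above can be used without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PopularWords").getOrCreate()
    # Steps 1-2: load the file and extract the three columns.
    # The reader options are assumptions about how the file is quoted.
    df = spark.read.csv(path, header=True, multiLine=True, quote='"', escape='"')
    rows = df.select("user_id_str", "user_followers_count", "text_").rdd.map(tuple)
    for word, count in top_words(rows):
        print(word, count)
    spark.stop()
```

Calling main() runs the whole pipeline on the CSV file. With whitespace tokenization the top words will most likely be dominated by common words such as "the" or "to"; whether to strip punctuation or stop words is left open by the task description.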