This is an optional project developed for the course "054306 - UNSTRUCTURED AND STREAMING DATA ENGINEERING" during my studies at Politecnico di Milano.
The project analyzes tweets about Brexit and plots the most frequent words along different dimensions such as political stance, sentiment and language. The original CSV data are available at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KP4XRP.
The tweet data are gathered by a multithreaded Python script that interacts with the Twitter APIs in parallel, using multiple accounts. The script extracts the most salient words by applying intermediate filtering and transformation steps with the nltk library, and then stores the results at two granularities (per-tweet and per-user aggregate arrays of <word,count> tuples) in a MongoDB database.
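As a rough illustration of the word-extraction step, the sketch below turns tweet text into <word,count> tuples at both granularities. It is not the project's code: the function names, the cleaning rules and the tiny `STOPWORDS` set are assumptions made for this example, and the stopword set stands in for the nltk stopword list so that the snippet runs without downloading any corpus.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set. The project relies on nltk,
# whose English stopword list would normally take the place of this set.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "not"}

TOKEN_RE = re.compile(r"[a-z]+")


def salient_word_counts(text):
    """Return (word, count) tuples for a single tweet, most frequent first."""
    # Lowercase, then drop URLs and @mentions before tokenizing.
    cleaned = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    tokens = TOKEN_RE.findall(cleaned)
    # Keep words longer than two characters that are not stopwords.
    words = [t for t in tokens if len(t) > 2 and t not in STOPWORDS]
    return Counter(words).most_common()


def aggregate_user(tweets):
    """Merge the per-tweet counts of one user into a single list of tuples."""
    total = Counter()
    for tweet in tweets:
        total.update(dict(salient_word_counts(tweet)))
    return total.most_common()


if __name__ == "__main__":
    print(salient_word_counts("Brexit is the Brexit deal https://t.co/x @user"))
    print(aggregate_user(["Brexit deal", "No deal Brexit vote"]))
```

Lists of this shape can be stored directly as array fields in a MongoDB document, one per tweet and one per user.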
Finally, several Python scripts extract the most frequently used words from the tweets written in English by running a set of map-reduce queries over the MongoDB repository. The output is a set of plots that take into account different parameters such as political stance, sentiment and language.
The same kind of analysis is performed for the four main European languages (IT, FR, DE, ES), but in this case the output is limited to the most used words for each language.
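The map-reduce idea behind the analysis can be sketched in plain Python, without a running MongoDB instance. This is an in-memory illustration only, not the queries used by the scripts: the document fields (`lang`, `stance`, `words`) and the sample data are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical documents: the field names and values are assumptions for
# illustration and do not reflect the project's actual schema.
TWEETS = [
    {"lang": "en", "stance": "leave", "words": [["brexit", 2], ["deal", 1]]},
    {"lang": "en", "stance": "remain", "words": [["brexit", 1], ["vote", 1]]},
    {"lang": "en", "stance": "leave", "words": [["deal", 3]]},
    {"lang": "it", "stance": "remain", "words": [["brexit", 1]]},
]


def map_tweet(doc):
    """Map phase: emit one ((stance, word), count) pair per stored word."""
    for word, count in doc["words"]:
        yield (doc["stance"], word), count


def reduce_counts(values):
    """Reduce phase: sum all counts emitted under the same key."""
    return sum(values)


def map_reduce(docs, lang):
    """Run map and reduce over the documents written in the given language."""
    groups = defaultdict(list)
    for doc in docs:
        if doc["lang"] != lang:
            continue
        for key, value in map_tweet(doc):
            groups[key].append(value)
    return {key: reduce_counts(values) for key, values in groups.items()}


def top_words(result, stance, n=10):
    """Rank the words of one stance by total count, ties broken alphabetically."""
    pairs = [(word, count) for (s, word), count in result.items() if s == stance]
    return sorted(pairs, key=lambda wc: (-wc[1], wc[0]))[:n]


if __name__ == "__main__":
    totals = map_reduce(TWEETS, "en")
    print(top_words(totals, "leave"))
```

Keying the map output on a (stance, word) pair is one way to get a separate ranking per dimension from a single pass. Swapping the stance for a sentiment label or a language code would follow the same pattern.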
- Install MongoDB (https://docs.mongodb.com/manual/administration/install-on-linux/)
- Extract the database archive ./db/mongoDB_backup/db_compressed.rar
- Import the database into MongoDB by running the following command from the main directory:
  mongorestore -d brexit ./db/mongoDB_backup/brexit/ -u Admin -p Password --authenticationDatabase admin
- Add Twitter API developer keys to the file ./twitter-analyzers/credentials.csv (the more accounts you use, the higher the interaction throughput)
- Run ./twitter-analyzers/multithread-tweets-analyzer.py to fetch new tweet data
- Run the Python scripts inside the ./analysis_scripts/ folder to generate the plots
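The note in the credentials step, that more accounts give a higher throughput, can be illustrated with one worker thread per account pulling from a shared queue. This is a minimal sketch under stated assumptions, not the code of multithread-tweets-analyzer.py: the CSV columns, the `fetch_tweet` placeholder and every function name are hypothetical, and no real Twitter API call is made.

```python
import csv
import io
import queue
import threading

# Hypothetical credentials layout. The real columns of credentials.csv are
# not documented here, so these names are assumptions.
SAMPLE_CSV = "consumer_key,consumer_secret\nkey-a,secret-a\nkey-b,secret-b\n"


def load_credentials(fileobj):
    """Read one credential dict per CSV row."""
    return list(csv.DictReader(fileobj))


def fetch_tweet(credential, tweet_id):
    """Placeholder for a real Twitter API request (hypothetical)."""
    return {"id": tweet_id, "fetched_with": credential["consumer_key"]}


def worker(credential, tasks, results):
    """Process tweet IDs with one account until the task queue is empty."""
    while True:
        try:
            tweet_id = tasks.get_nowait()
        except queue.Empty:
            return
        results.put(fetch_tweet(credential, tweet_id))


def fetch_all(credentials, tweet_ids):
    """Start one thread per credential and collect every fetched tweet."""
    tasks, results = queue.Queue(), queue.Queue()
    for tweet_id in tweet_ids:
        tasks.put(tweet_id)
    threads = [
        threading.Thread(target=worker, args=(cred, tasks, results))
        for cred in credentials
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    fetched = []
    while not results.empty():
        fetched.append(results.get())
    return fetched


if __name__ == "__main__":
    creds = load_credentials(io.StringIO(SAMPLE_CSV))
    print(len(fetch_all(creds, range(10))))
```

Because each account has its own rate limit, adding a row to the credentials file adds a worker that can make requests independently of the others.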