The assignment is to implement a parallelized application leveraging the University of Melbourne HPC facility SPARTAN. The application uses a large Twitter dataset and a file containing the suburbs, locations, and Greater Capital cities of Australia.
The project objectives are to:
- count the number of different tweets made in the Greater Capital cities of Australia,
- identify the Twitter accounts (users) that have made the most tweets, and
- identify the users that have tweeted from the most different Greater Capital cities.
For more information, please visit the project wiki.
Project Report: Overleaf
Name | Student ID | Email |
---|---|---|
Sunchuangyu Huang | 1118472 | sunchuangyuh@student.unimelb.edu.au |
Wei Zhao | 1118649 | weizhao1@student.unimelb.edu.au |
COMP90024 Cluster & Cloud Computing - Assignment 1 TwitterAnalyser
├── data                  # store raw data
│   ├── processed         # store processed data
│   └── result            # store output files
├── notebooks             # processing and visualisation notebooks
├── scripts               # main program scripts
├── slurm                 # slurm job scripts
├── doc
│   ├── log               # program log file
│   └── slurm
│       ├── stderr        # slurm standard error
│       └── stdout        # slurm standard output
├── requirements.txt      # python dependencies
└── README.md
For local testing, run the following commands:
# main.py must be in executable mode
mpiexec -n [NUM_PROCESSORS] python main.py -t [TWITTER_FILE] -s [SAL_FILE] -e [EMAIL_TARGET|OPTIONAL]
# to submit a job on the Spartan HPC, run the submission script
./submit.sh
Note: the email target has only two valid options: 'rin' / 'eric'.
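The `mpiexec` invocation above implies that `main.py` parses these flags itself. A minimal sketch of that argument handling follows; the short flag names and the two email targets come from the usage line above, while the long option names and help strings are assumptions:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI flags shown in the usage line above."""
    parser = argparse.ArgumentParser(description="TwitterAnalyser")
    parser.add_argument("-t", "--twitter", required=True,
                        help="path to the Twitter JSON file")
    parser.add_argument("-s", "--sal", required=True,
                        help="path to sal.json")
    # The README notes only two valid email targets
    parser.add_argument("-e", "--email", choices=["rin", "eric"], default=None,
                        help="optional email target")
    return parser.parse_args(argv)

args = parse_args(["-t", "bigTwitter.json", "-s", "sal.json", "-e", "rin"])
```

Each MPI rank runs the same script, so every rank parses the same arguments independently.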
Main Python dependencies: `python=3.7.4`, `mpi4py=3.0.4`, `polars`, `numpy`, `pandas`.
If running on Spartan, make sure to use a virtualenv with Python 3.7.4, since Spartan loads mpi4py version 3.0.4.
# hpc: load module on spartan
module --force purge
module load mpi4py/3.0.2-timed-pingpong
source ~/virtualenv/python3.7.4/bin/activate
# local: create a conda environment
conda env create --name comp90024 --file environment.yml
# install dependencies
pip install numpy pandas 'polars[all]' # or
pip install -r requirements.txt
The bigTwitter.json file contains tweets from 2021-07-05 to 2022-12-31.
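To process a file this large in parallel, a common approach is to split it by byte ranges so each MPI rank scans only its own slice. The sketch below is one way to do that (function names are hypothetical, not taken from the project code); a rank owns exactly the lines whose first byte falls inside its range, so no line is counted twice or skipped at a boundary:

```python
def chunk_bounds(file_size, rank, world_size):
    """Byte range [start, end) that this rank should scan."""
    chunk = file_size // world_size
    start = rank * chunk
    end = file_size if rank == world_size - 1 else start + chunk
    return start, end

def read_chunk_lines(path, start, end):
    """Yield every line whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()  # landed mid-line; the previous rank owns it
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line
```

With mpi4py, `rank` and `world_size` would come from `MPI.COMM_WORLD.Get_rank()` and `Get_size()`; each rank then parses only the lines it yields.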
Processing time on bigTwitter.json:
Job | Node | Core | Job Wall-Clock Time | CPU Efficiency |
---|---|---|---|---|
46094405 | 1 | 1 | 00:11:01 | 98.34% |
46094406 | 1 | 8 | 00:01:41 | 87.13% |
46094407 | 2 | 4 | 00:01:41 | 87.75% |
Task 1 Question: The solution should count the number of tweets made by the same individual based on the bigTwitter.json file and return the top 10 tweeters in terms of the number of tweets made, irrespective of where they tweeted. The result will be of the form below (where the author Ids and tweet numbers are representative).
Rank | Author Id | Number of Tweets Made |
---|---|---|
#1 | 1498063511204760000 | 68,477 |
#2 | 1089023364973210000 | 28,128 |
#3 | 826332877457481000 | 27,718 |
#4 | 1250331934242120000 | 25,350 |
#5 | 1423662808311280000 | 21,034 |
#6 | 1183144981252280000 | 20,765 |
#7 | 1270672820792500000 | 20,503 |
#8 | 820431428835885000 | 20,063 |
#9 | 778785859030003000 | 19,403 |
#10 | 1104295492433760000 | 18,781 |
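The per-rank tally for Task 1 could look like the sketch below. The `data -> author_id` field path and the line-delimited layout of bigTwitter.json are assumptions, and merging the per-rank `Counter`s onto rank 0 (e.g. with mpi4py's `comm.gather`) is left out:

```python
import json
from collections import Counter

def count_authors(lines):
    """Tally tweets per author over an iterable of raw JSON lines.

    Assumes each line holds one tweet object whose author id sits at
    data -> author_id (an assumed field path)."""
    counts = Counter()
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8", errors="replace")
        raw = raw.strip().rstrip(",")  # tolerate JSON-array framing
        if raw in ("", "[", "]"):
            continue
        try:
            tweet = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        author = tweet.get("data", {}).get("author_id")
        if author:
            counts[author] += 1
    return counts

def top_tweeters(counts, n=10):
    """Rank authors by tweet count, descending."""
    return counts.most_common(n)
```

Adding two `Counter`s merges their tallies, so rank 0 can sum the gathered partial counts before calling `top_tweeters`.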
Task 2 Question: Using the bigTwitter.json and sal.json files, you will then count the number of tweets made in the various capital cities by all users. The result will be a table of the form below (where the numbers are representative).
For this task, ignore tweets made by users in rural locations, e.g. 1rnsw (Rural New South Wales), 2rvic (Rural Victoria), etc.
Greater Capital City | Number of Tweets Made |
---|---|
1gsyd | 2,218,689 |
2gmel | 2,284,909 |
3gbri | 878,614 |
4gade | 465,081 |
5gper | 590,045 |
6ghob | 91,112 |
7gdar | 46,772 |
8acte | 214,347 |
9oter | 203 |
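The core of Task 2 is mapping a tweet's place name to one of the Greater Capital City codes in the table above. The sketch below assumes sal.json maps lower-cased suburb names to records with a `"gcc"` key and that place names look like `"Suburb, State"` (both assumptions about the data layout):

```python
from collections import Counter

# Greater Capital City codes from the table above; rural codes
# (e.g. 1rnsw, 2rvic) are deliberately excluded.
CAPITALS = {"1gsyd", "2gmel", "3gbri", "4gade",
            "5gper", "6ghob", "7gdar", "8acte", "9oter"}

def gcc_for_place(full_name, sal):
    """Resolve a place name to a Greater Capital City code, or None."""
    suburb = full_name.split(",")[0].strip().lower()
    record = sal.get(suburb)
    if record is None:
        return None
    gcc = record.get("gcc")
    return gcc if gcc in CAPITALS else None  # drop rural tweets

def count_city_tweets(place_names, sal):
    """Tally tweets per Greater Capital City for one rank's slice."""
    counts = Counter()
    for name in place_names:
        gcc = gcc_for_place(name, sal)
        if gcc:
            counts[gcc] += 1
    return counts
```

Real suburb names are ambiguous across states, so a production version would need extra disambiguation beyond this lookup.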
Task 3 Question: The solution should identify those tweeters that have tweeted in the most Greater Capital cities and the number of times they have tweeted from those locations. The top 10 tweeters making tweets from the most different locations should be returned; if there are equal numbers of locations, these should be ranked by the number of tweets. Only those tweets made in Greater Capital cities should be counted.
Rank | Author Id | Number of Unique City Locations and #Tweets |
---|---|---|
#1 | 1429984556451389440 | 8 (#1920 tweets - #1879gmel, #13acte, #11gsyd, #7gper, #6gbri, #2gade, #1gdar, #1ghob) |
#2 | 702290904460169216 | 8 (#1231 tweets - #336gsyd, #255gmel, #235gbri, #156gper, #127gade, #56acte, #45ghob, #21gdar) |
#3 | 17285408 | 8 (#1209 tweets - #1061gsyd, #60gmel, #40gbri, #23acte, #11ghob, #7gper, #4gdar, #3gade) |
#4 | 87188071 | 8 (#407 tweets - #116gsyd, #86gmel, #68gbri, #52gper, #37acte, #28gade, #15ghob, #5gdar) |
#5 | 774694926135222272 | 8 (#272 tweets - #38gmel, #37gbri, #37gsyd, #36ghob, #34acte, #34gper, #28gdar, #28gade) |
#6 | 1361519083 | 8 (#266 tweets - #193gdar, #36gmel, #18gsyd, #9gade, #6acte, #2ghob, #1gbri, #1gper) |
#7 | 502381727 | 8 (#250 tweets - #214gmel, #10acte, #8gbri, #8ghob, #4gade, #3gper, #2gsyd, #1gdar) |
#8 | 921197448885886977 | 8 (#207 tweets - #56gmel, #49gsyd, #37gbri, #28gper, #24gade, #8acte, #4ghob, #1gdar) |
#9 | 601712763 | 8 (#146 tweets - #44gsyd, #39gmel, #19gade, #14gper, #11gbri, #10acte, #8ghob, #1gdar) |
#10 | 2647302752 | 8 (#80 tweets - #32gbri, #16gmel, #13gsyd, #5ghob, #4gper, #4acte, #3gade, #3gdar) |
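The Task 3 ranking rule (most distinct cities first, ties broken by total tweets) reduces to a single sort key. The input shape `{author_id: Counter({gcc: tweets})}` is an assumed intermediate, not taken from the README:

```python
from collections import Counter

def rank_by_city_spread(author_city_counts, n=10):
    """Rank authors by number of distinct Greater Capital cities
    tweeted from, breaking ties by total tweets (both descending).

    author_city_counts: {author_id: Counter({gcc_code: tweets})}
    """
    def key(item):
        cities = item[1]
        return (len(cities), sum(cities.values()))
    return sorted(author_city_counts.items(), key=key, reverse=True)[:n]
```

Because Python sorts tuples lexicographically, the city count dominates and the tweet total only decides ties, matching the rule stated above.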
In this project, we explore Amdahl’s Law by using MPI to process a large JSON file. While parallelism can significantly enhance performance, it is essential to consider potential trade-offs in terms of CPU efficiency. As the processing-time table above indicates, distributing work across multiple cores can reduce job wall-clock time. However, the benefit might diminish when scaling across multiple nodes, due to the increased time required for MPI communication between nodes. Moreover, parallelism may not be suitable for small datasets if a single core can efficiently solve the problem in a short time. Therefore, when designing a parallel program to maximize performance using MPI, programmers need to balance the trade-off between CPU efficiency and overall performance.
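The scaling numbers above can be plugged into Amdahl's law directly. This sketch computes the observed 1-core to 8-core speedup and the parallel fraction it implies; the times come from the processing-time table, everything else is a straightforward rearrangement of the law:

```python
def speedup(t_serial, t_parallel):
    """Observed speedup from wall-clock times."""
    return t_serial / t_parallel

def amdahl_parallel_fraction(s, n):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for p,
    the fraction of the program that parallelizes."""
    return (1 - 1 / s) / (1 - 1 / n)

# Wall-clock times from the table above (hh:mm:ss -> seconds)
t1 = 11 * 60 + 1   # 661 s on 1 core  (job 46094405)
t8 = 1 * 60 + 41   # 101 s on 8 cores (job 46094406)

s = speedup(t1, t8)                  # roughly 6.5x on 8 cores
p = amdahl_parallel_fraction(s, 8)   # implied parallel fraction, ~0.97
```

A parallel fraction near 0.97 is consistent with the observation above: most of the work scales, but the serial remainder and inter-node communication cap the achievable speedup.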
For the complete Assignment 1 report, please check Overleaf.
The code will be made public after April 25th, 2023. For copyright information, please refer to the MIT License.
© 2023 Wei & Sunchuangyu