This is a group project for COMP90024 Cluster and Cloud Computing (Semester 1, 2022), The University of Melbourne. The objective of the project is to read a big file (a 20 GB+ JSON file) containing information about tweets and to calculate the number of tweets and the languages used in each given cell of the Sydney grid. The process is divided into two main steps. The first step reads the file and extracts the location and language of each tweet in parallel. The second step classifies the locations into cells and counts the languages used. The final output is a CSV file with columns for the cell name, the total number of tweets, the number of languages used, and the top ten languages with their corresponding tweet counts.
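For illustration, one output row might look like the following (the values here are invented, purely to show the shape of the file, and are not actual results):

```
cell, total tweets, number of languages used, top 10 languages and tweet counts
A1, 12345, 30, (English-10000, Spanish-500, ...)
```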
The primary method to read a big file without crashing the computer is to load small parts of the file instead of loading all the data into memory at once. We therefore use Python's mmap (memory-mapped file) module to read one record at a time. First, the mmap constructor opens the big Twitter file and creates a memory-mapped file. A memory-mapped file is an mmap object whose addressable unit is the byte. The critical point for this project is that the mmap object is indexable, so it is easy to assign separate parts of the file to different MPI processes according to the corresponding byte indexes. The readline() method of the mmap object reads one record at a time, up to the terminating "\n", and calling it iteratively reads the records one by one. As a result, the required memory is minimal: in each MPI process, only one complete record is held in memory at a time, parsed into a dictionary for further processing, and the values of the coordinates and language are then extracted by their keys. These steps run simultaneously on the different MPI processes. Finally, all extracted data are gathered to rank 0, where the cell allocation algorithm performs its calculations on the gathered data.
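A minimal sketch of this reading scheme, assuming mpi4py and one JSON record per line; the file name and the key paths used to extract the coordinates and language ("doc", "coordinates", "metadata"/"iso_language_code") are assumptions for illustration, not necessarily the repo's actual layout:

```python
import json
import mmap

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

results = []
with open("bigTwitter.json", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # split the file into one byte range per MPI process
    chunk = mm.size() // size
    start = rank * chunk
    end = mm.size() if rank == size - 1 else start + chunk
    mm.seek(start)
    if rank != 0:
        mm.readline()  # discard the partial record at the chunk boundary
    while mm.tell() < end:
        line = mm.readline().rstrip(b",\r\n ")
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-record lines such as the array header/footer
        doc = record.get("doc", {})
        coords = (doc.get("coordinates") or {}).get("coordinates")
        lang = doc.get("metadata", {}).get("iso_language_code")
        results.append((coords, lang))

# gather every process's (coordinates, language) pairs on rank 0,
# where the cell allocation step runs
gathered = comm.gather(results, root=0)
```

A record straddling a chunk boundary is read by the process whose chunk contains its start and skipped (via the initial readline()) by the next process, so each record is parsed exactly once.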
We define a class to represent each cell in the grid. The cell class knows its coordinates and borders, and provides functions to test whether a given point lies on a valid border of the cell or within it. To cover the different situations for cells on the grid border, we define cells whose left or bottom borders are also valid. Since points located exactly on cell vertices are a complicated case, we define the valid vertex points for each cell separately. Each cell keeps its own record of the language distribution.
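As an illustrative sketch (the names and the exact border convention are ours, not necessarily the project's API), a cell that owns its right and top borders by default, with flags for grid-border cells whose left or bottom borders are also valid, might look like this:

```python
from collections import defaultdict

class Cell:
    """A grid cell that knows its borders and its language counts."""

    def __init__(self, name, xmin, ymin, xmax, ymax,
                 owns_left=False, owns_bottom=False):
        self.name = name
        self.xmin, self.ymin = xmin, ymin
        self.xmax, self.ymax = xmax, ymax
        # cells on the grid's outer border also claim their left/bottom edges
        self.owns_left = owns_left
        self.owns_bottom = owns_bottom
        self.languages = defaultdict(int)  # per-cell language distribution

    def contains(self, x, y):
        """True if (x, y) is inside the cell or on a border it owns."""
        left_ok = (self.xmin <= x) if self.owns_left else (self.xmin < x)
        bottom_ok = (self.ymin <= y) if self.owns_bottom else (self.ymin < y)
        return left_ok and bottom_ok and x <= self.xmax and y <= self.ymax

    def add_tweet(self, language):
        self.languages[language] += 1
```

Under this convention a point on a border shared by two cells is counted exactly once, by the cell whose valid border it lies on; shared vertices still need the per-cell special cases described above.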
The following are needed to run the software:
- Spartan access
  - You should have access to the Spartan project.
  - Make symbolic links to the Tweet and location files in your user directory, e.g.

    ln -s /data/projects/COMP90024/bigTwitter.json
    ln -s /data/projects/COMP90024/smallTwitter.json
    ln -s /data/projects/COMP90024/tinyTwitter.json
    ln -s /data/projects/COMP90024/sydGrid.json
- Clone the repo

  git clone https://github.com/MelodyJIN-Y/Tweet-language-distribution
- case 1: 1 node and 1 core

  sbatch 1node1core.slurm

- case 2: 1 node and 8 cores

  sbatch 1node8core.slurm

- case 3: 2 nodes and 8 cores (4 cores per node)

  sbatch 2node8core.slurm
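Each case submits one of the scripts in the slurm folder. For reference, a minimal sketch of what a script like 1node8core.slurm might contain (the job name, time limit, and module names are assumptions about Spartan, not the repo's actual script):

```bash
#!/bin/bash
#SBATCH --job-name=tweet-lang
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=00:30:00

# load an MPI-enabled Python toolchain (exact module names vary on Spartan)
module load foss/2021b
module load python/3.9.6

# launch main.py with 8 MPI processes
srun -n 8 python3 main.py
```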
src folder: the Python source code

- main.py: the main multi-processing functions
- Utility.py: a class that counts the language distribution based on the defined cell rules
- plot_result.py: a helper script for comparing performance across the different computing resources
slurm folder: slurm files to specify computing resources

- 1 node and 1 core: 1node1core.slurm
- 1 node and 8 cores: 1node8core.slurm
- 2 nodes and 8 cores (4 cores per node): 2node8core.slurm
output folder: all the collected results

- Resource usage output:
  - 1node1core.out
  - 1node8core.out
  - 2node8core.out
- Language distribution of the bigTwitter.json file:
  - bigTwitter_result.csv
Distributed under the GNU License. See LICENSE for more information.