
Academic project using Apache Spark for ETL and Google Data Studio for data analysis.


Detailed Project Description.


1 - Install Apache Spark dependencies.

2 - Install Apache Spark. (I opted for the "Standalone Cluster" mode, as it suited my needs, but feel free to check out and suggest other, simpler or more efficient, deployment modes).

3 - If, like me, you chose Standalone mode, follow the steps in the official documentation.

Doc. Link:

https://spark.apache.org/docs/latest/spark-standalone.html

4 - After installation, the cluster is ready to run: execute the startup script "start-all.sh" and wait for the master and workers to come up. (The cluster can be accessed via the Web UI or through the terminal, at your discretion).

5 - With the cluster fully operational, submit your applications through "spark-submit" (see the sketch after this list).

6 - To shut down the entire cluster, run "stop-all.sh".
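As a minimal sketch of step 5, the snippet below shows a PySpark application connecting to a standalone master; the host name, port, and file name are placeholders, not the ones used in this repository.

```python
# etl_job.py - minimal PySpark application for a standalone cluster (illustrative only).
# A hypothetical submission would look like:
#   spark-submit --master spark://<master-host>:7077 etl_job.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("covid-etl-example")          # name shown in the Web UI
    .master("spark://master-host:7077")    # placeholder standalone master URL
    .getOrCreate()
)

# Trivial job just to confirm the cluster executes work.
df = spark.range(10)
print(df.count())

spark.stop()
```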


Links to the source databases:

1 - Community Mobility Reports (Br). Database: Google. Link: https://www.google.com.br/covid19/mobility/

2 - Variation of Cases (Covid-19). Database: Fiocruz. Link: https://bigdata-covid19.icict.fiocruz.br/

Period: January/2020 - December/2020.

However, this period can easily be extended, since both sources are constantly updated.


Data visualization and analysis: Google Data Studio.

Link: https://datastudio.google.com/reporting/a55071c6-62c4-4bd9-860d-08cb4a4116d8


The ETL process was implemented with Apache Spark, specifically through PySpark, alongside other Python tools (a minimal sketch of the flow follows the notes below).

Note:

1 - "/Brute"

Description: location where the raw data is stored.

2 - "/Processing"

Description: location where the data remains until processing is complete.

3 - "/Final"

Description: location where the processed data is stored.

4 - As you can see, "Cases.csv" and "Deaths.csv" were downloaded directly into the processing directory; since they are self-contained datasets, they do not need to pass through the initial filter required for the first database (the mobility reports).

5 - Code Formatter: Yapf.

6 - For automation, the "cron" task scheduler can be used on Linux distributions (a hypothetical entry appears in the sketch below).
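As a minimal sketch of the flow described in the notes above, the snippet below reads a raw CSV from "/Brute", applies a simple filter, stages it in "/Processing", and writes the result to "/Final". The file name, column name, and filter are illustrative assumptions, not the exact transformations used in this project; the trailing comment shows one hypothetical way to schedule the job with cron.

```python
# etl_brute_to_final.py - illustrative ETL flow: /Brute -> /Processing -> /Final.
# File and column names below are assumptions made for the sake of the example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid-etl-sketch").getOrCreate()

# Extract: read the raw mobility report from the raw-data directory.
raw = spark.read.csv("/Brute/mobility_report.csv", header=True, inferSchema=True)

# Transform: keep only Brazilian rows and stage them for further processing.
staged = raw.filter(F.col("country_region_code") == "BR")
staged.write.mode("overwrite").parquet("/Processing/mobility_staged")

# Load: write the processed result to the final directory as CSV.
final = spark.read.parquet("/Processing/mobility_staged")
final.write.mode("overwrite").csv("/Final/mobility_br", header=True)

spark.stop()

# Hypothetical cron entry (runs the job every day at 02:00):
# 0 2 * * * /opt/spark/bin/spark-submit --master spark://master-host:7077 /home/user/etl_brute_to_final.py
```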


Hardware Settings:

1 - 2 CPU Cores.

2 - 2 GB RAM.

3 - 10 GB disk.

4 - OS: Ubuntu Server 22.04.

Note: Hyper-V was used as the hypervisor for this project, building a cluster with 2 nodes; the configuration above corresponds to a single node.
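As a hedged illustration of how these node limits might be reflected in an application, the snippet below caps executor resources to fit a 2-core / 2 GB node; the actual values used in this project are not documented here, so the numbers and master URL are assumptions.

```python
# Illustrative resource settings sized for the 2-core / 2 GB nodes described above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("covid-etl-resource-example")
    .master("spark://master-host:7077")     # placeholder standalone master URL
    .config("spark.executor.cores", "2")    # one executor uses both cores of a node
    .config("spark.executor.memory", "1g")  # leave headroom for the OS and Spark daemons in 2 GB RAM
    .getOrCreate()
)
```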


"Project for academic purposes, using Google and Fiocruz databases, to verify the relationship between the mobility of the Brazilian population, and the variation of cases and deaths".

Data providers and maintainers:

https://bigdata-covid19.icict.fiocruz.br/

SIVEP-Gripe.

eSUS-VE.

Google LLC "Google COVID-19 Community Mobility Reports". https://www.google.com/covid19/mobility/ Accessed: 12/06/2022.
