Detailed Project Description.
1 - Install Apache Spark dependencies.
2 - Install Apache Spark. (I opted for the "Standalone Cluster" mode, as it suited my needs, but feel free to check out and suggest simpler or more efficient installation modes.)
3 - If, like me, you chose the "Standalone" mode, follow the steps in the documentation.
Doc. Link:
https://spark.apache.org/docs/latest/spark-standalone.html
4 - After installation, the cluster is ready to run and perform its functions: execute the startup script "start-all.sh" and wait for the nodes to come up. (The cluster can be accessed via the Web UI or through the terminal, at your discretion.)
5 - With the cluster in full operation, submit your applications through "spark-submit" (a minimal example follows this list).
6 - To shut down the entire cluster, run "stop-all.sh".
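A minimal, hypothetical sketch of an application that could be submitted to the standalone cluster is shown below; the application name and file path are assumptions and must match your own setup:

    # app.py - hypothetical minimal PySpark job for this cluster
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("MobilityVsCases")  # hypothetical application name
             .getOrCreate())

    # Read one of the raw CSV files (path is an assumption, see the directory notes below).
    df = spark.read.csv("/Brute/Global_Mobility_Report.csv", header=True, inferSchema=True)
    df.show(5)

    spark.stop()

It would be submitted with something like "spark-submit --master spark://<master-host>:7077 app.py", 7077 being the default port of the standalone master.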
Links to the databases:
1 - Community Mobility Reports (Brazil). Source: Google. Link: https://www.google.com.br/covid19/mobility/
2 - Variation of cases (COVID-19). Source: Fiocruz. Link: https://bigdata-covid19.icict.fiocruz.br/
Period: January/2020 - December/2020.
However, the analysis can easily be extended beyond this period, since the datasets are constantly updated.
Data display and analysis: Data Studio.
Link: https://datastudio.google.com/reporting/a55071c6-62c4-4bd9-860d-08cb4a4116d8
The ETL process was carried out with Apache Spark, specifically through PySpark, together with other great Python tools (a short sketch follows).
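As an illustration of the kind of PySpark transformation involved, the sketch below filters the mobility report down to Brazilian records from 2020; the column names follow the public schema of the Google report but are assumptions here, so adjust them to the actual files:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("MobilityETL").getOrCreate()

    # Raw mobility report, landed in "/Brute" (see the notes below).
    mobility = spark.read.csv("/Brute/Global_Mobility_Report.csv",
                              header=True, inferSchema=True)

    # Keep only Brazilian records from 2020 (assumed column names).
    mobility_br = (mobility
                   .filter(F.col("country_region_code") == "BR")
                   .filter(F.year(F.to_date("date")) == 2020))

    # Stage the filtered data in "/Processing" until the remaining steps finish.
    mobility_br.write.mode("overwrite").csv("/Processing/mobility_br", header=True)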
Note:
1 - "/Brute"
Description: Location where raw data will be allocated.
2 - "/Processing"
Description: Location where the data will remain, until the end of processing.
3 - "/Final"
Description: Location where the data, already processed, will be allocated.
4 - As you can see, "Cases.csv" and "Deaths.csv" were downloaded directly into the directory where the processing takes place; since they are isolated datasets, they do not need to go through the initial filter that the first database requires (a sketch combining the three directories follows these notes).
5 - Code Formatter: Yapf.
6 - For automation, the "cron" task scheduler can be used on Linux distributions.
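Tying the three directories together, a hedged sketch of the final step is shown below; it joins the staged mobility data with "Cases.csv" and "Deaths.csv" and writes the result to "/Final". The join key ("date") and the file layout are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FinalStage").getOrCreate()

    # Intermediate mobility data previously staged in "/Processing".
    mobility = spark.read.csv("/Processing/mobility_br", header=True, inferSchema=True)

    # "Cases.csv" and "Deaths.csv" were downloaded straight into "/Processing" (note 4).
    cases = spark.read.csv("/Processing/Cases.csv", header=True, inferSchema=True)
    deaths = spark.read.csv("/Processing/Deaths.csv", header=True, inferSchema=True)

    # Join everything on the date column (assumed to exist in all three datasets).
    final = (mobility.join(cases, on="date", how="inner")
                     .join(deaths, on="date", how="inner"))

    # The processed result goes to "/Final", ready to be explored in Data Studio.
    final.write.mode("overwrite").csv("/Final/mobility_vs_cases", header=True)

A script like this could then be scheduled with "cron", by pointing a crontab entry at the corresponding "spark-submit" command.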
Hardware Settings:
1 - 2 CPU cores.
2 - 2 GB RAM.
3 - 10 GB HDD.
4 - OS: Ubuntu Server 22.04.
Note: Hyper-V was used as the hypervisor for this project, building a cluster with 2 nodes; the configuration above corresponds to a single node.
"Project for academic purposes, using Google and Fiocruz databases, to verify the relationship between the mobility of the Brazilian population, and the variation of cases and deaths".
Data providers and maintainers:
1 - Fiocruz: https://bigdata-covid19.icict.fiocruz.br/
2 - SIVEP-Gripe.
3 - eSUS-VE.
4 - Google LLC, "Google COVID-19 Community Mobility Reports": https://www.google.com/covid19/mobility/. Accessed: 12/06/2022.