Detailed Project Description.
1 - Install Apache Spark dependencies.
2 - Install Apache Spark. (I opted for the "Standalone Cluster" mode, as it suited my needs, but feel free to check out and suggest simpler or more efficient installation modes.)
3 - If, like me, you chose the "Standalone" mode, follow the steps in the documentation.
Doc. Link:
https://spark.apache.org/docs/latest/spark-standalone.html
4 - After installation, the cluster is ready to run and perform its functions: execute the startup script "start-all.sh" and wait for the nodes to come up. (The cluster can be accessed via the Web UI or through the terminal, at your discretion.)
5 - With the cluster in full operation, submit your applications through "spark-submit" (a minimal example follows this list).
6 - To shut down the entire cluster, run "stop-all.sh".
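A minimal, hypothetical sketch of an application that could be submitted to the standalone cluster is shown below; the application name and file path are assumptions and must match your own setup:

    # app.py - hypothetical minimal PySpark job for this cluster
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("MobilityVsCases")  # hypothetical application name
             .getOrCreate())

    # Read one of the raw CSV files (path is an assumption, see the directory notes below).
    df = spark.read.csv("/Brute/Global_Mobility_Report.csv", header=True, inferSchema=True)
    df.show(5)

    spark.stop()

It would be submitted with something like "spark-submit --master spark://<master-host>:7077 app.py", 7077 being the default port of the standalone master.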
Links to the databases:
1 - Community Mobility Reports (Brazil). Source: Google. Link: https://www.google.com.br/covid19/mobility/
2 - Variation of cases (COVID-19). Source: Fiocruz. Link: https://bigdata-covid19.icict.fiocruz.br/
Period: January/2020 - December/2020.
However, the analysis can easily be extended beyond this period, since the datasets are constantly updated.
Data display and analysis: Data Studio.
Link: https://datastudio.google.com/reporting/a55071c6-62c4-4bd9-860d-08cb4a4116d8
The ETL process was carried out with Apache Spark, specifically through PySpark, together with other great Python tools (a short sketch follows).
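As an illustration of the kind of PySpark transformation involved, the sketch below filters the mobility report down to Brazilian records from 2020; the column names follow the public schema of the Google report but are assumptions here, so adjust them to the actual files:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("MobilityETL").getOrCreate()

    # Raw mobility report, landed in "/Brute" (see the notes below).
    mobility = spark.read.csv("/Brute/Global_Mobility_Report.csv",
                              header=True, inferSchema=True)

    # Keep only Brazilian records from 2020 (assumed column names).
    mobility_br = (mobility
                   .filter(F.col("country_region_code") == "BR")
                   .filter(F.year(F.to_date("date")) == 2020))

    # Stage the filtered data in "/Processing" until the remaining steps finish.
    mobility_br.write.mode("overwrite").csv("/Processing/mobility_br", header=True)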
Note:
1 - "/Brute"
Description: Location where raw data will be allocated.
2 - "/Processing"
Description: Location where the data will remain, until the end of processing.
3 - "/Final"
Description: Location where the data, already processed, will be allocated.
4 - As you can see, "Cases.csv" and "Deaths.csv" were downloaded directly into the directory where the processing takes place; since they are isolated datasets, they do not need to go through the initial filter that the first database requires (a sketch combining the three directories follows these notes).
5 - Code Formatter: Yapf.
6 - For automation, the "cron" task scheduler can be used on Linux distributions.
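Tying the three directories together, a hedged sketch of the final step is shown below; it joins the staged mobility data with "Cases.csv" and "Deaths.csv" and writes the result to "/Final". The join key ("date") and the file layout are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FinalStage").getOrCreate()

    # Intermediate mobility data previously staged in "/Processing".
    mobility = spark.read.csv("/Processing/mobility_br", header=True, inferSchema=True)

    # "Cases.csv" and "Deaths.csv" were downloaded straight into "/Processing" (note 4).
    cases = spark.read.csv("/Processing/Cases.csv", header=True, inferSchema=True)
    deaths = spark.read.csv("/Processing/Deaths.csv", header=True, inferSchema=True)

    # Join everything on the date column (assumed to exist in all three datasets).
    final = (mobility.join(cases, on="date", how="inner")
                     .join(deaths, on="date", how="inner"))

    # The processed result goes to "/Final", ready to be explored in Data Studio.
    final.write.mode("overwrite").csv("/Final/mobility_vs_cases", header=True)

A script like this could then be scheduled with "cron", by pointing a crontab entry at the corresponding "spark-submit" command.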
Hardware Settings:
1 - 2 CPU cores.
2 - 2 GB RAM.
3 - 10 GB HDD.
4 - OS: Ubuntu Server 22.04.
Note: Hyper-V was used as the hypervisor for this project, building a cluster with 2 nodes; the configuration above corresponds to a single node.
"Project for academic purposes, using Google and Fiocruz databases, to verify the relationship between the mobility of the Brazilian population, and the variation of cases and deaths".
Data providers and maintainers:
1 - Fiocruz: https://bigdata-covid19.icict.fiocruz.br/
2 - SIVEP-Gripe.
3 - eSUS-VE.
4 - Google LLC, "Google COVID-19 Community Mobility Reports": https://www.google.com/covid19/mobility/. Accessed: 12/06/2022.