This project has been developed as part of the Data Engineering Zoomcamp course provided by DataTalks.Club. The data used has been extracted from the Google Air Quality API.
Below you can find the project description to understand the content, as well as setup instructions. Feel free to ⭐ and clone this repo 😉
- Data Analysis & Exploration: SQL/Python
- Cloud: Google Cloud Platform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Infrastructure as Code (IaC): Terraform
- Data ingestion (Batch/Workflow Orchestration): Mage
- Data Transformation: dbt
- Data Visualization: Looker Studio
- CI/CD: dbt
The project has been structured with the following folders and files:
- `mage`: Workflow orchestration pipeline
- `dbt`: Data transformation and CI/CD pipeline
- `looker`: Report from Looker Studio
- `terraform`: IaC stream-based pipeline infrastructure in GCP using Terraform
- `requirements.txt`: Project requirements
- `images`: Printouts of results
The dataset was obtained from the Google Air Quality API and contains various columns with air quality data for a selected list of Swiss cities. To prepare the data, the following preprocessing steps were performed using Mage to get a clean dataset (a short sketch of these steps follows the list). This Medium article was used as a reference to understand the API and extract the data. A big shout-out to Robert Martin-Short for the tutorial.
- Extract the relevant pollutants and air quality index (AQI) from the API
- Create the columns with the selected cities, latitude, and longitude
- Remove rows with NaN values
- Remove duplicates
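A minimal sketch of these cleaning steps, assuming the raw API responses have already been collected into a pandas DataFrame (the column names and city coordinates below are illustrative, not the exact ones used in the Mage pipeline):

```python
import pandas as pd

# Illustrative city list and coordinates (assumed, not the project's exact selection)
CITIES = {
    "Zurich": (47.3769, 8.5417),
    "Geneva": (46.2044, 6.1432),
    "Bern": (46.9480, 7.4474),
}

def clean_air_quality(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the preprocessing steps described above to the raw API extract."""
    df = raw.copy()

    # Keep only the relevant pollutants and the air quality index (AQI)
    df = df[["city", "datetime", "aqi", "pm25", "pm10", "no2", "o3", "co"]]

    # Add latitude/longitude columns for the selected cities
    df["latitude"] = df["city"].map(lambda c: CITIES[c][0])
    df["longitude"] = df["city"].map(lambda c: CITIES[c][1])

    # Remove rows with NaN values and drop duplicates
    return df.dropna().drop_duplicates()
```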
Afterward, the final clean data is ingested into a GCS Bucket and BigQuery. Finally, transformations are performed using dbt (see the dbt folder) to get the production-ready data for dashboarding in Looker.
The following picture shows the two pipelines used to send the data to the Google Cloud Storage bucket. The data can be sent either directly to the bucket or to a partitioned folder inside the bucket with a year/month/day structure. The latter approach is used so that the file can be updated on a daily basis while the data from previous days is kept. Finally, the data is sent from the bucket to BigQuery.
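As an illustration of the partitioned approach, here is a minimal sketch using the `google-cloud-storage` and `google-cloud-bigquery` clients (the bucket and table names are assumptions, not the project's actual resources):

```python
from datetime import date
from google.cloud import bigquery, storage

BUCKET = "de-ch-weather-bucket"       # assumed bucket name
TABLE = "my-project.air_quality.raw"  # assumed BigQuery table

def upload_partitioned(local_file: str) -> str:
    """Upload today's extract to a year/month/day folder inside the bucket."""
    today = date.today()
    blob_path = f"{today:%Y}/{today:%m}/{today:%d}/air_quality.parquet"
    storage.Client().bucket(BUCKET).blob(blob_path).upload_from_filename(local_file)
    return blob_path

def load_to_bigquery(blob_path: str) -> None:
    """Load the partitioned file from the bucket into BigQuery."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition="WRITE_APPEND",
    )
    uri = f"gs://{BUCKET}/{blob_path}"
    client.load_table_from_uri(uri, TABLE, job_config=job_config).result()
```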
Once the data is in BigQuery, a complete transformation step is performed using dbt to produce the final clean dataset, again in BigQuery. Four datasets are generated with dbt: two staging and two production, one of each containing the air quality data for all cities and one for a single city, which can be changed in dbt accordingly. The dataset called `prod_all_cities` is the one used for the dashboard visualization in Looker. For the deployment in GitHub, a CI/CD check was run in dbt with the command `dbt build --select state:modified+`.
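Once the production models are built, the `prod_all_cities` output can be inspected straight from BigQuery, for example to sanity-check what Looker will display (apart from `prod_all_cities`, the project, table, and column names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative query against the dbt production dataset feeding the Looker dashboard
query = """
    SELECT city, date, aqi
    FROM `your-gcp-project.prod_all_cities.air_quality`
    ORDER BY date DESC
    LIMIT 100
"""
df = client.query(query).to_dataframe()
print(df.head())
```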
Finally, to streamline the development process, a fully automated Build and CI/CD pipeline was created using dbt:
The Python version used for this project is Python 3.10.
- Clone the repo (or download it as a zip): `git clone https://github.com/benitomartin/de-ch-weather.git`
- Create the virtual environment named `main-env` using Conda with Python version 3.10: `conda create -n main-env python=3.10` and `conda activate main-env`
- Install the project dependencies from `requirements.txt`: `pip install -r requirements.txt`
- Install Terraform: `conda install -c conda-forge terraform`
Each project folder contains a README.md file with instructions on how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a GCP account, credentials, and proper IAM roles are necessary for the scripts to function correctly. The following IAM roles have been used for this project:
- BigQuery Data Editor
- BigQuery Job User
- BigQuery User
- BigQuery Admin
- Storage Admin
- Compute Admin
To run the project, first go to the terraform folder to create the infrastructure, then to mage to extract the data and store it in the GCS Bucket and BigQuery, and finally to dbt to transform the data.
The following criteria for the evaluation have been fulfilled:
- ✅ Problem description: The project is well described and it's clear and understandable
- ✅ Cloud: The project is developed on the cloud (Google) and IaC tools (Terraform) are used for provisioning the infrastructure
- ✅ Data Ingestion: Fully deployed workflow orchestration using Mage
- ✅ Data warehouse: Tables are created in BigQuery
- ✅ Transformations: Tables are transformed using dbt
- ✅ Dashboard: 3 Pages with 6 visualizations using Looker
- ✅ Reproducibility: Instructions are clear, it's easy to run the code, and it works.