This ETL project is a sample of my Python work, using Apache Airflow for pipeline orchestration, Pandas for data manipulation, MySQL for storage, and Apache Superset for data visualization.
This process uses the YTS Movies API, a public API for developers, as its data source.
For more information about this API, click here.
The pipeline runs in five steps:

1. Create the databases and tables.
2. Extract the data from the API and store it in the bronze database in MySQL.
3. Refine the data and store it in the silver database in MySQL.
4. Build the star schema model and load the data into the gold database in MySQL.
5. Run data quality checks, looking for duplicated records and verifying that the tables contain data.
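As a rough sketch of the extraction step, here is one way a YTS `list_movies` JSON payload could be flattened into a Pandas DataFrame before landing in the bronze database. The sample payload shape and the commented table name are illustrative assumptions, not the project's actual code:

```python
import pandas as pd

def movies_to_dataframe(payload: dict) -> pd.DataFrame:
    """Flatten a YTS list_movies-style JSON payload into a flat DataFrame."""
    movies = payload.get("data", {}).get("movies", [])
    return pd.json_normalize(movies)

# Sample payload shaped like a YTS API response (fields are illustrative)
sample = {
    "status": "ok",
    "data": {"movies": [
        {"id": 1, "title": "Movie A", "year": 2020, "rating": 7.1},
        {"id": 2, "title": "Movie B", "year": 2021, "rating": 6.4},
    ]},
}
df = movies_to_dataframe(sample)
# df.to_sql("movies_raw", engine, if_exists="append", index=False)  # hypothetical bronze table
```

The `to_sql` call is left commented because it needs a live SQLAlchemy engine pointed at the bronze database.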
- Incremental data load
- ETL logs
- Star Schema Model
- Python (Pandas, OOP)
- SQL
- Data Quality
- Apache Airflow
- Apache Superset
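A minimal sketch of how the incremental data load can work, assuming each source record carries a monotonically increasing `id` used as a watermark (the column name is an assumption, not necessarily this project's schema):

```python
import pandas as pd

def incremental_filter(batch: pd.DataFrame, last_loaded_id: int) -> pd.DataFrame:
    """Keep only rows newer than the highest id already stored in the target table."""
    return batch[batch["id"] > last_loaded_id]

# In the real pipeline, last_loaded_id would come from a MAX(id) query on MySQL
batch = pd.DataFrame({"id": [1, 2, 3, 4], "title": ["a", "b", "c", "d"]})
new_rows = incremental_filter(batch, last_loaded_id=2)
# Only ids 3 and 4 remain to be appended to the target table
```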
- For this project, the tables are designed with a star schema, which simplifies queries and provides fast data aggregation.
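As an illustration of the star schema step, here is a sketch with made-up columns: a genre dimension gets a surrogate key, and a fact table references it. The column names are hypothetical, not the project's real schema:

```python
import pandas as pd

# Flat movie records as they might arrive from the silver layer
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "genre": ["Action", "Drama", "Action"],
    "rating": [7.0, 6.5, 8.1],
})

# Dimension table: one row per distinct genre, with a surrogate key
dim_genre = movies[["genre"]].drop_duplicates().reset_index(drop=True)
dim_genre["genre_key"] = dim_genre.index + 1

# Fact table: measures plus the foreign key to the dimension
fact_movie = movies.merge(dim_genre, on="genre")[["movie_id", "genre_key", "rating"]]
```

Queries then join the small dimension tables to the fact table, which is what makes aggregations fast.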
- For process control, two tables were created: etl_logging, which records the execution steps of each run, and etl_process, which defines the type of process.
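A sketch of what writing to the logging table could look like; the column names and the commented `to_sql` target are assumptions, since the actual etl_logging schema is not shown here:

```python
from datetime import datetime, timezone
import pandas as pd

def make_log_entry(process_id: int, step: str, status: str) -> dict:
    """Build one etl_logging record (column names are illustrative)."""
    return {
        "process_id": process_id,  # FK to etl_process, identifying the process type
        "step": step,
        "status": status,
        "logged_at": datetime.now(timezone.utc),
    }

entries = [
    make_log_entry(1, "extract_bronze", "success"),
    make_log_entry(1, "refine_silver", "success"),
]
logs_df = pd.DataFrame(entries)
# logs_df.to_sql("etl_logging", engine, if_exists="append", index=False)  # hypothetical
```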
- This is an example of a dashboard template for data visualization, built in Apache Superset.
- This is what the Airflow DAG looks like: