The project aims to de-congest the national highways by analyzing the road traffic from three different toll plazas. Each highway is operated by a different toll operator with a different IT setup that uses different file formats. My job is to collect data available in different formats and consolidate it into a single file.
In this project I will author an Apache Airflow DAG that will:
- Extract data from a CSV file.
- Extract data from a TSV file.
- Extract data from a fixed-width file.
- Transform the data.
- Load the transformed data into the staging area.

The DAG arguments are defined as follows:

Parameter | Value |
---|---|
owner | Raphael Malims |
start_date | today |
email | raphaelmalimsj@gmail.com |
email_on_failure | True |
email_on_retry | True |
retries | 1 |
retry_delay | 5 minutes |
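
As a rough sketch, the arguments in the table above could be written as a Python dictionary of the kind Airflow accepts as `default_args`. Reading "today" as midnight of the current day is an assumption, as is passing the e-mail address as a one-element list.

```python
from datetime import date, datetime, time, timedelta

# Sketch of the DAG arguments listed in the table above.
default_args = {
    "owner": "Raphael Malims",
    # Assumption: "today" means midnight at the start of the current day.
    "start_date": datetime.combine(date.today(), time.min),
    # Assumption: the address is supplied as a list of recipients.
    "email": ["raphaelmalimsj@gmail.com"],
    "email_on_failure": True,
    "email_on_retry": True,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}
```
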

The DAG itself is defined with these parameters:

Parameter | Value |
---|---|
DAG id | ETL_toll_data |
Schedule | Once daily |
default_args | default_args |
description | Highway Toll Data Using Airflow |

Create a task that downloads the traffic data from the data source (https://github.com/malimsZen/Airflow-ETL_Toll_Data/raw/main/tolldata.tgz) and stores it in the directory `~/Zen/Airflow-ETL_Toll_Data`.
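
One possible way to implement the download step in plain Python is sketched below. The function name `download_data` mirrors the task name used in the pipeline, but the body is only an illustration built on the standard library; the project itself may use a different mechanism.

```python
import urllib.request
from pathlib import Path

# URL and destination directory as given in the task description.
DATA_URL = "https://github.com/malimsZen/Airflow-ETL_Toll_Data/raw/main/tolldata.tgz"
DEST_DIR = Path.home() / "Zen" / "Airflow-ETL_Toll_Data"


def download_data(url=DATA_URL, dest_dir=DEST_DIR):
    """Download the file at `url` into `dest_dir` and return the local path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last path component of the URL.
    target = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, target)
    return target
```

A function like this could be passed as the `python_callable` of an Airflow `PythonOperator`.
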
Uncompress the downloaded data into the destination directory `~/Zen/Airflow-ETL_Toll_Data`.
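
The uncompress step can be sketched with the standard `tarfile` module, assuming the `.tgz` file is a gzip-compressed tar archive from a trusted source. The function name `unzip_data` follows the task name in the pipeline; the implementation is illustrative only.

```python
import tarfile
from pathlib import Path


def unzip_data(archive, dest_dir):
    """Extract a gzip-compressed tar archive into `dest_dir`.

    Returns the names of the archive members that were extracted.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        # Only do this for archives you trust: members are written as-is.
        tar.extractall(path=dest_dir)
    return names
```
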
This task should extract the fields `Rowid`, `Timestamp`, `Anonymized Vehicle number`, and `Vehicle type` from `vehicle-data.csv` and save them into a file named `csv_data.csv`.
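
A minimal sketch of this extraction, using the standard `csv` module. It assumes, without confirmation from the task description, that the four requested fields are the first four columns of `vehicle-data.csv` and that the file has no header row.

```python
import csv


def extract_data_from_csv(src="vehicle-data.csv", dst="csv_data.csv"):
    """Copy the first four columns of `src` into `dst`.

    Assumption: Rowid, Timestamp, Anonymized Vehicle number and Vehicle type
    are the first four columns of the source file, in that order.
    """
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row[:4])
```
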
This task should extract the fields `Number of axles`, `Tollplaza id`, and `Tollplaza code` from the `tollplaza-data.tsv` file and save them into a file named `tsv_data.csv`.
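
The TSV extraction could look like the sketch below. The default column positions (zero-based indexes 4, 5 and 6) are an assumption about the layout of `tollplaza-data.tsv` and should be checked against the actual file.

```python
import csv


def extract_data_from_tsv(src="tollplaza-data.tsv", dst="tsv_data.csv",
                          columns=(4, 5, 6)):
    """Write the selected tab-separated columns of `src` to `dst` as CSV.

    Assumption: `columns` holds the zero-based positions of Number of axles,
    Tollplaza id and Tollplaza code in the source file.
    """
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout)
        for row in reader:
            # Strip stray whitespace left over from the source formatting.
            writer.writerow([row[i].strip() for i in columns])
```
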
This task should extract the fields `Type of Payment code` and `Vehicle Code` from the fixed-width file `payment-data.txt` and save them into a file named `fixed_width_data.csv`.
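
For a fixed-width file, each field is taken from a known range of character positions on every line. The sketch below illustrates the idea; the default offsets are hypothetical placeholders, since the task description does not state where the two fields sit in `payment-data.txt`.

```python
import csv


def extract_data_from_fixed_width(src="payment-data.txt",
                                  dst="fixed_width_data.csv",
                                  payment_cols=(58, 61),
                                  vehicle_cols=(62, 67)):
    """Slice two fixed-width fields out of each line and write them as CSV.

    Assumption: `payment_cols` and `vehicle_cols` are zero-based
    (start, stop) character offsets; the defaults are illustrative and
    must be verified against the real file layout.
    """
    with open(src) as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for line in fin:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            payment = line[slice(*payment_cols)].strip()
            vehicle = line[slice(*vehicle_cols)].strip()
            writer.writerow([payment, vehicle])
```
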
This task should create a single CSV file named `extracted-data.csv` by combining data from:
- csv_data.csv
- tsv_data.csv
- fixed_width_data.csv
The final CSV file should use the fields in the order given below: `Rowid`, `Timestamp`, `Anonymized Vehicle number`, `Vehicle type`, `Number of axles`, `Tollplaza id`, `Tollplaza code`, `Type of Payment code`, and `Vehicle Code`.
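
Combining the three files amounts to joining them column-wise, row by row. The sketch below assumes the three intermediate files have the same number of rows in the same order and carry no header row; it stops at the shortest file if they differ.

```python
import csv
from contextlib import ExitStack


def consolidate_data(sources=("csv_data.csv", "tsv_data.csv",
                              "fixed_width_data.csv"),
                     dst="extracted-data.csv"):
    """Join the rows of `sources` side by side and write them to `dst`.

    Assumption: row N of every source file describes the same vehicle pass.
    """
    with ExitStack() as stack:
        readers = [
            csv.reader(stack.enter_context(open(path, newline="")))
            for path in sources
        ]
        writer = csv.writer(stack.enter_context(open(dst, "w", newline="")))
        for rows in zip(*readers):
            # Flatten the per-file rows into one combined row, keeping
            # the source order (CSV fields, then TSV, then fixed width).
            writer.writerow([field for row in rows for field in row])
```

Because the sources are listed in the order CSV, TSV, fixed width, the output columns follow the field order given above.
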
download_data >> unzip_data >> extract_data_from_csv >> extract_data_from_tsv >> extract_data_from_fixed_width >> consolidate_data >> transform_data