
[Image: 0_report_first_overview]

Uber End-to-End Data Engineering Project on Google Cloud Platform

Dataset

TLC Trip Record Data: Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

More information about the dataset can be found here:
Website - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Data Dictionary - https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

📝 Table of Contents

  1. Project Overview
  2. Project Architecture
    2.1. Data Modeling
    2.2. Data Ingestion
    2.3. Compute Engine Building
    2.4. Data Orchestration and ETL Process
    2.5. BigQuery Data Warehouse
    2.6. Data Reporting
  3. Technologies Used
  4. Credits
  5. Contact

🔬 Project Overview

This is an end-to-end data engineering project built on Google Cloud Platform, using the NYC Uber trip dataset. First, the dataset is investigated and a new data model is created. A Cloud Storage bucket is then created as the source for the raw data. Next, a Virtual Machine is created as the compute engine on which the Mage.ai orchestration tool is permanently installed and run. Python scripts for data extraction, transformation, and loading are built in Mage.ai and executed in a consistent order: the raw data is read from the source bucket, transformed, and loaded into the BigQuery data warehouse for analysis. In BigQuery, a new analytics table is created by joining the dimension tables with the fact table to get the desired columns. Finally, a report is created in Looker Studio.

📝 Project Architecture

You can find detailed information in the diagram below:

[Image: 1_project_structure]

📤 Data Modeling

  • The Uber dataset is investigated to understand its main structure, and detailed knowledge about the data, such as column descriptions, is gathered.
  • A new data model, with a fact table and dimension tables, is created as the first step; a minimal sketch of this split follows the diagram below.

[Image: 0_data_modeling]
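As an illustration of the modeling idea, here is a minimal pandas sketch of splitting the flat trip file into one dimension table and a fact table. The file name, key names, and the choice of columns are assumptions based on the TLC data dictionary, not the project's exact code.

import pandas as pd

# Hypothetical local copy of the raw TLC trip records.
df = pd.read_csv('uber_data.csv')
df = df.drop_duplicates().reset_index(drop=True)
df['trip_id'] = df.index

# One small dimension as an example: payment type.
payment_type_dim = df[['payment_type']].drop_duplicates().reset_index(drop=True)
payment_type_dim['payment_type_id'] = payment_type_dim.index

# The fact table keeps the measures plus foreign keys into the dimensions.
fact_table = df.merge(payment_type_dim, on='payment_type')[
    ['trip_id', 'payment_type_id', 'trip_distance', 'fare_amount', 'total_amount']
]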

📤 Data Ingestion

  • A new Cloud Storage bucket is created as the source for the raw data. The raw dataset is uploaded into this bucket as a .csv file; a programmatic sketch of the upload follows the screenshot below.

[Image: 2_data_ingestion_bucket]
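The upload can also be done from Python with the google-cloud-storage client instead of the console; the project id and bucket name below are placeholders, not the ones used in this project.

from google.cloud import storage  # requires the google-cloud-storage package

client = storage.Client(project='my-gcp-project')   # hypothetical project id
bucket = client.bucket('uber-data-bucket-example')  # hypothetical bucket name

# Upload the local raw dataset as an object in the bucket.
blob = bucket.blob('uber_data.csv')
blob.upload_from_filename('uber_data.csv')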

⚙️ Compute Engine Building

  • A new Virtual Machine instance is created to serve as the compute engine that installs and runs the orchestration tool (Mage.ai).
  • The VM is created and the instance is accessed via SSH.
  • All requirements (Python 3 and mage-ai) are installed:
# Install Python and pip
sudo apt-get update
sudo apt-get install python3-distutils
sudo apt-get install python3-apt
sudo apt-get install wget
wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py

# Install Mage.ai
sudo pip3 install mage-ai

# Install Pandas and the Google Cloud client libraries
sudo pip3 install pandas
sudo pip3 install google-cloud
sudo pip3 install google-cloud-bigquery

[Image: 3_virtual_machine_building]

📥 Data Orchestration and ETL Process

  • Python blocks for data extraction, transformation, and loading are built in Mage.ai and run in a consistent order: the raw data is fetched from the source bucket, transformed, and loaded into the BigQuery data warehouse for analysis. A simplified sketch of such blocks follows the screenshot below.

[Image: 4_mage_ai_etl]
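For reference, this is a minimal sketch of what a Mage data-loader and transformer block pair can look like; the bucket URL, column derivations, and function names are illustrative assumptions, not the project's exact pipeline.

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@data_loader
def load_from_gcs(*args, **kwargs):
    # Hypothetical public object URL; replace with the real bucket path.
    url = 'https://storage.googleapis.com/uber-data-bucket-example/uber_data.csv'
    return pd.read_csv(url)


@transformer
def build_datetime_dim(df: pd.DataFrame, *args, **kwargs):
    # Derive a datetime dimension from the pickup timestamp.
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    datetime_dim = df[['tpep_pickup_datetime']].drop_duplicates().reset_index(drop=True)
    datetime_dim['pick_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
    datetime_dim['pick_weekday'] = datetime_dim['tpep_pickup_datetime'].dt.weekday
    datetime_dim['datetime_id'] = datetime_dim.index
    return datetime_dim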

📊 BigQuery Data Warehouse

  • When the load process is done, the dimension tables and one fact table are present in the BigQuery data warehouse, as expected.
  • In BigQuery, a new analytics table (tbl_analytics) is created by joining the dimension tables with the fact table to get the desired columns; a sketch of this step follows the screenshots below.

[Image: 5_bigquery_all_tables]

[Image: 6_bigquery_tbl_analytics]
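The join can be expressed as a CREATE TABLE ... AS SELECT statement and run from Python with the BigQuery client; the project, dataset, table, and column names below are illustrative assumptions standing in for the real schema.

from google.cloud import bigquery

client = bigquery.Client(project='my-gcp-project')  # hypothetical project id

# Join the fact table with two of the dimensions into an analytics table.
query = """
CREATE OR REPLACE TABLE `my-gcp-project.uber_dataset.tbl_analytics` AS
SELECT
  f.trip_id,
  d.tpep_pickup_datetime,
  p.payment_type,
  f.trip_distance,
  f.fare_amount,
  f.total_amount
FROM `my-gcp-project.uber_dataset.fact_table` AS f
JOIN `my-gcp-project.uber_dataset.datetime_dim` AS d
  ON f.datetime_id = d.datetime_id
JOIN `my-gcp-project.uber_dataset.payment_type_dim` AS p
  ON f.payment_type_id = p.payment_type_id
"""
client.query(query).result()  # wait for the job to finish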

📊 Data Reporting

  • BigQuery is connected to Looker Studio, and the views of the warehouse are used to create interactive and insightful data visualizations.

[Images: 7_report1, 8_report2]

🛠️ Technologies Used

  • Data Source: Google Cloud Storage (bucket)
  • Orchestration: Mage.ai
  • Compute Engine: Google Cloud VM instance
  • ETL Process: Mage.ai, Python
  • Data Warehousing: BigQuery, SQL
  • Storage: Google Cloud Storage (bucket)
  • Data Visualization: Looker Studio

📋 Credits

📨 Contact Me

LinkedIn · Website · Gmail