
[Image: 0_report_first_overview]

Uber End-to-End Data Engineering Project on Google Cloud Platform

Dataset

TLC Trip Record Data: Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

More information about the dataset can be found here:
Website - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Data Dictionary - https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

📝 Table of Contents

  1. Project Overview
  2. Project Architecture
    2.1. Data Modeling
    2.2. Data Ingestion
    2.3. Compute Engine Building
    2.4. Data Orchestration and ETL Process
    2.5. BigQuery Data Warehouse
    2.6. Data Reporting
  3. Technologies Used
  4. Credits
  5. Contact

🔬 Project Overview

This is an end-to-end data engineering project built on Google Cloud Platform, using the NYC Uber trip dataset. First, the dataset is investigated and a new data model is created. A Cloud Storage bucket is then created as the source for the raw data. Next, a Virtual Machine is created as the compute engine on which the Mage.ai orchestration tool is permanently installed and run. Python scripts for data extraction, transformation, and loading are built in Mage.ai and executed in a consistent order: the raw data is read from the source bucket, transformed, and loaded into the BigQuery data warehouse for analysis. In BigQuery, a new analytics table is created by joining the dimension tables with the fact table to get the desired columns. Finally, a report is created in Looker Studio.

📝 Project Architecture

You can find detailed information in the diagram below:

[Image: 1_project_structure]

📤 Data Modeling

  • The Uber dataset is investigated to understand its main structure, and detailed knowledge about the data, such as column descriptions, is gathered.
  • A new data model, with a fact table and dimension tables, is created as the first step; a minimal sketch of this split follows the diagram below.

[Image: 0_data_modeling]
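As an illustration of the modeling idea, here is a minimal pandas sketch of splitting the flat trip file into one dimension table and a fact table. The file name, key names, and the choice of columns are assumptions based on the TLC data dictionary, not the project's exact code.

import pandas as pd

# Hypothetical local copy of the raw TLC trip records.
df = pd.read_csv('uber_data.csv')
df = df.drop_duplicates().reset_index(drop=True)
df['trip_id'] = df.index

# One small dimension as an example: payment type.
payment_type_dim = df[['payment_type']].drop_duplicates().reset_index(drop=True)
payment_type_dim['payment_type_id'] = payment_type_dim.index

# The fact table keeps the measures plus foreign keys into the dimensions.
fact_table = df.merge(payment_type_dim, on='payment_type')[
    ['trip_id', 'payment_type_id', 'trip_distance', 'fare_amount', 'total_amount']
]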

📤 Data Ingestion

  • A new Cloud Storage bucket is created as the source for the raw data. The raw dataset is uploaded into this bucket as a .csv file; a programmatic sketch of the upload follows the screenshot below.

[Image: 2_data_ingestion_bucket]
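The upload can also be done from Python with the google-cloud-storage client instead of the console; the project id and bucket name below are placeholders, not the ones used in this project.

from google.cloud import storage  # requires the google-cloud-storage package

client = storage.Client(project='my-gcp-project')   # hypothetical project id
bucket = client.bucket('uber-data-bucket-example')  # hypothetical bucket name

# Upload the local raw dataset as an object in the bucket.
blob = bucket.blob('uber_data.csv')
blob.upload_from_filename('uber_data.csv')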

⚙️ Compute Engine Building

  • A new Virtual Machine instance is created to serve as the compute engine that installs and runs the orchestration tool (Mage.ai).
  • The VM is created and the instance is accessed via SSH.
  • All requirements (Python 3 and mage-ai) are installed:
# Install Python and pip
sudo apt-get update
sudo apt-get install python3-distutils
sudo apt-get install python3-apt
sudo apt-get install wget
wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py

# Install Mage.ai
sudo pip3 install mage-ai

# Install Pandas and the Google Cloud client libraries
sudo pip3 install pandas
sudo pip3 install google-cloud
sudo pip3 install google-cloud-bigquery

[Image: 3_virtual_machine_building]

📥 Data Orchestration and ETL Process

  • Python blocks for data extraction, transformation, and loading are built in Mage.ai and run in a consistent order: the raw data is fetched from the source bucket, transformed, and loaded into the BigQuery data warehouse for analysis. A simplified sketch of such blocks follows the screenshot below.

[Image: 4_mage_ai_etl]
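For reference, this is a minimal sketch of what a Mage data-loader and transformer block pair can look like; the bucket URL, column derivations, and function names are illustrative assumptions, not the project's exact pipeline.

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@data_loader
def load_from_gcs(*args, **kwargs):
    # Hypothetical public object URL; replace with the real bucket path.
    url = 'https://storage.googleapis.com/uber-data-bucket-example/uber_data.csv'
    return pd.read_csv(url)


@transformer
def build_datetime_dim(df: pd.DataFrame, *args, **kwargs):
    # Derive a datetime dimension from the pickup timestamp.
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    datetime_dim = df[['tpep_pickup_datetime']].drop_duplicates().reset_index(drop=True)
    datetime_dim['pick_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
    datetime_dim['pick_weekday'] = datetime_dim['tpep_pickup_datetime'].dt.weekday
    datetime_dim['datetime_id'] = datetime_dim.index
    return datetime_dim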

📊 BigQuery Data Warehouse

  • When the load process is done, the dimension tables and one fact table are present in the BigQuery data warehouse, as expected.
  • In BigQuery, a new analytics table (tbl_analytics) is created by joining the dimension tables with the fact table to get the desired columns; a sketch of this step follows the screenshots below.

[Image: 5_bigquery_all_tables]

[Image: 6_bigquery_tbl_analytics]
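The join can be expressed as a CREATE TABLE ... AS SELECT statement and run from Python with the BigQuery client; the project, dataset, table, and column names below are illustrative assumptions standing in for the real schema.

from google.cloud import bigquery

client = bigquery.Client(project='my-gcp-project')  # hypothetical project id

# Join the fact table with two of the dimensions into an analytics table.
query = """
CREATE OR REPLACE TABLE `my-gcp-project.uber_dataset.tbl_analytics` AS
SELECT
  f.trip_id,
  d.tpep_pickup_datetime,
  p.payment_type,
  f.trip_distance,
  f.fare_amount,
  f.total_amount
FROM `my-gcp-project.uber_dataset.fact_table` AS f
JOIN `my-gcp-project.uber_dataset.datetime_dim` AS d
  ON f.datetime_id = d.datetime_id
JOIN `my-gcp-project.uber_dataset.payment_type_dim` AS p
  ON f.payment_type_id = p.payment_type_id
"""
client.query(query).result()  # wait for the job to finish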

📊 Data Reporting

  • BigQuery is connected to Looker Studio, and the views of the warehouse are used to create interactive and insightful data visualizations.

[Images: 7_report1, 8_report2]

🛠️ Technologies Used

  • Data Source: Google Cloud Storage (bucket)
  • Orchestration: Mage.ai
  • Compute Engine: Google Cloud VM instance
  • ETL Process: Mage.ai, Python
  • Data Warehousing: BigQuery, SQL
  • Storage: Google Cloud Storage (bucket)
  • Data Visualization: Looker Studio

📋 Credits

📨 Contact Me

LinkedIn · Website · Gmail