sofiasawczenko/ETL_employee_info_pipeline

ETL Project with Data Fusion, Airflow, and BigQuery

This repository contains code and configuration files for an Extract, Transform, Load (ETL) project using Google Cloud Data Fusion for data extraction, Apache Airflow/Composer for orchestration, and Google BigQuery for data loading.


Problem Statement

You are tasked with creating a data pipeline to extract employee data from various sources, mask sensitive information within the data, and load it into BigQuery. Additionally, you are required to develop a dashboard to visualize the employee data securely.

Requirements:

  • Data Extraction: Extract employee data from multiple sources such as databases, CSV files, or APIs.
  • Data Masking: Identify and mask sensitive information within the employee data, such as social security numbers, salary details, and personal contact information.
  • Data Loading into BigQuery: Design a process to securely load extracted and masked employee data into Google BigQuery.
  • Dashboard Visualization: Develop a web-based dashboard using visualization tools (e.g., Google Data Studio, Tableau, or custom dashboards).
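
The masking requirement above can be illustrated with a minimal plain-Python sketch. This is not the repository's implementation (the project applies masking in Cloud Data Fusion); the field names `ssn`, `phone`, and `salary`, the salt value, and the salary band width are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical field names; the repository's actual schema is not shown here.
HASH_FIELDS = {"ssn"}        # replaced by a salted one-way hash
PARTIAL_FIELDS = {"phone"}   # all but the last few characters hidden
BAND_FIELDS = {"salary"}     # exact value replaced by a coarse range


def hash_value(value: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest so the raw value is never stored."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_partial(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def salary_band(salary: float, width: int = 25000) -> str:
    """Map an exact salary to an inclusive range such as '50000-74999'."""
    low = int(salary // width) * width
    return f"{low}-{low + width - 1}"


def mask_record(record: dict, salt: str = "demo-salt") -> dict:
    """Return a masked copy of `record`; the input dict is left unchanged."""
    masked = dict(record)
    for field, value in record.items():
        if field in HASH_FIELDS:
            masked[field] = hash_value(str(value), salt)
        elif field in PARTIAL_FIELDS:
            masked[field] = mask_partial(str(value))
        elif field in BAND_FIELDS:
            masked[field] = salary_band(float(value))
    return masked
```

Calling `mask_record({"name": "A", "ssn": "123-45-6789", "phone": "5551234567", "salary": 61000})` leaves `name` untouched and replaces the other three fields. In a real deployment the salt would come from a secret store rather than a default argument.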

Overview

The project aims to perform the following tasks:

  1. Data Extraction: Extract employee data using Python.
  2. Data Masking: Apply data masking and encoding techniques to sensitive information in Cloud Data Fusion before loading it into BigQuery.
  3. Data Loading: Load the transformed data into Google BigQuery tables.
  4. Orchestration: Automate the complete data pipeline using Airflow (Cloud Composer).
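
As a rough illustration of how steps 1 to 3 fit together, the sketch below reads CSV text, applies a transform to each row, and emits newline-delimited JSON, a format BigQuery load jobs accept. It uses only the standard library and does not reproduce the repository's scripts; the column names and the `redact_ssn` helper are hypothetical stand-ins for the masking done in Cloud Data Fusion.

```python
import csv
import io
import json
from typing import Callable, Iterable, Iterator


def extract_rows(csv_text: str) -> Iterator[dict]:
    """Yield one dict per CSV data row, keyed by the header row."""
    yield from csv.DictReader(io.StringIO(csv_text))


def redact_ssn(row: dict) -> dict:
    """Placeholder masking step: blank out the hypothetical `ssn` column."""
    return {**row, "ssn": "REDACTED"} if "ssn" in row else row


def to_ndjson(rows: Iterable[dict],
              transform: Callable[[dict], dict] = lambda r: r) -> str:
    """Apply `transform` to each row and serialize one JSON object per line."""
    return "\n".join(json.dumps(transform(row), sort_keys=True) for row in rows)


if __name__ == "__main__":
    # Made-up sample data, for demonstration only.
    sample = "employee_id,name,ssn\n1,Ada,123-45-6789\n2,Bob,987-65-4321\n"
    print(to_ndjson(extract_rows(sample), redact_ssn))
```

Keeping extraction, transformation, and serialization as separate functions mirrors the way an orchestrator such as Airflow would schedule them as distinct tasks.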

Tech Stack

[Image: tech stack diagram]

Architecture

[Image: architecture diagram]

Running the Pipeline

[Image: screenshot of the pipeline run]
