Skip to content

Data processing with Pyspark on Dataproc of ✈️ airports flights details

Notifications You must be signed in to change notification settings

AJSTO/Airports_detailed_informations

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

👨‍💻 Built with

Descripction about project

ℹ️Project info

In this project I used data which was found on website:

PostgresPro - Demonstration Database

Project was based on using Pyspark on Google Dataproc to process a data and Bigquery to keep database of flights and created aggregates. Before using Pyspark there was done data analyse in jupyter notebook. The aim of the project was to catch informations about airports. So there were created aggregates for each airport. Created aggregates for each airport per day:

  • Sum of made bookings per airport per day;
  • Average departure delay per airport per day;
  • Average flight occupancy per airport per day;
  • Average flight occupancy depending on fare conditions per airport per day;
  • Number of passengers served per airport per day;
  • Number of flights per day per airport;

The obtained data were loaded into a table in google bigquery and then visualized using Looker Studio. In addition, the Pyspark application received tests that were written in pytest.

🔎 Looker Studio

Link to generated report in looker :

Airports detailed informations IMG LOOKER

🛬Database infromation:

Link to database:

demo-big-en.zip - (232 MB) — flight data for one year (DB size is about 2.5 GB).

Firstly i created locally PostgreSQL database. For this reason i used created Dockerfile which is saved in this repo. Before building image and run container you should download sql file from link above and put it in building context.

⚙️ Run PostgreSQL container locally:

Build image:

  $ docker build -t flights_db .

Run container:

  $ docker run -d -p 5432:5432 --name flights_db_container flights_db

But the aim of the project was using Pyspark on Dataproc, so I migrated database into Bigquery.

More informations about: Connecting PostgreSQL to BigQuery: 2 Easy Methods

Schema of the database:

IMG SCHEMA

🌲 Project tree

.
├── Dockerfile # docker file to create container image of PosgreSQL database
├── .gitignore
├── README.md
├── Flights_data_analyse.ipynb # jupyter notebook with analyse
├── postgresql-42.5.1.jar # jar file in case you want to work with Pyspark and PostgreSQL locally
└── app
    ├── jobs # pyspark job folder
    │   ├── __init__.py
    │   ├── airports_job.py # pyspark job
    │   └── aggregates 
    │       ├── __init__.py
    │       └── functions.py # aggregation functions module
    ├── main.py
    ├── conftest.py
    └── tests
        ├── pytest.ini
        └── test_aggregates.py # tests for pyspark application

📊Data visualisation

All analyse is located in: Flights_data_analyse.ipynb

Exemplary graphs:

X X X X

📚Created aggregates

  • Sum of made bookings per airport per day;
  • Average departure delay per airport per day;
  • Average flight occupancy per airport per day;
  • Average flight occupancy depending on fare conditions per airport per day;
  • Number of passengers served per airport per day;
  • Number of flights per day per airport;

Table of aggregates

⚙️ Run Pyspark via Dataproc

Command to run Pyspark via Cloudshell:

  $  gcloud dataproc jobs submit pyspark --jars gs://spark-lib/bigquery/spark-3.1-bigquery-0.28.0-preview.jar --cluster YOUR_CLUSTER_NAME --region REGION_NAME gs://PATH/TO/YOUR/FILE/MAIN.PY

Command to run Pyspark via Dataproc terminal:

  $  spark-submit --jars gs://spark-lib/bigquery/spark-3.1-bigquery-0.28.0-preview.jar home/PATH/TO/YOUR/FILE/MAIN.PY

⚙️ Run Pyspark locally

If you want to run Pyspark locally:

  • Clone the project
  • Download bigquery jar and put it in project directory
  • Replace jar localisation to your localisation of jar file
  • Go to the app folder in project directory: Type in CLI:
  $  spark-submit --jars path/to/file/spark-3.1-bigquery-0.28.0-preview.jar --files main.py --job airports_job

About

Data processing with Pyspark on Dataproc of ✈️ airports flights details

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published