In this project I used data from the following website:
PostgresPro - Demonstration Database
The project is based on using PySpark on Google Dataproc to process the data and BigQuery to store the flights database and the created aggregates. Before the PySpark part, a data analysis was done in a Jupyter notebook. The aim of the project was to gather information about airports, so the following aggregates were created for each airport per day (see the PySpark sketch after this list):
- Total number of bookings made per airport per day;
- Average departure delay per airport per day;
- Average flight occupancy per airport per day;
- Average flight occupancy by fare conditions per airport per day;
- Number of passengers served per airport per day;
- Number of flights per airport per day.
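A minimal sketch of how such aggregates can be computed with PySpark is shown below. The project/dataset names are placeholders, and the column names (`scheduled_departure`, `actual_departure`, `departure_airport`) follow the PostgresPro demo schema; they are assumptions here, not necessarily the exact ones used in the job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("airports_aggregates").getOrCreate()

# Hypothetical read of the flights table from BigQuery; project and dataset names are placeholders
flights = (
    spark.read.format("bigquery")
    .option("table", "your_project.flights_dataset.flights")
    .load()
)

# Average departure delay and number of flights per airport per day
aggregates_df = (
    flights
    .withColumn("flight_date", F.to_date("scheduled_departure"))
    .withColumn(
        "departure_delay_min",
        (F.col("actual_departure").cast("long") - F.col("scheduled_departure").cast("long")) / 60,
    )
    .groupBy("departure_airport", "flight_date")
    .agg(
        F.avg("departure_delay_min").alias("avg_departure_delay_min"),
        F.count("*").alias("flights_count"),
    )
)
```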
The resulting data were loaded into a BigQuery table and then visualized with Looker Studio. In addition, the PySpark application is covered by tests written with pytest.
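For illustration, writing such an aggregates DataFrame to BigQuery with the spark-bigquery connector looks roughly like this; the target table and the temporary GCS staging bucket are placeholders, not the actual names used in the project.

```python
# aggregates_df is an aggregated DataFrame, e.g. from the sketch above;
# the target table and staging bucket below are placeholders
(
    aggregates_df.write
    .format("bigquery")
    .option("table", "your_project.flights_dataset.airport_daily_aggregates")
    .option("temporaryGcsBucket", "your-staging-bucket")
    .mode("overwrite")
    .save()
)
```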
Link to the generated report in Looker Studio:
Airports detailed information
Link to database:
demo-big-en.zip - (232 MB) — flight data for one year (DB size is about 2.5 GB).
First, I created a PostgreSQL database locally. For this I used the Dockerfile saved in this repo. Before building the image and running the container, download the SQL file from the link above and put it in the build context.
Build image:
$ docker build -t flights_db .
Run container:
$ docker run -d -p 5432:5432 --name flights_db_container flights_db
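If you want to query the local database directly from PySpark (using the postgresql-42.5.1.jar listed in the project structure below), a connection sketch could look like the following; the database name, user and password are assumptions and should match whatever the Dockerfile configures.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("flights_local")
    .config("spark.jars", "postgresql-42.5.1.jar")
    .getOrCreate()
)

# Database name and credentials are assumptions; adjust them to your container setup
bookings = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/demo")
    .option("dbtable", "bookings")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .load()
)

bookings.show(5)
```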
But since the aim of the project was to use PySpark on Dataproc, I migrated the database to BigQuery.
More information: Connecting PostgreSQL to BigQuery: 2 Easy Methods
.
├── Dockerfile # Dockerfile to create a container image of the PostgreSQL database
├── .gitignore
├── README.md
├── Flights_data_analyse.ipynb # Jupyter notebook with the data analysis
├── postgresql-42.5.1.jar # JAR file in case you want to work with PySpark and PostgreSQL locally
└── app
├── jobs # pyspark job folder
│ ├── __init__.py
│ ├── airports_job.py # pyspark job
│ └── aggregates
│ ├── __init__.py
│ └── functions.py # aggregation functions module
├── main.py
├── conftest.py
└── tests
├── pytest.ini
└── test_aggregates.py # tests for pyspark application
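The tests in tests/test_aggregates.py could follow a pattern like the one below: a local SparkSession fixture plus assertions on a tiny hand-made DataFrame. In the repo such a fixture would live in conftest.py, and the real tests call the functions from app/jobs/aggregates/functions.py; the column names and the inlined aggregation here are assumptions for illustration only.

```python
import pytest
from pyspark.sql import SparkSession, functions as F

# In the repo a fixture like this would live in conftest.py
@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_bookings_sum_per_airport_per_day(spark):
    # Tiny hand-made input; column names are assumptions for illustration
    data = [
        ("SVO", "2017-08-01", 100.0),
        ("SVO", "2017-08-01", 50.0),
        ("LED", "2017-08-01", 30.0),
    ]
    df = spark.createDataFrame(data, ["departure_airport", "flight_date", "total_amount"])

    result = (
        df.groupBy("departure_airport", "flight_date")
        .agg(F.sum("total_amount").alias("bookings_sum"))
    )

    svo = result.filter(F.col("departure_airport") == "SVO").collect()[0]
    assert svo["bookings_sum"] == 150.0
```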
All of the analysis is located in: Flights_data_analyse.ipynb
- Total number of bookings made per airport per day;
- Average departure delay per airport per day;
- Average flight occupancy per airport per day;
- Average flight occupancy by fare conditions per airport per day;
- Number of passengers served per airport per day;
- Number of flights per airport per day.
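As an illustration, one of the aggregates listed above could be expressed as a reusable function in the spirit of app/jobs/aggregates/functions.py; the seats_count and occupied_seats columns are assumptions, not necessarily the names used in the actual module.

```python
from pyspark.sql import DataFrame, functions as F

def avg_occupancy_per_fare_conditions(flights_df: DataFrame) -> DataFrame:
    """Average flight occupancy per airport, day and fare conditions (assumed column names)."""
    return (
        flights_df
        .withColumn("occupancy", F.col("occupied_seats") / F.col("seats_count"))
        .groupBy("departure_airport", "flight_date", "fare_conditions")
        .agg(F.avg("occupancy").alias("avg_occupancy"))
    )
```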
- Clone the project
More information: Submit a job via Dataproc
Command to run the PySpark job via Cloud Shell:
$ gcloud dataproc jobs submit pyspark --jars gs://spark-lib/bigquery/spark-3.1-bigquery-0.28.0-preview.jar --cluster YOUR_CLUSTER_NAME --region REGION_NAME gs://PATH/TO/YOUR/FILE/MAIN.PY
Command to run the PySpark job from the Dataproc cluster terminal:
$ spark-submit --jars gs://spark-lib/bigquery/spark-3.1-bigquery-0.28.0-preview.jar /home/PATH/TO/YOUR/FILE/MAIN.PY
If you want to run PySpark locally:
- Clone the project
- Download the BigQuery connector JAR and put it in the project directory
- Replace the JAR path in the command with the location of your JAR file
- Go to the app folder in the project directory and type in the CLI:
$ spark-submit --jars path/to/file/spark-3.1-bigquery-0.28.0-preview.jar main.py --job airports_job
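For context, a main.py that accepts a --job flag typically dispatches to the requested job module roughly like this; it is only a sketch of the pattern, and the run(spark) entry point is an assumption about the job module's interface, which may differ in the actual repo.

```python
import argparse
import importlib

from pyspark.sql import SparkSession

def main():
    # Parse the --job flag used in the spark-submit command above
    parser = argparse.ArgumentParser()
    parser.add_argument("--job", required=True, help="Name of the job module to run, e.g. airports_job")
    args = parser.parse_args()

    spark = SparkSession.builder.appName(args.job).getOrCreate()

    # Import app/jobs/<job>.py and hand it the SparkSession;
    # run(spark) is an assumed entry point, not confirmed by the repo
    job_module = importlib.import_module(f"jobs.{args.job}")
    job_module.run(spark)

if __name__ == "__main__":
    main()
```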