In this project I used data from the following website:
PostgresPro - Demonstration Database
The project is based on using PySpark on Google Dataproc to process the data and BigQuery to store the flights database and the created aggregates. Before the PySpark part, a data analysis was done in a Jupyter notebook. The aim of the project was to gather information about airports, so the following aggregates were created for each airport per day (see the PySpark sketch after this list):
- Total number of bookings made per airport per day;
- Average departure delay per airport per day;
- Average flight occupancy per airport per day;
- Average flight occupancy by fare conditions per airport per day;
- Number of passengers served per airport per day;
- Number of flights per airport per day.
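A minimal sketch of how such aggregates can be computed with PySpark is shown below. The project/dataset names are placeholders, and the column names (`scheduled_departure`, `actual_departure`, `departure_airport`) follow the PostgresPro demo schema; they are assumptions here, not necessarily the exact ones used in the job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("airports_aggregates").getOrCreate()

# Hypothetical read of the flights table from BigQuery; project and dataset names are placeholders
flights = (
    spark.read.format("bigquery")
    .option("table", "your_project.flights_dataset.flights")
    .load()
)

# Average departure delay and number of flights per airport per day
aggregates_df = (
    flights
    .withColumn("flight_date", F.to_date("scheduled_departure"))
    .withColumn(
        "departure_delay_min",
        (F.col("actual_departure").cast("long") - F.col("scheduled_departure").cast("long")) / 60,
    )
    .groupBy("departure_airport", "flight_date")
    .agg(
        F.avg("departure_delay_min").alias("avg_departure_delay_min"),
        F.count("*").alias("flights_count"),
    )
)
```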
The resulting data were loaded into a BigQuery table and then visualized with Looker Studio. In addition, the PySpark application is covered by tests written with pytest.
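For illustration, writing such an aggregates DataFrame to BigQuery with the spark-bigquery connector looks roughly like this; the target table and the temporary GCS staging bucket are placeholders, not the actual names used in the project.

```python
# aggregates_df is an aggregated DataFrame, e.g. from the sketch above;
# the target table and staging bucket below are placeholders
(
    aggregates_df.write
    .format("bigquery")
    .option("table", "your_project.flights_dataset.airport_daily_aggregates")
    .option("temporaryGcsBucket", "your-staging-bucket")
    .mode("overwrite")
    .save()
)
```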
Link to the generated report in Looker Studio:
Airports detailed information
Link to database:
demo-big-en.zip - (232 MB) — flight data for one year (DB size is about 2.5 GB).
First, I created a PostgreSQL database locally. For this I used the Dockerfile saved in this repo. Before building the image and running the container, download the SQL file from the link above and put it in the build context.
Build image:
$ docker build -t flights_db .
Run container:
$ docker run -d -p 5432:5432 --name flights_db_container flights_db
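If you want to query the local database directly from PySpark (using the postgresql-42.5.1.jar listed in the project structure below), a connection sketch could look like the following; the database name, user and password are assumptions and should match whatever the Dockerfile configures.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("flights_local")
    .config("spark.jars", "postgresql-42.5.1.jar")
    .getOrCreate()
)

# Database name and credentials are assumptions; adjust them to your container setup
bookings = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/demo")
    .option("dbtable", "bookings")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .load()
)

bookings.show(5)
```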
But since the aim of the project was to use PySpark on Dataproc, I migrated the database to BigQuery.
More information: Connecting PostgreSQL to BigQuery: 2 Easy Methods
.
├── Dockerfile # Dockerfile to create a container image of the PostgreSQL database
├── .gitignore
├── README.md
├── Flights_data_analyse.ipynb # Jupyter notebook with the data analysis
├── postgresql-42.5.1.jar # JAR file in case you want to work with PySpark and PostgreSQL locally
└── app
├── jobs # pyspark job folder
│ ├── __init__.py
│ ├── airports_job.py # pyspark job
│ └── aggregates
│ ├── __init__.py
│ └── functions.py # aggregation functions module
├── main.py
├── conftest.py
└── tests
├── pytest.ini
└── test_aggregates.py # tests for pyspark application
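The tests in tests/test_aggregates.py could follow a pattern like the one below: a local SparkSession fixture plus assertions on a tiny hand-made DataFrame. In the repo such a fixture would live in conftest.py, and the real tests call the functions from app/jobs/aggregates/functions.py; the column names and the inlined aggregation here are assumptions for illustration only.

```python
import pytest
from pyspark.sql import SparkSession, functions as F

# In the repo a fixture like this would live in conftest.py
@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_bookings_sum_per_airport_per_day(spark):
    # Tiny hand-made input; column names are assumptions for illustration
    data = [
        ("SVO", "2017-08-01", 100.0),
        ("SVO", "2017-08-01", 50.0),
        ("LED", "2017-08-01", 30.0),
    ]
    df = spark.createDataFrame(data, ["departure_airport", "flight_date", "total_amount"])

    result = (
        df.groupBy("departure_airport", "flight_date")
        .agg(F.sum("total_amount").alias("bookings_sum"))
    )

    svo = result.filter(F.col("departure_airport") == "SVO").collect()[0]
    assert svo["bookings_sum"] == 150.0
```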
All of the analysis is located in: Flights_data_analyse.ipynb
- Total number of bookings made per airport per day;
- Average departure delay per airport per day;
- Average flight occupancy per airport per day;
- Average flight occupancy by fare conditions per airport per day;
- Number of passengers served per airport per day;
- Number of flights per airport per day.
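As an illustration, one of the aggregates listed above could be expressed as a reusable function in the spirit of app/jobs/aggregates/functions.py; the seats_count and occupied_seats columns are assumptions, not necessarily the names used in the actual module.

```python
from pyspark.sql import DataFrame, functions as F

def avg_occupancy_per_fare_conditions(flights_df: DataFrame) -> DataFrame:
    """Average flight occupancy per airport, day and fare conditions (assumed column names)."""
    return (
        flights_df
        .withColumn("occupancy", F.col("occupied_seats") / F.col("seats_count"))
        .groupBy("departure_airport", "flight_date", "fare_conditions")
        .agg(F.avg("occupancy").alias("avg_occupancy"))
    )
```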
- Clone the project
More information: Submit a job via Dataproc
Command to run the PySpark job via Cloud Shell:
$ gcloud dataproc jobs submit pyspark --jars gs://spark-lib/bigquery/spark-3.1-bigquery-0.28.0-preview.jar --cluster YOUR_CLUSTER_NAME --region REGION_NAME gs://PATH/TO/YOUR/FILE/MAIN.PY
Command to run the PySpark job from the Dataproc cluster terminal:
$ spark-submit --jars gs://spark-lib/bigquery/spark-3.1-bigquery-0.28.0-preview.jar /home/PATH/TO/YOUR/FILE/MAIN.PY
If you want to run PySpark locally:
- Clone the project
- Download the BigQuery connector JAR and put it in the project directory
- Replace the JAR path in the command with the location of your JAR file
- Go to the app folder in the project directory and type in the CLI:
$ spark-submit --jars path/to/file/spark-3.1-bigquery-0.28.0-preview.jar main.py --job airports_job
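For context, a main.py that accepts a --job flag typically dispatches to the requested job module roughly like this; it is only a sketch of the pattern, and the run(spark) entry point is an assumption about the job module's interface, which may differ in the actual repo.

```python
import argparse
import importlib

from pyspark.sql import SparkSession

def main():
    # Parse the --job flag used in the spark-submit command above
    parser = argparse.ArgumentParser()
    parser.add_argument("--job", required=True, help="Name of the job module to run, e.g. airports_job")
    args = parser.parse_args()

    spark = SparkSession.builder.appName(args.job).getOrCreate()

    # Import app/jobs/<job>.py and hand it the SparkSession;
    # run(spark) is an assumed entry point, not confirmed by the repo
    job_module = importlib.import_module(f"jobs.{args.job}")
    job_module.run(spark)

if __name__ == "__main__":
    main()
```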