Divvy is Chicagoland's bike share system, offering convenient, affordable transportation across Chicago and Evanston. With sturdy bikes and docking stations throughout the region, users can unlock bikes from one station and return them to any other. Available 24/7, Divvy is perfect for exploring, commuting, running errands, and more.
Dataset source - LINK
- Which stations are the most popular and heavily used, and how does this vary by rider type (casual vs. member)?
- How do usage patterns vary across different times of day, days of the week, and months of the year?
- How does the geographic distribution of stations align with areas of high demand or potential expansion opportunities?
- What factors (e.g., weather, events, holidays) influence the fluctuations in ridership over time?
- How can the system be optimized to ensure bikes are available at high-demand stations during peak hours?
- Using data from 2022-2023; each year's dataset is almost 1 GB
- As I don't own a Power BI business account, there are some limitations with the Desktop version, including data cleaning, as well as GCP restrictions
- Google Cloud Platform (GCP): BigQuery & GCS - Data Warehouse
- Terraform - Infrastructure as Code (IaC)
- Prefect - Data Pipeline
- GCP BigQuery - Data Warehouse
- dbt - data build tool
- Power BI - Data Visualisation
- Conclusion
- Sign up on GCP platform https://cloud.google.com/
- Install Google SDK https://cloud.google.com/sdk/docs/install
- Authenticate the SDK with your GCP account with `gcloud auth login`, then set the default project with `gcloud config set project PROJECT_ID`
- Enable the required APIs in the API Library - Compute Engine, Storage Admin, Dataproc, BigQuery
- Create an API key under Service Accounts (IAM); this key will be used by Terraform, Prefect and dbt (see the sketch below)
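For reference, the service account and JSON key can also be created from the command line. This is only a rough sketch: the account name `divvy-pipeline` and the roles are illustrative, and the key path matches the `profiles.yml` shown later.

```bash
# Create a service account for the project (name is illustrative)
gcloud iam service-accounts create divvy-pipeline \
    --display-name="Divvy pipeline service account"

# Grant it the roles the pipeline needs (adjust to your own setup)
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:divvy-pipeline@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:divvy-pipeline@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

# Download the JSON key that Terraform, Prefect and dbt will use
gcloud iam service-accounts keys create ./lib/gcp-project.json \
    --iam-account=divvy-pipeline@PROJECT_ID.iam.gserviceaccount.com
```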
- Set up the installation - LINK
- Create `main.tf` using the information from your GCP account, e.g. dataset, storage bucket, credentials and instance, then run `terraform init` (a minimal sketch follows below)
- After that, run `terraform plan` and `terraform apply`
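A minimal `main.tf` along these lines could provision the bucket and dataset used later in the pipeline. This is only a sketch: the region is an assumption, and the names are taken from the other snippets in this project.

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project     = "data-engineer-projects-2024"
  region      = "us-central1"                 # assumed region
  credentials = file("./lib/gcp-project.json")
}

# GCS bucket that will hold the raw Divvy CSV files
resource "google_storage_bucket" "divvy_bucket" {
  name          = "davvy_bikes_project"
  location      = "US"
  force_destroy = true
}

# BigQuery dataset the pipeline loads into
resource "google_bigquery_dataset" "divvy_dataset" {
  dataset_id = "divvy_data"
  location   = "US"
}
```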
- Install Prefect with `pip install -U prefect`. You can either run the code locally from source:
$ git clone https://github.com/PrefectHQ/prefect.git
$ cd prefect
$ pip install -e ".[dev]"
$ pre-commit install
or log in to their platform directly with `prefect cloud login`
- First-timers at Prefect must create an account; you can ONLY work with the ID number they assign you
- Run `prefect init`
- Create a virtual environment with `python3.10 -m venv "env"`, then activate it with `source env/bin/activate`
- Create `profiles.yml`:
divvy_bike_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: ./lib/gcp-project.json
      project: data-engineer-projects-2024
      dataset: divvy_bike
      threads: 1
      timeout_seconds: 300
- Create `prefect_extract.py`
- First process: loading the data - 12 CSV files for each of 2022 and 2023
from google.cloud import storage
from prefect import task, flow
import os
import pandas as pd
import glob
bucket_name = 'davvy_bikes_project'
## Loading, Read & Combine Data ##
@task(name="upload_to_gcs")
def upload_to_gcs(file_path, bucket_name):
    # Upload a local CSV file to the GCS bucket
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(os.path.basename(file_path))
    blob.upload_from_filename(file_path)
    print(f"File {file_path} uploaded to GCS bucket {bucket_name}")
- Second process: extracting the data to BigQuery
## Extracting Process to GCP ##
@flow(name="divvy_data_pipeline")
def divvy_data_pipeline():
    # Upload the raw files to Cloud Storage first
    upload_to_gcs("divvy_2022.csv", bucket_name)
    upload_to_gcs("divvy_2023.csv", bucket_name)

    # Extract and load data for 2022
    df_2022 = extract_data_from_csv_task("divvy_2022.csv")
    for chunk in df_2022:
        load_to_bigquery_task(chunk, "divvy_data", "divvy_2022")

    # Extract and load data for 2023
    df_2023 = extract_data_from_csv_task("divvy_2023.csv")
    for chunk in df_2023:
        load_to_bigquery_task(chunk, "divvy_data", "divvy_2023")

divvy_data_pipeline()
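The flow above also calls `extract_data_from_csv_task` and `load_to_bigquery_task`, which are not shown in these snippets. A minimal sketch of how they could look, assuming chunked pandas reads and the BigQuery Python client:

```python
from google.cloud import bigquery
from prefect import task
import pandas as pd

@task(name="extract_data_from_csv")
def extract_data_from_csv_task(file_path, chunksize=100_000):
    # Read the CSV in chunks so a ~1 GB file never has to fit in memory at once
    return pd.read_csv(file_path, chunksize=chunksize)

@task(name="load_to_bigquery")
def load_to_bigquery_task(df, dataset_id, table_id):
    # Append one chunk to the target table, creating it on the first load
    client = bigquery.Client()
    table_ref = f"{client.project}.{dataset_id}.{table_id}"
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
    client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()
    print(f"Loaded {len(df)} rows into {table_ref}")
```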
For the complete code - LINK
For further documentation about Prefect, please check their portal - LINK
- Once the data is successfully transferred to Cloud Storage, we will see this
- We can transfer `divvy_2022` and `divvy_2023` into BigQuery from Cloud Storage (a possible CLI sketch is below)
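One possible way to do this from the command line, instead of the BigQuery console, is with `bq load`; the table and bucket names below follow the ones used in the snippets above:

```bash
bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
    divvy_data.divvy_2022 gs://davvy_bikes_project/divvy_2022.csv

bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
    divvy_data.divvy_2023 gs://davvy_bikes_project/divvy_2023.csv
```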
- The transform process uses the official dbt platform - LINK - so we don't need to do any installation locally.
- To set up the project we need the API key from the GCP Service Accounts
- Create a new folder `staging` and a first file `sources.yml` (LINK) to declare the tables that come from the data store (BigQuery), so we can work in the Models folder; a possible sketch is shown below
- After that, create the queries for the Divvy bike-sharing data and name them `stg_divvy_2022.sql` and `stg_divvy_2023.sql` (the same structure, one per year)
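A possible `sources.yml`, assuming the raw tables were loaded into the `divvy_data` dataset as in the Prefect flow above:

```yaml
version: 2

sources:
  - name: divvy_data
    database: data-engineer-projects-2024
    schema: divvy_data
    tables:
      - name: divvy_2022
      - name: divvy_2023
```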
- Create a new folder in Models named `Core` and a new file `facts_bikes.sql`, to transform and combine all the cleaned divvy_2022 & 2023 data together (a simplified sketch follows the list of changes below)
Changes and cleaning information
- started_at = start_time
- ended_at = end_time
- use DATE(start_time) & TIME(start_time)
- start_lat / start_lng = start_latitude / start_longitude
- end_lat / end_lng = end_latitude / end_longitude
- start_station_id & end_station_id = contain only NULL values, therefore removed
- start_station_name & end_station_name = changed to upper case
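A simplified sketch of what `facts_bikes.sql` could look like with these changes applied; the raw column names (`ride_id`, `rideable_type`, `member_casual`, ...) are assumed from the public Divvy schema, and the staging models are assumed to pass them through unchanged:

```sql
with trips as (
    select * from {{ ref('stg_divvy_2022') }}
    union all
    select * from {{ ref('stg_divvy_2023') }}
)

select
    ride_id,
    rideable_type,
    member_casual              as rider_type,
    started_at                 as start_time,
    ended_at                   as end_time,
    date(started_at)           as start_date,
    time(started_at)           as start_time_of_day,
    start_lat                  as start_latitude,
    start_lng                  as start_longitude,
    end_lat                    as end_latitude,
    end_lng                    as end_longitude,
    upper(start_station_name)  as start_station_name,
    upper(end_station_name)    as end_station_name
    -- start_station_id and end_station_id are dropped because they contain only NULLs
from trips
```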
- Each query can be tested by running Preview, then `dbt run` or `dbt build`, to see whether it works properly. If you want to run a specific file you can do `dbt run --select divvy_2022`
- We can add tests to check whether there are issues with the primary key (see the sketch below)
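For example, a `schema.yml` entry with dbt's built-in generic tests, assuming `ride_id` is the primary key of `facts_bikes`:

```yaml
version: 2

models:
  - name: facts_bikes
    columns:
      - name: ride_id
        tests:
          - unique
          - not_null
```

Running `dbt test` will then flag duplicate or missing ride IDs.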
- Lastly, we'll see this diagram showing that everything is connected together
- To collect data from BigQuery, we must choose "Get Data" -> "Database" -> "Google BigQuery" -> choose "facts_bikes"
- The DAX calculations created for this project are the following:
Month = MONTH(facts_bikes[start_date])
Casual Rider = COUNTROWS(FILTER(facts_bikes, facts_bikes[rider_type] = "casual"))
Member Rider = COUNTROWS(FILTER(facts_bikes, facts_bikes[rider_type] = "member"))
Total Rider = COUNTROWS(facts_bikes)
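The dashboard also reports an average ride duration; one way such a measure could be written in DAX, assuming the `start_time` and `end_time` columns from `facts_bikes`, is:

```
Avg Ride Duration (min) =
AVERAGEX(
    facts_bikes,
    DATEDIFF(facts_bikes[start_time], facts_bikes[end_time], MINUTE)
)
```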
- Due to the limitations of Power BI Desktop when not using a business email account, there are some restrictions on creating the dashboard, and this also relates to GCP
- For data visualisation we are using line, pie, bar, column, and donut charts. The complete charts will look like this
- Daily and monthly ridership trends, broken down by member vs casual riders
- Top stations by usage, with some like Dear Lake and Lake Mille Stree being very popular
- Hourly usage patterns across different stations
- Overall rider stats like total rides (0.7M), members (363K), and average ride duration (9 mins)
- Bike type preferences, with classic and electric bikes being most used
- Geographic distribution of stations in the Chicago area