Divvy is Chicagoland's bike share system, offering convenient, affordable transportation across Chicago and Evanston. With sturdy bikes and docking stations throughout the region, users can unlock bikes from one station and return them to any other. Available 24/7, Divvy is perfect for exploring, commuting, running errands, and more.
Dataset source - LINK
- Which stations are the most popular and heavily used, and how does this vary by rider type (casual vs. member)?
- How do usage patterns vary across different times of day, days of the week, and months of the year?
- How does the geographic distribution of stations align with areas of high demand or potential expansion opportunities?
- What factors (e.g., weather, events, holidays) influence the fluctuations in ridership over time?
- How can the system be optimized to ensure bikes are available at high-demand stations during peak hours?
- Using data from 2022-2023; each year's dataset is almost 1 GB
- As I don't own a Power BI business account, there are some limitations with the Desktop version, including data cleaning, as well as GCP restrictions
- Google Cloud Platform (GCP): BigQuery & GCS - Data Warehouse
- Terraform - Infrastructure as Code (IaC)
- Prefect - Data Pipeline
- GCP BigQuery - Data Warehouse
- dbt - data build tool
- Power BI - Data Visualisation
- Conclusion
- Sign up on GCP platform https://cloud.google.com/
- Install Google SDK https://cloud.google.com/sdk/docs/install
- Authenticate the SDK with your GCP account with `gcloud auth login`, then set the default project with `gcloud config set project PROJECT_ID`
- Enable the required APIs in the API Library - Compute Engine, Storage Admin, Dataproc, BigQuery
- Create an API key under Service Accounts (IAM); this key will be used by Terraform, Prefect and dbt (see the sketch below)
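For reference, the service account and JSON key can also be created from the command line. This is only a rough sketch: the account name `divvy-pipeline` and the roles are illustrative, and the key path matches the `profiles.yml` shown later.

```bash
# Create a service account for the project (name is illustrative)
gcloud iam service-accounts create divvy-pipeline \
    --display-name="Divvy pipeline service account"

# Grant it the roles the pipeline needs (adjust to your own setup)
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:divvy-pipeline@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:divvy-pipeline@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

# Download the JSON key that Terraform, Prefect and dbt will use
gcloud iam service-accounts keys create ./lib/gcp-project.json \
    --iam-account=divvy-pipeline@PROJECT_ID.iam.gserviceaccount.com
```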
- Set up the installation - LINK
- Create `main.tf` using the information from your GCP account, e.g. dataset, storage bucket, credentials and instance, then run `terraform init` (a minimal sketch follows below)
- After that, run `terraform plan` and `terraform apply`
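A minimal `main.tf` along these lines could provision the bucket and dataset used later in the pipeline. This is only a sketch: the region is an assumption, and the names are taken from the other snippets in this project.

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project     = "data-engineer-projects-2024"
  region      = "us-central1"                 # assumed region
  credentials = file("./lib/gcp-project.json")
}

# GCS bucket that will hold the raw Divvy CSV files
resource "google_storage_bucket" "divvy_bucket" {
  name          = "davvy_bikes_project"
  location      = "US"
  force_destroy = true
}

# BigQuery dataset the pipeline loads into
resource "google_bigquery_dataset" "divvy_dataset" {
  dataset_id = "divvy_data"
  location   = "US"
}
```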
- Install Prefect with `pip install -U prefect`. You can either run the code locally from source:
$ git clone https://github.com/PrefectHQ/prefect.git
$ cd prefect
$ pip install -e ".[dev]"
$ pre-commit install
or log in to their platform directly with `prefect cloud login`
- First-timers at Prefect must create an account; you can ONLY work with the ID number they assign you
- Run `prefect init`
- Create a virtual environment with `python3.10 -m venv "env"`, then activate it with `source env/bin/activate`
- Create `profiles.yml`:
divvy_bike_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: ./lib/gcp-project.json
      project: data-engineer-projects-2024
      dataset: divvy_bike
      threads: 1
      timeout_seconds: 300
- Create `prefect_extract.py`
- First process: loading the data - 12 CSV files for each of 2022 and 2023
from google.cloud import storage
from prefect import task, flow
import os
import pandas as pd
import glob
bucket_name = 'davvy_bikes_project'
## Loading, Read & Combine Data ##
@task(name="upload_to_gcs")
def upload_to_gcs(file_path, bucket_name):
    # Upload a local CSV file to the GCS bucket
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(os.path.basename(file_path))
    blob.upload_from_filename(file_path)
    print(f"File {file_path} uploaded to GCS bucket {bucket_name}")
- Second process: extracting the data to BigQuery
## Extracting Process to GCP ##
@flow(name="divvy_data_pipeline")
def divvy_data_pipeline():
    # Upload the raw files to Cloud Storage first
    upload_to_gcs("divvy_2022.csv", bucket_name)
    upload_to_gcs("divvy_2023.csv", bucket_name)

    # Extract and load data for 2022
    df_2022 = extract_data_from_csv_task("divvy_2022.csv")
    for chunk in df_2022:
        load_to_bigquery_task(chunk, "divvy_data", "divvy_2022")

    # Extract and load data for 2023
    df_2023 = extract_data_from_csv_task("divvy_2023.csv")
    for chunk in df_2023:
        load_to_bigquery_task(chunk, "divvy_data", "divvy_2023")

divvy_data_pipeline()
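The flow above also calls `extract_data_from_csv_task` and `load_to_bigquery_task`, which are not shown in these snippets. A minimal sketch of how they could look, assuming chunked pandas reads and the BigQuery Python client:

```python
from google.cloud import bigquery
from prefect import task
import pandas as pd

@task(name="extract_data_from_csv")
def extract_data_from_csv_task(file_path, chunksize=100_000):
    # Read the CSV in chunks so a ~1 GB file never has to fit in memory at once
    return pd.read_csv(file_path, chunksize=chunksize)

@task(name="load_to_bigquery")
def load_to_bigquery_task(df, dataset_id, table_id):
    # Append one chunk to the target table, creating it on the first load
    client = bigquery.Client()
    table_ref = f"{client.project}.{dataset_id}.{table_id}"
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
    client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()
    print(f"Loaded {len(df)} rows into {table_ref}")
```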
For the complete code - LINK
For further documentation about Prefect, please check their portal - LINK
- Once the data is successfully transferred to Cloud Storage, we will see this
- We can transfer `divvy_2022` and `divvy_2023` into BigQuery from Cloud Storage (a possible CLI sketch is below)
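One possible way to do this from the command line, instead of the BigQuery console, is with `bq load`; the table and bucket names below follow the ones used in the snippets above:

```bash
bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
    divvy_data.divvy_2022 gs://davvy_bikes_project/divvy_2022.csv

bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
    divvy_data.divvy_2023 gs://davvy_bikes_project/divvy_2023.csv
```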
- The transform process uses the official dbt platform - LINK - so we don't need to do any installation locally.
- To set up the project we need the API key from the GCP Service Accounts
- Create a new folder `staging` and a first file `sources.yml` (LINK) to declare the tables that come from the data store (BigQuery), so we can work in the Models folder; a possible sketch is shown below
- After that, create the queries for the Divvy bike-sharing data and name them `stg_divvy_2022.sql` and `stg_divvy_2023.sql` (the same structure, one per year)
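A possible `sources.yml`, assuming the raw tables were loaded into the `divvy_data` dataset as in the Prefect flow above:

```yaml
version: 2

sources:
  - name: divvy_data
    database: data-engineer-projects-2024
    schema: divvy_data
    tables:
      - name: divvy_2022
      - name: divvy_2023
```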
- Create a new folder in Models named `Core` and a new file `facts_bikes.sql`, to transform and combine all the cleaned divvy_2022 & 2023 data together (a simplified sketch follows the list of changes below)
Changes and cleaning information
- started_at = start_time
- ended_at = end_time
- use DATE(start_time) & TIME(start_time)
- start_lat / start_lng = start_latitude / start_longitude
- end_lat / end_lng = end_latitude / end_longitude
- start_station_id & end_station_id = contain only NULL values, therefore removed
- start_station_name & end_station_name = changed to upper case
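A simplified sketch of what `facts_bikes.sql` could look like with these changes applied; the raw column names (`ride_id`, `rideable_type`, `member_casual`, ...) are assumed from the public Divvy schema, and the staging models are assumed to pass them through unchanged:

```sql
with trips as (
    select * from {{ ref('stg_divvy_2022') }}
    union all
    select * from {{ ref('stg_divvy_2023') }}
)

select
    ride_id,
    rideable_type,
    member_casual              as rider_type,
    started_at                 as start_time,
    ended_at                   as end_time,
    date(started_at)           as start_date,
    time(started_at)           as start_time_of_day,
    start_lat                  as start_latitude,
    start_lng                  as start_longitude,
    end_lat                    as end_latitude,
    end_lng                    as end_longitude,
    upper(start_station_name)  as start_station_name,
    upper(end_station_name)    as end_station_name
    -- start_station_id and end_station_id are dropped because they contain only NULLs
from trips
```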
- Each query can be tested by running Preview, then `dbt run` or `dbt build`, to see whether it works properly. If you want to run a specific file you can do `dbt run --select divvy_2022`
- We can add tests to check whether there are issues with the primary key (see the sketch below)
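For example, a `schema.yml` entry with dbt's built-in generic tests, assuming `ride_id` is the primary key of `facts_bikes`:

```yaml
version: 2

models:
  - name: facts_bikes
    columns:
      - name: ride_id
        tests:
          - unique
          - not_null
```

Running `dbt test` will then flag duplicate or missing ride IDs.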
- Lastly, we'll see this diagram showing that everything is connected together
- To collect data from BigQuery, we must choose "Get Data" -> "Database" -> "Google BigQuery" -> choose "facts_bikes"
- The DAX calculations created for this project are the following:
Month = MONTH(facts_bikes[start_date])
Casual Rider = COUNTROWS(FILTER(facts_bikes, facts_bikes[rider_type] = "casual"))
Member Rider = COUNTROWS(FILTER(facts_bikes, facts_bikes[rider_type] = "member"))
Total Rider = COUNTROWS(facts_bikes)
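The dashboard also reports an average ride duration; one way such a measure could be written in DAX, assuming the `start_time` and `end_time` columns from `facts_bikes`, is:

```
Avg Ride Duration (min) =
AVERAGEX(
    facts_bikes,
    DATEDIFF(facts_bikes[start_time], facts_bikes[end_time], MINUTE)
)
```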
- Due to the limitations of Power BI Desktop when not using a business email account, there are some restrictions on creating the dashboard, and this also relates to GCP
- For data visualisation we are using line, pie, bar, column, and donut charts. The complete charts will look like this
- Daily and monthly ridership trends, broken down by member vs casual riders
- Top stations by usage, with some like Dear Lake and Lake Mille Stree being very popular
- Hourly usage patterns across different stations
- Overall rider stats like total rides (0.7M), members (363K), and average ride duration (9 mins)
- Bike type preferences, with classic and electric bikes being most used
- Geographic distribution of stations in the Chicago area