
DISCLAIMER: I submitted the project in the first attempt itself. Because someone had given a wrong commit ID for a project I had to review, I did not review one of the projects assigned to me. That is why I have made some improvements (added Prefect and revised the README) and submitted again. Please do take the time to review my project and give any suggestions to improve my understanding. Also note that, since I added this README after filling in the form for project attempt 2, the commit ID you see in the Excel sheet will differ from the current commit ID.

Excel sheet commit ID: f812320

Current commit ID: whatever commit ID you can see


de_zoomcamp_2023_capstone

For the final capstone project we are going to use the Citi Bike sharing data to get some insights and explain them using visualizations. All the tools explained in the course have been used to help with the process.

Problem Description

An intro to the dataset:

Citi Bike is the nation's largest bike share program, with 27,000 bikes and over 1,700 stations across Manhattan, Brooklyn, Queens, the Bronx, Jersey City and Hoboken. This dataset is about the Citi Bike bike-sharing service in NY.

The official site is here

The Dataset site is here

Some important info about the dataset from the site:

  • This data has been processed to remove
    1. trips that are taken by staff as they service and inspect the system.
    2. trips that are taken to/from any of our “test” stations (which we were using more in June and July 2013).
    3. any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it's secure).

The questions we are trying to answer are (see the pandas sketch after the data dictionary below):

  1. What is the total number of trips taken by each type of customer (member and casual)?
  2. What is the average trip duration for each type of bike?
  3. What is the average trip duration for each customer type (member vs casual)?

Data Dictionary:


  • Ride ID = unique ID for every ride
  • Rideable type = type of bike available to riders: classic, docked, or electric
  • Started at & Ended at = timestamps at which the ride starts and ends
  • Start station name = name of the starting station
  • Start station ID = ID of the starting station
  • End station name = name of the ending station
  • End station ID = ID of the ending station
  • Start latitude & Start longitude = latitude and longitude of the starting point
  • End latitude & End longitude = latitude and longitude of the ending point
  • Member or casual ride = whether the rider is a member (has a subscription) or a casual rider (does not have a subscription)
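
To make the three questions above concrete, here is a minimal pandas sketch of the aggregations behind them. It assumes the raw CSV uses snake_case versions of the fields in the data dictionary (ride_id, rideable_type, started_at, ended_at, member_casual); it is an illustration, not the exact code behind the dashboard.

    import pandas as pd

    # Load one month of raw trip data (the path is a placeholder).
    df = pd.read_csv(
        "citibike_trips.csv",
        parse_dates=["started_at", "ended_at"],
    )

    # Trip duration in minutes, derived from the start/end timestamps.
    df["trip_duration_min"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60

    # 1. Total number of trips per customer type (member vs casual).
    trips_per_customer_type = df.groupby("member_casual")["ride_id"].count()

    # 2. Average trip duration per bike type (classic, docked, electric).
    avg_duration_per_bike_type = df.groupby("rideable_type")["trip_duration_min"].mean()

    # 3. Average trip duration per customer type.
    avg_duration_per_customer_type = df.groupby("member_casual")["trip_duration_min"].mean()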

Description of files and folders:

  • ingest_bike_data.py - contains all the code used to transform the data and save it in the required format
  • docker-compose.yml - defines the Postgres and pgAdmin services that are run together as containers, binding them into a common network
  • Dockerfile - copies the ingest file into the container created in the above step, installs the required packages, and runs the Python script
  • README.md - (the file you are reading) gives a brief intro to the project and contains the instructions on how to reproduce it
  • prefect_flows - the folder containing the Prefect flows that execute the pipeline as a flow
  • dbt - files related to dbt, which helps with Analytics Engineering. This is currently not implemented; I am facing issues with setting up the source and querying.

Cloud

The project has been developed without using any cloud technologies. The whole project is done locally, and the steps to recreate it are mentioned in the Reproducibility section.

Data ingestion (batch)

The data is generated monthly, and from a business perspective it is more fruitful to ingest it monthly rather than weekly or daily. Therefore we are using batch processing to ingest the data. For the time being, once Postgres and pgAdmin are set up, we need to run the docker commands manually to ingest the data. Prefect is used to create the flow that ingests the data by running it locally.
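
As an illustration of this batch pattern, below is a minimal sketch of a monthly chunked ingest into Postgres using pandas and SQLAlchemy. The root/root credentials come from the docker-compose setup described later; the database name (bike_data), table name (bike_trips), and function name are placeholders, not necessarily what ingest_bike_data.py actually uses.

    import pandas as pd
    from sqlalchemy import create_engine  # also requires psycopg2-binary installed

    def ingest_month(csv_url: str, table: str = "bike_trips") -> None:
        """Ingest one monthly Citi Bike CSV into Postgres in chunks."""
        # Credentials match the docker-compose Postgres service (root/root);
        # the database name 'bike_data' is a placeholder.
        engine = create_engine("postgresql://root:root@localhost:5432/bike_data")

        # Stream the file in 100k-row chunks so a full month fits comfortably in memory.
        for chunk in pd.read_csv(csv_url, chunksize=100_000):
            chunk["started_at"] = pd.to_datetime(chunk["started_at"])
            chunk["ended_at"] = pd.to_datetime(chunk["ended_at"])
            # to_sql creates the table on first use and appends afterwards,
            # so re-running this each month simply adds the new batch.
            chunk.to_sql(table, engine, if_exists="append", index=False)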

Data warehouse

As we are not using GCP for the project, we are treating the Postgres database as our warehouse; it holds the data that will be used to gain insights through the Metabase tool.
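
Once data has been ingested, a quick sanity check on the warehouse (outside pgAdmin) can be done with a short query from Python. Again, the database and table names below are placeholders.

    import pandas as pd
    from sqlalchemy import create_engine

    # Same root/root credentials as the docker-compose setup; 'bike_data' and
    # 'bike_trips' are placeholder database/table names.
    engine = create_engine("postgresql://root:root@localhost:5432/bike_data")

    print(pd.read_sql("SELECT COUNT(*) AS total_trips FROM bike_trips", engine))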

Dashboard

We are using Metabase as our data visualization tool, with Postgres as its data source. As the whole project runs on a local machine, screenshots and GIFs of the dashboard are provided below in the The Dashboard section.

Reproducibility

  1. Create a new folder to clone the repo into.
  2. Clone the project with the command git clone <url-to-the-repo> . . If the repo's contents are copied into your new folder, this step is successful.
  3. You can skip this step if you are running the project for the first time. Otherwise, or just as a sanity check, run sudo rm -rf bike_postgres_data to remove the volume data left over from a previous run.
  4. Check whether Docker is already installed on your system by running docker --version. If not, use the link Docker desktop setup to install Docker Desktop for your OS.
  5. Next, start Postgres + pgAdmin (for quick queries against the database) by running docker compose up -d.
  6. Access the pgAdmin URL to check whether pgAdmin is working properly. Log in with the creds [ username - admin@admin.com and password - root ].
    1. Once you are logged in to pgAdmin, create a server:
      1. with a server name of your choice
      2. hostname = name of the Postgres container you created while running docker compose up -d
      3. username = root
      4. password = root
      5. Save the config. If the config info is correct, you will have a server with the name of your choice, without any error.
  7. NOTE: By this stage Postgres should be up and accessible through the pgAdmin GUI with all the required config, before running the ingestion steps below.
  8. As we are using Prefect for workflow orchestration, the way we ingest the data into the DB is a bit different, as follows (see the flow sketch after this list):

  9. First create a virtual env with Python version 3.9 (not tested with Python 3.9+).

  10. Using the requirements.txt in the prefect folder, install the Prefect dependencies into the environment. Check that Prefect is installed properly by executing the command prefect version in the CLI.

  11. Run the Prefect UI by executing the command prefect orion start.

    (de_zoomcamp_2023_capstone) prabin_nayak@DESKTOP-IUPLGMD:~$ prefect orion start
    
      ___ ___ ___ ___ ___ ___ _____    ___  ___ ___ ___  _  _
      | _ \ _ \ __| __| __/ __|_   _|  / _ \| _ \_ _/ _ \| \| |
      |  _/   / _|| _|| _| (__  | |   | (_) |   /| | (_) | .` |
      |_| |_|_\___|_| |___\___| |_|    \___/|_|_\___\___/|_|\_|
    
      Configure Prefect to communicate with the server with:
    
          prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
    
      View the API reference documentation at http://127.0.0.1:4200/docs
    
      Check out the dashboard at http://127.0.0.1:4200

    Open the dashboard at http://127.0.0.1:4200 to confirm you can access the Prefect UI.

  12. Now we have to deploy our Prefect flow so that we can execute it from the UI.

    1. Make sure you are in the folder prefect_flows, or else run cd ./prefect_flows.
    2. Run the command prefect deployment build ./ingest_bike_data.py:main_flow -n "Capstone Project Flow", which will create a main_flow-deployment.yaml in the path ./prefect_flows. This helps us deploy the Prefect flow to the server (the local server we have been running through Orion).
    3. Once the file is created, run the command prefect deployment apply main_flow-deployment.yaml to upload the deployment so that you can see it in the UI.
    4. Now, although you can see the flow, it won't run even if you trigger it, because there is no agent to pick it up. Select the Work Queues tab from the left side panel; if you open the default queue you will see that the flow is scheduled but there is no agent.
    5. To start an agent, run the command prefect agent start --work-queue "default". The run will then start.
  13. How the data visualizations were created:

    1. Installing Metabase:
      1. Pull the Metabase image from Docker Hub with the command docker pull metabase/metabase:latest.
      2. Run the Metabase container with the command docker run -d -p 3000:3000 --name metabase metabase/metabase.
      3. Access the URL localhost:3000/ to reach the Metabase server running on your local instance.
      4. Check the gif and screenshots below for the dashboards created.
  14. To stop the Postgres + pgAdmin containers, run the command docker compose down.
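
For reference, here is a minimal sketch of what a Prefect 2.x flow like the one in prefect_flows/ingest_bike_data.py might look like. Only the flow name main_flow is taken from the deployment command above; the task bodies, database name, and table name are illustrative placeholders, not the repo's actual code.

    import pandas as pd
    from prefect import flow, task
    from sqlalchemy import create_engine

    @task(retries=3)
    def extract(csv_url: str) -> pd.DataFrame:
        """Download one month of Citi Bike trip data."""
        return pd.read_csv(csv_url)

    @task
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        """Parse timestamps and derive the trip duration in minutes."""
        df["started_at"] = pd.to_datetime(df["started_at"])
        df["ended_at"] = pd.to_datetime(df["ended_at"])
        df["trip_duration_min"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60
        return df

    @task
    def load(df: pd.DataFrame, table: str) -> None:
        """Append the month to Postgres (root/root creds from docker-compose)."""
        engine = create_engine("postgresql://root:root@localhost:5432/bike_data")
        df.to_sql(table, engine, if_exists="append", index=False)

    @flow(name="main_flow")
    def main_flow(csv_url: str, table: str = "bike_trips"):
        load(transform(extract(csv_url)), table)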

Tools used

  • Scripting the data ingest file - Python
  • Containerization - Docker
  • Data warehouse - Postgres
  • Data visualization - Metabase
  • Making the GIFs - Chrome Capture - screenshot & gif tool

The Dashboard

(dashboard screenshot)

Trips count gif

(gif: Trips Count - chrome-capture-2023-2-21)

Avg Trip Duration by ride type

(gif: Avg Trip dur by Ride Types - chrome-capture-2023-2-21)

Avg Trip Duration by Customer Type

(gif: Avg Trip dur by Customer Type - chrome-capture-2023-2-21)

Future Improvements

  1. Currently the transformations are done in the Python script itself. Add dbt transformations to create fact and dimension tables dynamically.
  2. Add Docker capabilities to the Prefect flow so that the whole flow can be containerized and set up on any host machine with a single command.
  3. Add distributed processing using Spark.
  4. Add stream processing, depending on the feasibility and scope of the project.

Last but not the Least...

Thanks to DataTalks.Club for creating this Data Engineering course to help anyone and everyone learn Data Engineering at ZERO COST. Other offerings by the club are the ML Zoomcamp and MLOps Zoomcamp. Do check them out!!
