
JBangtson/Wedge_Project



Table of Contents 🍎

  1. Extract: Raw Data Exploration and Metadata
  2. Transform & Load: Cleaning and Uploading to GBQ
  3. Owner Queries
  4. Creating SQLite DB and Tables from GBQ Queries

Summary

In this data engineering project, I created an ETL pipeline for point-of-sale (POS) 🏪 data from the Wedge Co-Op in Minneapolis, spanning January 2010 to January 2017. The dataset captures transaction-level details from a member-owned cooperative, with 75% of transactions generated by member-owners, enabling comprehensive shopping pattern analysis.

The purpose of this file is to summarize the Wedge Project and to check the accuracy of the ETL process. The first section of the project extracted the data from zip files and created metadata for exploration. The second section consisted of writing project-specific data cleaning functions, then transforming and loading the data into Google BigQuery (GBQ). In the third section, text files were created from GBQ owner-specific queries. In the final section, data was downloaded from GBQ using queries and loaded into a SQLite database.

Task 1

  • Files for this task:

explore_wedge.ipynb: The notebook is structured to facilitate the exploration, summarization, and cleaning of a dataset related to the Wedge project, with a focus on handling CSV files efficiently using Polars.
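As a rough illustration of that exploration step, here is a minimal sketch using Polars; the file name and read options are assumptions for illustration, not values taken from the notebook.

```python
import polars as pl

# Hypothetical archive file name, for illustration only.
path = "transArchive_201001_201003.csv"

# Some Wedge files lack headers or use semicolons as delimiters,
# so has_header and separator may need to change per file.
df = pl.read_csv(path, separator=",", has_header=True, infer_schema_length=10_000)

print(df.shape)         # rows x columns
print(df.schema)        # inferred dtypes; helps spot strings in numeric columns
print(df.null_count())  # per-column null counts for the metadata summary
```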

to_the_cloud.ipynb: The notebook is designed to automate the process of loading, cleaning, and uploading data to BigQuery, ensuring that the data is in the correct format and structure for analysis.
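Below is a minimal sketch of the clean-then-upload pattern, assuming pandas and the google-cloud-bigquery client; the project ID, dataset, table name, and sample columns are placeholders rather than the notebook's actual values.

```python
import pandas as pd
from google.cloud import bigquery

# Stand-in for one cleaned transaction archive.
cleaned_df = pd.DataFrame(
    {"datetime": ["2010-01-02 09:15:00"], "card_no": [3.0], "trans_subtype": [" "]}
)

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Append to the transactions table, letting BigQuery infer the schema.
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,
)
table_id = "my-gcp-project.wedge_transactions.transactions"  # placeholder table
client.load_table_from_dataframe(cleaned_df, table_id, job_config=job_config).result()
```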

Task 2

  • File for this task:

GBQ_owner_query.ipynb: This notebook automates the process of querying transaction data from BigQuery, handling card owner information, and saving the results for further analysis. It emphasizes data retrieval, processing, and error management in the context of working with large datasets.
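A minimal sketch of that query-and-save pattern, assuming the google-cloud-bigquery client; the project, dataset, and table names are placeholders, and the card number is used only for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Pull every transaction row for a single card owner (placeholder card number).
query = """
    SELECT *
    FROM `my-gcp-project.wedge_transactions.transactions`
    WHERE card_no = @card_no
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("card_no", "FLOAT64", 18736.0)]
)

owner_df = client.query(query, job_config=job_config).to_dataframe()
owner_df.to_csv("owner_18736.txt", sep="\t", index=False)  # save for later analysis
```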

Task 3

  • File for this task:

building_summary_tables.ipynb: The notebook queries summary data from BigQuery and loads the results into a local SQLite database, creating the summary tables described in the final section of the project.
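A minimal sketch of moving a GBQ query result into a local SQLite database, assuming pandas and the standard sqlite3 module; the query, column names, and table names are placeholders, not the notebook's actual summary tables.

```python
import sqlite3
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Placeholder summary query: row counts by date and hour.
query = """
    SELECT EXTRACT(DATE FROM datetime) AS date,
           EXTRACT(HOUR FROM datetime) AS hour,
           COUNT(*) AS row_count
    FROM `my-gcp-project.wedge_transactions.transactions`
    GROUP BY date, hour
"""
summary_df = client.query(query).to_dataframe()

# Write the result into the local SQLite database as its own table.
with sqlite3.connect("wedge_summary.db") as conn:
    summary_df.to_sql("rows_by_date_by_hour", conn, if_exists="replace", index=False)
```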

Query Comparison Results

Assignment: Fill in the following table with the results from the queries contained in gbq_assessment_query.sql. You only need to fill in the relative difference on the rows where it applies. When calculating relative difference, use the formula (your_results - john_results)/john_results.
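The formula is easy to sanity-check directly; for instance, using the Total Rows values from the table below:

```python
def rel_diff(your_results: int, john_results: int) -> float:
    """Relative difference as defined in the assignment."""
    return (your_results - john_results) / john_results

print(rel_diff(85_760_124, 85_760_139))  # roughly -1.7e-07
```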

| Query | Your Results | John's Results | Difference | Rel. Diff |
|---|---|---|---|---|
| Total Rows | 85,760,124 | 85,760,139 | -15 | 15 |
| January 2012 Rows | 1,070,907 | 1,070,907 | 0 | 0 |
| October 2012 Rows | 1,029,592 | 1,029,592 | 0 | 0 |
| Month with Fewest | February (2) | Yes | NA | |
| Num Rows in Month with Fewest | 6,556,769 | 6,556,770 | -1 | 1 |
| Month with Most | May | Yes | NA | |
| Num Rows in Month with Most | 7,578,371 | 7,578,372 | -1 | 1 |
| Null_TS | 485,472 | 7,123,792 | -6,338,320 | -6,338,320 |
| Null_DT | 0 | 0 | 0 | 0 |
| Null_Local | 234,839 | 234,843 | -6 | 6 |
| Null_CN | 0 | 0 | 0 | 0 |
| Num 5 on High Volume Cards | 14987 | Yes | NA | |
| Num Rows for Number 5 | 460,625 | 460,630 | -5 | 5 |
| Num Rows for 18736 | 12,153 | 12,153 | 0 | 0 |
| Product with Most Rows | Banana Organic | Yes | NA | |
| Num Rows for that Product | 908,637 | 908,639 | -2 | 2 |
| Product with Fourth-Most Rows | Avocado Hass Organic | Yes | NA | |
| Num Rows for that Product | 456,771 | 456,771 | 0 | 0 |
| Num Single Record Products | 2,741 | 2,769 | -28 | 28 |
| Year with Highest Portion of Owner Rows | 2014 | Yes | NA | |
| Fraction of Rows from Owners in that Year | 75.91% | 75.91% | 0% | 0% |
| Year with Lowest Portion of Owner Rows | 2011 | Yes | NA | |
| Fraction of Rows from Owners in that Year | 73.72% | 73.72% | 0% | 0% |

Note: I have such a large difference in Null_TS because my cleaning converted missing values in the trans_subtype column to " " instead of NULL. I have 65,065,888 rows of " " in trans_subtype; taking this into account, there is still a difference of 1,239,440, with John having the greater Null_TS count.

Reflections

Overall, the Wedge Project was exciting as my first experience working with a cloud database. The experience gave me confidence in my ability to carry out the ETL process.

The process was messy. Some files were already clean, while others had no column names, had strings in columns with float datatypes, or were delimited with semicolons instead of commas.

I wanted to do each task in its own loop. For example, I tried to clean and upload all the data to GBQ in a single loop (I hope it is fully automatic by the time this is due, as I wanted). But sometimes a chunk of code would run for 15 minutes only to crash, and instead of tweaking the loop to restart where it left off, it was easier to add a 'manual' section where I would select a file by hand, then clean and upload it. Slowing down saved some money, but I lost time.

After completing the tasks, I had a lot of messy code to clean up, and I found errors in my cleaning, which created more mess. I am confident that any remaining errors are trivial, but I am still cleaning up and commenting the messy code.
