
JBangtson/Wedge_Project



Table of Contents 🍎

  1. Extract: Raw Data Exploration and Metadata
  2. Transform & Load: Cleaning and Uploading to GBQ
  3. Owner Queries
  4. Creating SQLite DB and Tables from GBQ Queries

Summary

In this data engineering project, I created an ETL pipeline for point-of-sale (POS) 🏪 data from the Wedge Co-Op in Minneapolis, spanning January 2010 to January 2017. The dataset captures transaction-level details from a member-owned cooperative, with 75% of transactions generated by member-owners, enabling comprehensive shopping pattern analysis.

The purpose of this file is to summarize the Wedge Project and to check the accuracy of the ETL process. The first section of the project extracted the data from zip files and created metadata for exploration. The second section consisted of writing project-specific data cleaning functions, then transforming and loading the data into Google BigQuery (GBQ). In the third section, text files were created from GBQ owner-specific queries. In the final section, data was downloaded from GBQ using queries and loaded into a SQLite database.

Task 1

  • Files for this task:

explore_wedge.ipynb: The notebook is structured to facilitate the exploration, summarization, and cleaning of a dataset related to the Wedge project, with a focus on handling CSV files efficiently using Polars.
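As a rough illustration of that exploration step, here is a minimal sketch using Polars; the file name and read options are assumptions for illustration, not values taken from the notebook.

```python
import polars as pl

# Hypothetical archive file name, for illustration only.
path = "transArchive_201001_201003.csv"

# Some Wedge files lack headers or use semicolons as delimiters,
# so has_header and separator may need to change per file.
df = pl.read_csv(path, separator=",", has_header=True, infer_schema_length=10_000)

print(df.shape)         # rows x columns
print(df.schema)        # inferred dtypes; helps spot strings in numeric columns
print(df.null_count())  # per-column null counts for the metadata summary
```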

to_the_cloud.ipynb: The notebook is designed to automate the process of loading, cleaning, and uploading data to BigQuery, ensuring that the data is in the correct format and structure for analysis.
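Below is a minimal sketch of the clean-then-upload pattern, assuming pandas and the google-cloud-bigquery client; the project ID, dataset, table name, and sample columns are placeholders rather than the notebook's actual values.

```python
import pandas as pd
from google.cloud import bigquery

# Stand-in for one cleaned transaction archive.
cleaned_df = pd.DataFrame(
    {"datetime": ["2010-01-02 09:15:00"], "card_no": [3.0], "trans_subtype": [" "]}
)

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Append to the transactions table, letting BigQuery infer the schema.
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,
)
table_id = "my-gcp-project.wedge_transactions.transactions"  # placeholder table
client.load_table_from_dataframe(cleaned_df, table_id, job_config=job_config).result()
```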

Task 2

  • File for this task:

GBQ_owner_query.ipynb: This notebook automates the process of querying transaction data from BigQuery, handling card owner information, and saving the results for further analysis. It emphasizes data retrieval, processing, and error management in the context of working with large datasets.
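A minimal sketch of that query-and-save pattern, assuming the google-cloud-bigquery client; the project, dataset, and table names are placeholders, and the card number is used only for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Pull every transaction row for a single card owner (placeholder card number).
query = """
    SELECT *
    FROM `my-gcp-project.wedge_transactions.transactions`
    WHERE card_no = @card_no
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("card_no", "FLOAT64", 18736.0)]
)

owner_df = client.query(query, job_config=job_config).to_dataframe()
owner_df.to_csv("owner_18736.txt", sep="\t", index=False)  # save for later analysis
```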

Task 3

  • File for this task:

building_summary_tables.ipynb: The notebook queries summary data from BigQuery and loads the results into a local SQLite database, creating the summary tables described in the final section of the project.
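A minimal sketch of moving a GBQ query result into a local SQLite database, assuming pandas and the standard sqlite3 module; the query, column names, and table names are placeholders, not the notebook's actual summary tables.

```python
import sqlite3
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Placeholder summary query: row counts by date and hour.
query = """
    SELECT EXTRACT(DATE FROM datetime) AS date,
           EXTRACT(HOUR FROM datetime) AS hour,
           COUNT(*) AS row_count
    FROM `my-gcp-project.wedge_transactions.transactions`
    GROUP BY date, hour
"""
summary_df = client.query(query).to_dataframe()

# Write the result into the local SQLite database as its own table.
with sqlite3.connect("wedge_summary.db") as conn:
    summary_df.to_sql("rows_by_date_by_hour", conn, if_exists="replace", index=False)
```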

Query Comparison Results

Assignment: Fill in the following table with the results from the queries contained in gbq_assessment_query.sql. You only need to fill in the relative difference on the rows where it applies. When calculating relative difference, use the formula (your_results - john_results)/john_results.
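The formula is easy to sanity-check directly; for instance, using the Total Rows values from the table below:

```python
def rel_diff(your_results: int, john_results: int) -> float:
    """Relative difference as defined in the assignment."""
    return (your_results - john_results) / john_results

print(rel_diff(85_760_124, 85_760_139))  # roughly -1.7e-07
```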

| Query | Your Results | John's Results | Difference | Rel. Diff |
|---|---|---|---|---|
| Total Rows | 85,760,124 | 85,760,139 | -15 | 15 |
| January 2012 Rows | 1,070,907 | 1,070,907 | 0 | 0 |
| October 2012 Rows | 1,029,592 | 1,029,592 | 0 | 0 |
| Month with Fewest | February (2) | Yes | NA | |
| Num Rows in Month with Fewest | 6,556,769 | 6,556,770 | -1 | 1 |
| Month with Most | May | Yes | NA | |
| Num Rows in Month with Most | 7,578,371 | 7,578,372 | -1 | 1 |
| Null_TS | 485,472 | 7,123,792 | -6,338,320 | -6,338,320 |
| Null_DT | 0 | 0 | 0 | 0 |
| Null_Local | 234,839 | 234,843 | -6 | 6 |
| Null_CN | 0 | 0 | 0 | 0 |
| Num 5 on High Volume Cards | 14987 | Yes | NA | |
| Num Rows for Number 5 | 460,625 | 460,630 | -5 | 5 |
| Num Rows for 18736 | 12,153 | 12,153 | 0 | 0 |
| Product with Most Rows | Banana Organic | Yes | NA | |
| Num Rows for that Product | 908,637 | 908,639 | -2 | 2 |
| Product with Fourth-Most Rows | Avocado Hass Organic | Yes | NA | |
| Num Rows for that Product | 456,771 | 456,771 | 0 | 0 |
| Num Single Record Products | 2,741 | 2,769 | -28 | 28 |
| Year with Highest Portion of Owner Rows | 2014 | Yes | NA | |
| Fraction of Rows from Owners in that Year | 75.91% | 75.91% | 0% | 0% |
| Year with Lowest Portion of Owner Rows | 2011 | Yes | NA | |
| Fraction of Rows from Owners in that Year | 73.72% | 73.72% | 0% | 0% |

Note: I have such a large difference in Null_TS because my cleaning converted missing values in the trans_subtype column to " " instead of NULL. I have 65,065,888 rows of " " in trans_subtype; taking this into account, there is still a difference of 1,239,440, with John having the greater Null_TS count.

Reflections

Overall, the Wedge Project was exciting as my first experience working with a cloud database. The experience gave me confidence in my ability to carry out the ETL process.

The process was messy. Some files were already clean, while others had no column names, had strings in columns with float datatypes, or were delimited with semicolons instead of commas.

I wanted to do each task in its own loop. For example, I tried to clean and upload all the data to GBQ in a single loop (I hope it is fully automatic by the time this is due, as I wanted). But sometimes a chunk of code would run for 15 minutes only to crash, and instead of tweaking the loop to restart where it left off, it was easier to add a 'manual' section where I would select a file by hand, then clean and upload it. Slowing down saved some money, but I lost time.

After completing the tasks, I had a lot of messy code to clean up, and I found errors in my cleaning, which created more mess. I am confident that any remaining errors are trivial, but I am still cleaning up and commenting the messy code.
