- Extract: Raw Data Exploration and Metadata
- Transform & Load: Cleaning and Uploading to GBQ
- Owner Queries
- Creating SQLite DB and Tables from GBQ Queries
In this data engineering project, I built an ETL pipeline for point-of-sale (POS) 🏪 data from the Wedge Co-Op in Minneapolis, spanning January 2010 to January 2017. The dataset captures transaction-level details from a member-owned cooperative. Roughly 75% of transactions come from member-owners, which makes detailed analysis of shopping patterns possible.
The purpose of this file is to summarize the Wedge Project and to check the accuracy of the ETL process. In the first section, I extracted the data from zip files and created metadata for exploration. In the second section, I wrote project-specific data cleaning functions, then transformed the data and loaded it into Google BigQuery (GBQ). In the third section, I ran owner-specific queries in GBQ and saved the results as text files. In the final section, I queried GBQ, downloaded the results, and loaded them into a SQLite database.
- Files for this task:
  - `explore_wedge.ipynb`: The notebook is structured to facilitate the exploration, summarization, and cleaning of a dataset related to the Wedge project, with a focus on handling CSV files efficiently using Polars.
  - `to_the_cloud.ipynb`: The notebook is designed to automate the process of loading, cleaning, and uploading data to BigQuery, ensuring that the data is in the correct format and structure for analysis.
- File for this task:
  - `GBQ_owner_query.ipynb`: This notebook automates the process of querying transaction data from BigQuery, handling card owner information, and saving the results for further analysis. It emphasizes data retrieval, processing, and error management in the context of working with large datasets.
- File for this task:
  - `building_summary_tables.ipynb`: The notebook automates the process of querying and processing transaction data from BigQuery, focusing on card owner information, and emphasizes data retrieval, processing, and error management in handling large datasets.
Assignment: Fill in the following table with the results from the queries contained in `gbq_assessment_query.sql`. You only need to fill in relative difference on the rows where it applies. When calculating relative difference, use the formula `(your_results - john_results) / john_results`.
Query | Your Results | John's Results | Difference | Rel. Diff |
---|---|---|---|---|
Total Rows | 85,760,124 | 85,760,139 | -15 | -1.75e-7 |
January 2012 Rows | 1,070,907 | 1,070,907 | 0 | 0 |
October 2012 Rows | 1,029,592 | 1,029,592 | 0 | 0 |
Month with Fewest | February (2) | Yes | Yes/No | NA |
Num Rows in Month with Fewest | 6,556,769 | 6,556,770 | -1 | -1.53e-7 |
Month with Most | May | Yes | Yes/No | NA |
Num Rows in Month with Most | 7,578,371 | 7,578,372 | -1 | -1.32e-7 |
Null_TS | 485,472 | 7,123,792 | -6,638,320 | -0.932 |
Null_DT | 0 | 0 | 0 | 0 |
Null_Local | 234,839 | 234,843 | -4 | -1.70e-5 |
Null_CN | 0 | 0 | 0 | 0 |
Num 5 on High Volume Cards | 14987 | Yes | Yes/No | NA |
Num Rows for Number 5 | 460,625 | 460,630 | -5 | -1.09e-5 |
Num Rows for 18736 | 12,153 | 12,153 | 0 | 0 |
Product with Most Rows | Banana Organic | Yes | Yes/No | NA |
Num Rows for that Product | 908,637 | 908,639 | -2 | -2.20e-6 |
Product with Fourth-Most Rows | Avocado Hass Organic | Yes | Yes/No | NA |
Num Rows for that Product | 456,771 | 456,771 | 0 | 0 |
Num Single Record Products | 2,741 | 2,769 | -28 | -0.0101 |
Year with Highest Portion of Owner Rows | 2014 | Yes | Yes/No | NA |
Fraction of Rows from Owners in that Year | 75.91% | 75.91% | 0% | 0% |
Year with Lowest Portion of Owner Rows | 2011 | Yes | Yes/No | NA |
Fraction of Rows from Owners in that Year | 73.72% | 73.72% | 0% | 0% |
Note: The large difference in Null_TS comes from my cleaning step, which wrote the string " " instead of NULL. My trans_subtype column contains 65,065,888 rows of " ". Taking those into account, a difference of 1,239,440 remains, with John still having the greater Null_TS count.
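
The relative-difference formula from the assignment can be checked with a few lines of Python. This is only a sketch: the function name `relative_difference` and its handling of a zero reference value are my own assumptions, and the inputs are copied from the "Total Rows" and "Num Single Record Products" rows of the table.

```python
def relative_difference(your_result: int, john_result: int) -> float:
    """Return (your_result - john_result) / john_result."""
    if john_result == 0:
        # The formula is undefined for a zero reference value.
        # Assumption: two zeros count as "no difference".
        if your_result == 0:
            return 0.0
        raise ZeroDivisionError("reference value is zero")
    return (your_result - john_result) / john_result


# Values copied from the table above.
total_rows = relative_difference(85_760_124, 85_760_139)
single_record = relative_difference(2_741, 2_769)

print(f"Total Rows: {total_rows:.2e}")
print(f"Num Single Record Products: {single_record:.2%}")
```

Because the result is divided by John's count, a gap of 15 rows out of roughly 85.8 million is far smaller in relative terms than a gap of 28 products out of 2,769.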
Overall, the Wedge Project was exciting because it was my first time working with a cloud database. The experience gave me confidence in my ability to carry out an ETL process.
The process was messy. Some files were already clean, while others had no column names, contained strings in columns meant to hold floats, or were delimited with semicolons instead of commas.
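
A minimal sketch of how those three problems could be handled with only the standard library is shown here. It is not the code from the notebooks: the column names in `COLUMNS`, the helper names, and the sample rows are all hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical column names, used only for illustration.
COLUMNS = ["datetime", "card_no", "total"]


def detect_delimiter(first_line: str) -> str:
    """Pick whichever of ';' or ',' appears more often in the first line."""
    return ";" if first_line.count(";") > first_line.count(",") else ","


def to_float(value: str):
    """Return a float, or None when the field holds a non-numeric string."""
    try:
        return float(value)
    except ValueError:
        return None


def read_rows(text: str) -> list[dict]:
    """Parse delimited text into dicts, tolerating a missing header row."""
    lines = text.splitlines()
    if not lines:
        return []
    delimiter = detect_delimiter(lines[0])
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Treat the first row as a header only if it matches the expected names.
    if [cell.strip().lower() for cell in rows[0]] == COLUMNS:
        rows = rows[1:]
    records = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, row))
        record["total"] = to_float(record.get("total", ""))
        records.append(record)
    return records


# A clean file: comma-delimited, with a header row.
clean = read_rows("datetime,card_no,total\n2012-01-01 09:00:00,18736,4.99\n")
# A messy file: semicolon-delimited, no header, a string in the float column.
messy = read_rows("2012-01-01 09:05:00;3;NULL\n")
```

Both inputs come back in the same shape, so later steps do not need to know which kind of file a row came from.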
I wanted to do each task in its own loop. For example, I tried to clean and upload all the data to GBQ in a single loop (I hope it is fully automatic by the time this is due, as I wanted). Sometimes, though, a chunk of code would run for 15 minutes and then crash. Instead of tweaking the loop to resume where it left off, I found it easier to make a 'manual' section where I selected a file by hand, then cleaned and uploaded it. Slowing down saved some money, but it cost me time.
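
One way such a loop could resume after a crash is to record each finished file in a small checkpoint file and skip those names on the next run. The sketch below is an assumption about how that might look, not what the notebooks do: `process_all`, `load_done`, the checkpoint file name, and the simulated `flaky_clean_and_upload` step are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_done(checkpoint: Path) -> set[str]:
    """Return the file names recorded as finished, or an empty set."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def process_all(files, process, checkpoint: Path) -> list[str]:
    """Run process(name) for each file not yet recorded in the checkpoint."""
    done = load_done(checkpoint)
    processed_now = []
    for name in files:
        if name in done:
            continue  # finished in an earlier run
        process(name)  # may raise; earlier progress is already saved
        done.add(name)
        checkpoint.write_text(json.dumps(sorted(done)))
        processed_now.append(name)
    return processed_now


files = ["a.csv", "b.csv", "c.csv"]
attempts = {"b.csv": 0}


def flaky_clean_and_upload(name: str) -> None:
    # Stand-in for a real clean-and-upload step; fails once on b.csv.
    if name == "b.csv" and attempts["b.csv"] == 0:
        attempts["b.csv"] += 1
        raise RuntimeError("simulated crash")


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "done.json"
    try:
        process_all(files, flaky_clean_and_upload, checkpoint)
    except RuntimeError:
        pass  # the first run crashes partway through
    after_crash = sorted(load_done(checkpoint))
    resumed = process_all(files, flaky_clean_and_upload, checkpoint)
```

The second call skips `a.csv` and picks up at `b.csv`, so a crash costs only the file that was in progress.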
After completing the tasks, I had a lot of messy code to clean up, and I found errors in my cleaning steps, which created more mess. I am confident that any remaining errors are trivial, but I am still cleaning up and commenting the code.