Skip to content

Automating data engineering for a small Google Sheets dataset using Pandas and Github Actions.

Notifications You must be signed in to change notification settings

marliyehez/minimalistic-google-spreadsheet-etl

Repository files navigation

Music Tutoring Data Engineering Project

This project focuses on data engineering for a relatively small dataset related to a music tutoring service. The main objective is to process the data for visualization purposes.

The data, which is manually input by an admin, is sourced from Google Spreadsheets. The processed data will be visualized using Looker Studio. Given the small size of the data (less than 1000 rows), we use pandas for ETL (Extract, Transform, Load) operations. The aggregated results for Looker are stored in the same spreadsheet but on a new sheet. The pandas DataFrame acts as a temporary data warehouse that is cleared with each run. The ETL process is scheduled using GitHub Actions to run at the beginning of each month.

Additionally, unit tests are included and executed using pytest.

Data Source

As previously mentioned, the data for this project is sourced from a Google Spreadsheet. The following link provides access to the spreadsheet:

[ Google Spreadsheet Link ]

❗Please note that the data in the spreadsheet is dummy data that has been modified for the purposes of this project.

Tech Stack

  • Pandas: For ETL operations and data manipulation.
  • Looker Studio: For data visualization.
  • GitHub Actions: For scheduling and automation of ETL processes.
  • Google Cloud: For Google Spreadsheets and service accounts.

Dashboard

Here is the link to the Looker Studio dashboard built using the processed data:

[ Looker Studio Dashboard ]

How to reproduce this project?

  1. Clone the repository:
git clone https://github.com/marliyehez/minimalistic-google-spreadsheet-etl.git
cd minimalistic-google-spreadsheet-etl
  1. Install the required Python packages:
pip install -r requirements.txt
  1. Create a service account that has access to read and write Google Spreadsheets and download the JSON key file.

  2. Set the following environment variables:

    • sheet_url: The URL of the Google spreadsheet that contains the raw data.
    • service_account_path: The path to the service account JSON key file.
    • recap_sheet_title: The title of the new sheet in the Google spreadsheet that will store the aggregated data for Looker Studio.
    • month: (Optional) The month for which to generate the data. If not provided, the current month will be used. Specify the month using Indonesian month names, such as "Januari", "Februari", "Maret", etc.
  3. Running ETL Locally To run the ETL process locally, execute:

python main.py
  1. Running Tests To run the unit tests, use:
pytest -v
  1. Scheduling ETL with GitHub Actions The ETL process is scheduled to run at the beginning of each month using GitHub Actions. The workflow configuration is located in github/workflows/actions.yml.

  2. Open Looker Studio and create a new Looker Studio report.

  3. Connect to the Google Sheets data source and select the new sheet that was created by the ETL script.

  4. Create your visualizations.

About

Automating data engineering for a small Google Sheets dataset using Pandas and Github Actions.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages