Music Tutoring Data Engineering Project

This project focuses on data engineering for a relatively small dataset related to a music tutoring service. The main objective is to process the data for visualization purposes.

The data, which is manually input by an admin, is sourced from Google Spreadsheets. The processed data will be visualized using Looker Studio. Given the small size of the data (less than 1000 rows), we use pandas for ETL (Extract, Transform, Load) operations. The aggregated results for Looker are stored in the same spreadsheet but on a new sheet. The pandas DataFrame acts as a temporary data warehouse that is cleared with each run. The ETL process is scheduled using GitHub Actions to run at the beginning of each month.

Additionally, unit tests are included and executed using pytest.

Data Source

As previously mentioned, the data for this project is sourced from a Google Spreadsheet. The following link provides access to the spreadsheet:

[ Google Spreadsheet Link ]

❗Please note that the data in the spreadsheet is dummy data that has been modified for the purposes of this project.

Tech Stack

Pandas: For ETL operations and data manipulation.
Looker Studio: For data visualization.
GitHub Actions: For scheduling and automation of ETL processes.
Google Cloud: For Google Spreadsheets and service accounts.

Dashboard

Here is the link to the Looker Studio dashboard built using the processed data:

[ Looker Studio Dashboard ]

How to reproduce this project?

Clone the repository:

git clone https://github.com/marliyehez/minimalistic-google-spreadsheet-etl.git
cd minimalistic-google-spreadsheet-etl

Install the required Python packages:

pip install -r requirements.txt

Create a service account that has access to read and write Google Spreadsheets and download the JSON key file.
Set the following environment variables:
- sheet_url: The URL of the Google spreadsheet that contains the raw data.
- service_account_path: The path to the service account JSON key file.
- recap_sheet_title: The title of the new sheet in the Google spreadsheet that will store the aggregated data for Looker Studio.
- month: (Optional) The month for which to generate the data. If not provided, the current month will be used. Specify the month using Indonesian month names, such as "Januari", "Februari", "Maret", etc.
Running ETL Locally To run the ETL process locally, execute:

python main.py

Running Tests To run the unit tests, use:

pytest -v

Scheduling ETL with GitHub Actions The ETL process is scheduled to run at the beginning of each month using GitHub Actions. The workflow configuration is located in github/workflows/actions.yml.
Open Looker Studio and create a new Looker Studio report.
Connect to the Google Sheets data source and select the new sheet that was created by the ETL script.
Create your visualizations.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
config		config
img		img
tests		tests
utils		utils
.gitignore		.gitignore
functions.py		functions.py
logger_config.py		logger_config.py
main.py		main.py
readme.md		readme.md
requirements.txt		requirements.txt
status.log		status.log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Music Tutoring Data Engineering Project

Data Source

Tech Stack

Dashboard

How to reproduce this project?

About

Releases

Packages

Languages

marliyehez/minimalistic-google-spreadsheet-etl

Folders and files

Latest commit

History

Repository files navigation

Music Tutoring Data Engineering Project

Data Source

Tech Stack

Dashboard

How to reproduce this project?

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages