ETL Data Pipeline with Kaggle, Airflow, Celery, PostgreSQL, Google Cloud Storage, BigQuery and Terraform.
I wanted to build a pipeline that handles large amounts of images and their metadata seamlessly. I imagined supporting a team of machine learning engineers who would need a pipeline that delivers standardized images to feed a developing model. The focus of this project was to sharpen my image-parsing skills with Python and to write refactorable code for similar purposes in the future. In future projects I will attempt to document my thought-process timeline in a dev_logs file.
This pipeline is designed to:
- Extract data from Kaggle using its API
- Transform the raw data by generating metadata, standardizing images, creating greyscale copies, and changing column types (a sketch of this step follows this list)
- Upload the data to GCS using folder-oriented, flexible functions, and load the transformed data into BigQuery.
- Perform aggregations on the image data and its metadata.
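Since the transform step is the focus of the project, here is a minimal sketch of what standardizing an image and writing a greyscale copy can look like with Pillow. The target size, directory layout, and function name are illustrative assumptions rather than the exact code in the DAG.

```python
from pathlib import Path

from PIL import Image

TARGET_SIZE = (224, 224)  # assumed target resolution for the downstream model

def standardize_image(src: Path, out_dir: Path, grey_dir: Path) -> dict:
    """Resize one image, save a greyscale copy, and return basic metadata."""
    out_dir.mkdir(parents=True, exist_ok=True)
    grey_dir.mkdir(parents=True, exist_ok=True)

    with Image.open(src) as img:
        original_width, original_height = img.size
        standardized = img.convert("RGB").resize(TARGET_SIZE)
        standardized.save(out_dir / src.name)

        greyscale = standardized.convert("L")  # single-channel copy
        greyscale.save(grey_dir / src.name)

    return {
        "file_name": src.name,
        "original_width": original_width,
        "original_height": original_height,
        "standardized_size": f"{TARGET_SIZE[0]}x{TARGET_SIZE[1]}",
    }
```

Running a function like this over every raw image yields both image sets plus one metadata record per file, which is what later gets loaded into BigQuery.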
- Kaggle API: Source of the data.
- Apache Airflow & Celery: Orchestrate the ETL process and manage task distribution.
- PostgreSQL: Temporary storage and metadata management.
- Google Cloud Storage (GCS): Raw data storage.
- Google BigQuery (BQ): Data warehousing, analytics, and SQL-based data transformation (an example aggregation appears below this list).
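To give a flavour of the aggregation step, here is a minimal sketch that runs a grouped query over the loaded metadata table with the google-cloud-bigquery client. The project, dataset, table, and column names are placeholders assumed for illustration (they mirror the transform sketch above), not the exact schema used in this repository.

```python
from google.cloud import bigquery

# Placeholder identifiers; substitute your own project, dataset, and table.
PROJECT_ID = "your-gcp-project"
TABLE = f"{PROJECT_ID}.fashion_dataset.image_metadata"

client = bigquery.Client(project=PROJECT_ID)

# Aggregate the image metadata: count images and average original dimensions
# per standardized size bucket.
query = f"""
    SELECT
        standardized_size,
        COUNT(*) AS image_count,
        AVG(original_width) AS avg_width,
        AVG(original_height) AS avg_height
    FROM `{TABLE}`
    GROUP BY standardized_size
    ORDER BY image_count DESC
"""

for row in client.query(query).result():
    print(row.standardized_size, row.image_count, row.avg_width, row.avg_height)
```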
I was able to learn the following new things as a result of building this pipeline:
- Airflow's PythonOperator is flexible enough to dynamically create tasks based on input configurations. Looping through a list of folder paths and generating an individual GCS upload task for each folder was an efficient approach I hadn't considered before (a sketch of this pattern follows this list).
- While working on this pipeline, I found that breaking tasks down into smaller, more focused functions helped me stay on top of development. With each function owning a single responsibility, the code stayed cleaner and more manageable, which made debugging, testing, and extending it much easier.
- I recently learned about using
  import os, sys
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
  during development, and it was a game changer for managing imports across different files. This line prepends the project root (the grandparent directory of the current file) to the Python path at runtime, making it easier to call functions from other files without running into import errors. It’s a small addition, but it really streamlined how I structure and reference my code in larger projects.
- One thing I realized while working on this project is the functional difference between declaring volumes in the Docker Compose file and copying files in the Dockerfile. Mounting volumes links external files or directories into the container at runtime, which means I can modify data or configuration without rebuilding the image. This saved time, especially when I needed to make configuration changes or work with persistent data.
- By avoiding unnecessary packages like gosu and vim, I saw that I could keep the container lightweight, reducing its attack surface and improving its overall performance. It reminded me that containers should be as simple as possible, with everything needed for production tasks and not much else.
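As referenced in the first point above, here is a minimal sketch of the dynamic-task pattern: looping over folder paths at DAG-parse time and creating one PythonOperator per folder that uploads its contents to GCS. The bucket name, folder list, and upload helper are assumptions for illustration, not the exact DAG shipped in this repository.

```python
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import storage

BUCKET_NAME = "your-gcs-bucket"                        # placeholder bucket
FOLDERS = ["images", "images_greyscale", "metadata"]   # illustrative folder list

def upload_folder_to_gcs(local_folder: str, bucket_name: str) -> None:
    """Upload every file in a local folder to GCS, keeping the folder as the prefix."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for path in Path(local_folder).rglob("*"):
        if path.is_file():
            bucket.blob(f"{local_folder}/{path.name}").upload_from_filename(str(path))

with DAG(
    dag_id="upload_folders_to_gcs",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # One upload task per folder; adding a folder to FOLDERS adds a task
    # without touching the rest of the DAG.
    for folder in FOLDERS:
        PythonOperator(
            task_id=f"upload_{folder}_to_gcs",
            python_callable=upload_folder_to_gcs,
            op_kwargs={"local_folder": folder, "bucket_name": BUCKET_NAME},
        )
```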
- Google Cloud account with appropriate permissions for GCS and BigQuery
- Kaggle API credentials
- Terraform installation
- Docker installation
- Python 3.11 or higher
- Clone the repository
  git clone https://github.com/Shegzimus/DE_Fashion_Product_Images.git
- Create a virtual environment on your local machine
  python3 -m venv venv
- Activate the virtual environment
  source venv/bin/activate
- Install dependencies
  pip install -r airflow/requirements.txt
- Create directories to store your Google and Kaggle credentials
  cd airflow && mkdir -p .google .kaggle
- Generate your Google service account credentials and place them in the .google directory and in the terraform directory
- Generate your Kaggle credentials and place them in the .kaggle directory
- Adjust the Dockerfile and docker-compose configuration to reflect these changes
- Build the Docker image
  docker build -t <image_name> .
- Start the Docker containers
  docker-compose up -d
- Launch the Airflow web UI
  open http://localhost:8081