Welcome to my data engineering project - an end-to-end automated pipeline for analyzing electricity prices! 🚀
In this project, I've built a data pipeline that collects, transforms, and analyzes electricity price data from online sources. Using Python's Beautiful Soup, I wrote a web scraper that fetches daily electricity prices across Europe from 2022-01-01 to the present.
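As a rough sketch of the scraping step, the parser could pull date/price pairs out of an HTML table like this (the table structure, ids, and field names below are illustrative placeholders, not the real source's markup):

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page; the real scraper downloads HTML with requests.
SAMPLE_HTML = """
<table id="price-table">
  <tr><th>Date</th><th>Price (EUR/MWh)</th></tr>
  <tr><td>2022-01-01</td><td>120.5</td></tr>
  <tr><td>2022-01-02</td><td>98.3</td></tr>
</table>
"""

def parse_prices(html):
    """Extract (date, price) rows from the hypothetical price table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", id="price-table").find_all("tr")[1:]  # skip header
    records = []
    for row in rows:
        date_cell, price_cell = row.find_all("td")
        records.append({
            "date": date_cell.get_text(strip=True),
            "price_eur_mwh": float(price_cell.get_text(strip=True)),
        })
    return records

print(parse_prices(SAMPLE_HTML))
```

The same `parse_prices` function can then be called once per day inside the pipeline, appending each day's rows to the load queue.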
Behind the scenes, the infrastructure runs on my Ubuntu machine as two Docker containers. The first container hosts a PostgreSQL database where the data is stored. The second runs Airflow, which orchestrates the pipeline and triggers the DAG daily at 18:00.
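A minimal sketch of such a two-container setup as a docker-compose file (service names, image versions, ports, and credentials here are placeholders, not the project's actual configuration):

```yaml
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: pipeline        # placeholder credentials
      POSTGRES_PASSWORD: change-me
      POSTGRES_DB: electricity
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  airflow:
    image: apache/airflow:2.8.1
    command: standalone              # all-in-one mode, fine for a single machine
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://pipeline:change-me@postgres/electricity
    volumes:
      - ./dags:/opt/airflow/dags     # DAG files mounted from the host
    ports:
      - "8080:8080"
    depends_on:
      - postgres

volumes:
  pgdata:
```

With a layout like this, the DAG's `schedule` (a cron expression such as `0 18 * * *`) is what produces the daily 18:00 run.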
The Airflow DAG also contains the data transformation and cleaning scripts. These scripts validate and tidy the scraped data before loading it into the PostgreSQL database.
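For illustration, a cleaning step along these lines might normalize dates, coerce price strings to floats, and drop malformed or duplicate rows before the load. This is a simplified sketch with hypothetical field names, not the project's actual scripts:

```python
from datetime import datetime

def clean_rows(rows):
    """Normalize scraped rows: ISO dates, numeric prices, one record per day."""
    cleaned, seen = [], set()
    for row in rows:
        try:
            date = datetime.strptime(row["date"].strip(), "%Y-%m-%d").date().isoformat()
            # Some sources use a comma as the decimal separator.
            price = float(str(row["price"]).replace(",", "."))
        except (KeyError, ValueError):
            continue  # drop rows that are missing fields or fail to parse
        if date in seen:
            continue  # keep only the first record per day
        seen.add(date)
        cleaned.append({"date": date, "price_eur_mwh": price})
    return cleaned
```

The cleaned rows would then be bulk-inserted into PostgreSQL, for example with psycopg2's `executemany`.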
Once the data is in PostgreSQL, it's time for insights. Using Power BI, I've built a series of interactive dashboards that visualize electricity price trends, helping stakeholders spot patterns at a glance.
Finally, I've implemented time-series models that forecast tomorrow's electricity prices, giving decision-makers a day of foresight in a fast-moving energy market. Among the candidates I evaluated, an ARIMA model performed best.
From web scraping to predictive modeling, this project is a complete end-to-end solution: it automates data collection and processing, delivers actionable dashboards, and adds next-day price forecasts on top. It reflects the range of skills I bring as a data engineer.
You might wonder why I didn't use a cloud provider's virtual machine. The reason is the scale of the data: with fewer than a million rows so far, running everything on my local Ubuntu machine keeps project costs at essentially zero. Because the whole stack is containerized, migrating to the cloud later remains straightforward if the data outgrows the hardware.