Using a dataset from the UCI Machine Learning Repository that contains hourly and daily counts of rental bikes in Washington, D.C., I set out to build a model that predicts the number of rental bikes in use.
The tools, libraries and models that I used for this project were:
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Plotly
- Feature Scaling (MinMaxScaler and StandardScaler)
- Linear Regression (OLS)
- Poisson Regression Model
- Negative Binomial (NB) Regression Model
Before building any models, I performed exploratory data analysis (EDA), data cleaning, and data preprocessing:
- Dropping unnecessary columns;
- Identifying columns that are highly correlated with each other;
- Encoding categorical variables;
- Feature Scaling.
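The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the project: the data is synthetic, and the column names (`instant`, `season`, `hr`, `temp`, `atemp`, `hum`, `windspeed`, `cnt`), the 0.9 correlation threshold, and the choice of MinMaxScaler over StandardScaler are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the hourly data (column names are assumed, values are random).
rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(0, 1, n)
df = pd.DataFrame({
    "instant": np.arange(n),                          # row index, carries no signal
    "season": rng.integers(1, 5, n),                  # categorical, coded 1-4
    "hr": rng.integers(0, 24, n),
    "temp": temp,
    "atemp": 0.9 * temp + rng.normal(0, 0.01, n),     # nearly a copy of temp
    "hum": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 60, n),
    "cnt": rng.poisson(50, n),                        # target: bikes per hour
})

# 1. Drop unnecessary columns.
df = df.drop(columns=["instant"])

# 2. Identify highly correlated columns: look at the upper triangle of the
#    absolute correlation matrix and flag any column above the threshold.
numeric = ["temp", "atemp", "hum", "windspeed"]
corr = df[numeric].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)

# 3. Encode categorical variables as dummy columns.
df = pd.get_dummies(df, columns=["season"], drop_first=True, dtype=int)

# 4. Scale the remaining numeric features to the [0, 1] range.
scaled_cols = [c for c in numeric if c in df.columns]
df[scaled_cols] = MinMaxScaler().fit_transform(df[scaled_cols])
```

In this sketch `atemp` is flagged and dropped because it is almost perfectly correlated with `temp`. In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split.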
I tried several approaches to improve the adjusted R-squared and reduce the mean squared error. Of all the models I built, the Poisson Regression Model and the Negative Binomial Regression Model gave the most interesting results. Both are well suited to this kind of dataset, where the target is a count observed over a fixed time period. In this case, the target is the number of bikes rented per hour.
See the Python file for more information on these models and how they were built.