This project focuses on predicting the sales of various products in Big Mart outlets. It involves analyzing historical sales data, exploring key features influencing sales, and building machine learning models for accurate sales prediction. The objective is to assist Big Mart in optimizing inventory management and maximizing sales revenue.
The dataset consists of two files: `Train.csv`, containing the training data, and `Test.csv`, for testing. It includes features such as Item ID, Item Weight, Item Fat Content, Item Type, Outlet ID, Outlet Establishment Year, Outlet Size, Outlet Location Type, and Item Outlet Sales (the target variable).
- `Train.csv`: Contains the training dataset.
- `Test.csv`: Contains the testing dataset.
- `Big_Mart_Sales_Prediction.ipynb`: Jupyter Notebook containing data preprocessing, exploratory data analysis, model building, and evaluation.
- `models/`: Directory containing saved trained models.
  - `random_forest_grid.sav`: Saved Random Forest model after hyperparameter tuning.
  - `sc.sav`: Saved Standard Scaler object for preprocessing.
- `requirements.txt`: File listing the Python packages required to run the notebook.
- Handled missing values in Item Weight and Outlet Size using mean and mode imputation respectively.
- Performed label encoding for categorical variables.
- Split the data into training and testing sets.
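The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the column names (`Item_Weight`, `Outlet_Size`, `Item_Fat_Content`, `Item_Outlet_Sales`), the helper `preprocess`, the split ratio, and the toy rows are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Assumed column names; adjust them to match the headers in Train.csv.
CATEGORICAL_COLUMNS = ["Item_Fat_Content", "Outlet_Size"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values, then label-encode the categorical columns."""
    out = df.copy()
    # Mean imputation for the numeric column, mode imputation for the categorical one.
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Weight"].mean())
    out["Outlet_Size"] = out["Outlet_Size"].fillna(out["Outlet_Size"].mode()[0])
    # One LabelEncoder per column maps each category to an integer code.
    for col in CATEGORICAL_COLUMNS:
        out[col] = LabelEncoder().fit_transform(out[col])
    return out


# Toy rows standing in for Train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, 19.2, 8.9, 10.4],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular", "Low Fat", "Regular"],
    "Outlet_Size": ["Medium", "Medium", None, "Small", "High", "Medium"],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7, 556.6],
})

clean = preprocess(toy)
X = clean.drop(columns="Item_Outlet_Sales")
y = clean["Item_Outlet_Sales"]
# Hold out a third of the rows for testing; the ratio is an arbitrary choice here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```

With the real data, `toy` would be replaced by something like `pd.read_csv("Train.csv")`. Note that this sketch computes the imputation statistics on the full frame for brevity; computing them on the training split only avoids leaking information from the held-out rows.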
- Analyzed correlations between features using heatmap visualization.
- Utilized Pandas Profiling and Klib libraries for in-depth data exploration.
- Visualized distributions and missing values to gain insights into the data.
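As a rough illustration of the exploration steps above, the sketch below computes the share of missing values per column and draws a correlation heatmap using only pandas and matplotlib. It does not reproduce the Pandas Profiling or Klib reports used in the notebook; the column names and toy values are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy numeric columns standing in for Train.csv (values are made up).
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, 19.2, 8.9, 10.4],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1998, 1987, 2009],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7, 556.6],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Pairwise Pearson correlations; rows with a missing value are skipped per pair.
corr = toy.corr()

# Draw the correlation matrix as a heatmap with a fixed [-1, 1] colour scale.
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so the figure renders inline.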
- Trained machine learning models including Linear Regression, Random Forest Regressor, and XGBoost Regressor.
- Conducted hyperparameter tuning using GridSearchCV to optimize model performance.
- Evaluated models using metrics like R-squared, Mean Absolute Error, and Root Mean Squared Error.
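The tuning and evaluation steps above might look roughly like this for the Random Forest model. The parameter grid, fold count, and synthetic data are assumptions chosen to keep the example fast and self-contained; the grid used in the notebook may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the preprocessed Big Mart features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid; every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training split,
# so predict() uses the tuned model.
pred = search.predict(X_test)
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))

print("best params:", search.best_params_)
print(f"R2={r2:.3f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```

The same evaluation lines can be reused for Linear Regression or XGBoost by swapping the estimator, which keeps the three metrics comparable across models.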
- Install the required dependencies listed in `requirements.txt` using `pip install -r requirements.txt`.
- Run the Jupyter Notebook `Big_Mart_Sales_Prediction.ipynb` to execute the code.
- Ensure the dataset files `Train.csv` and `Test.csv` are in the appropriate directory.
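Once the notebook has run, the saved artifacts in `models/` can in principle be reloaded for prediction. The sketch below assumes the `.sav` files were written with `joblib` (they may instead have been written with `pickle`, in which case `pickle.load` is the matching call) and that the scaler is applied before the model. To stay self-contained, it first writes stand-in files to a temporary directory rather than reading the real ones.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data used only to create stand-in artifacts for this example.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 3))
y = X.sum(axis=1)

with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp) / "models"
    models_dir.mkdir()

    # Stand-ins for the files the notebook saves (assumed to be joblib dumps).
    scaler = StandardScaler().fit(X)
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(scaler.transform(X), y)
    joblib.dump(scaler, models_dir / "sc.sav")
    joblib.dump(model, models_dir / "random_forest_grid.sav")

    # Loading side: this is the part that would apply to the real models/ directory.
    loaded_scaler = joblib.load(models_dir / "sc.sav")
    loaded_model = joblib.load(models_dir / "random_forest_grid.sav")

# New rows must have the same columns, in the same order, as the training features.
new_rows = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
predictions = loaded_model.predict(loaded_scaler.transform(new_rows))
```

Only load `.sav` files from sources you trust, since both `joblib` and `pickle` can execute arbitrary code while deserializing.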
- Achieved significant improvement in model performance after hyperparameter tuning.
- Random Forest Regressor yielded the best performance with an R-squared score of over 0.60.
- Explore additional feature engineering techniques to enhance model performance.
- Experiment with further machine learning algorithms, such as other gradient boosting implementations (XGBoost is already covered above) and neural networks.
- Deploy the best-performing model into production for real-time sales predictions.
- Mayur Kyatham
- Utsav Kuntalwad
- Prerna Shakwar
- Srushti Sawant
This project is licensed under the MIT License.