This repository contains a machine learning project that applies several classification models to predict online shoppers' purchase intentions. The dataset describes shoppers' browsing behavior and includes features that can be used to predict whether a visit ends in a purchase (the "Revenue" column) or not.
The dataset consists of the following information:
- Rows: 12,330
- Columns: 18
  - Administrative (Numerical)
  - Administrative_Duration (Numerical)
  - Informational (Numerical)
  - Informational_Duration (Numerical)
  - ProductRelated (Numerical)
  - ProductRelated_Duration (Numerical)
  - BounceRates (Numerical)
  - ExitRates (Numerical)
  - PageValues (Numerical)
  - SpecialDay (Numerical)
  - Month (Categorical)
  - OperatingSystems (Numerical)
  - Browser (Numerical)
  - Region (Numerical)
  - TrafficType (Numerical)
  - VisitorType (Categorical)
  - Weekend (Boolean)
  - Revenue (Boolean)
The independent variables used for prediction include:
- Administrative
- Administrative_Duration
- Informational
- Informational_Duration
- ProductRelated
- ProductRelated_Duration
- BounceRates
- ExitRates
- PageValues
- SpecialDay
- Month
- OperatingSystems
- Browser
- Region
- TrafficType
- VisitorType
- Weekend
The target variable for prediction is "Revenue."
Data visualization was used to understand how individual features relate to the target variable. The plots explore "Revenue" by "Month," "Revenue" by "Weekend," and "VisitorType" against each of the three duration features ("Administrative_Duration," "Informational_Duration," and "ProductRelated_Duration").
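The numbers behind plots like these are per-group purchase rates. The following is a minimal sketch of that aggregation, assuming pandas is available. The `sample` frame and the `purchase_rate` helper are made up for illustration and are not taken from the project's code.

```python
import pandas as pd

# Tiny made-up sample; the real dataset has 12,330 rows.
sample = pd.DataFrame({
    "Month": ["Nov", "Nov", "May", "May", "May", "Dec"],
    "VisitorType": ["Returning", "New", "Returning", "New", "Returning", "Returning"],
    "Revenue": [True, False, False, True, False, False],
})

def purchase_rate(df, column):
    """Share of sessions with Revenue == True for each value of `column`."""
    return df.groupby(column)["Revenue"].mean().sort_values(ascending=False)

month_rate = purchase_rate(sample, "Month")
visitor_rate = purchase_rate(sample, "VisitorType")
print(month_rate)
print(visitor_rate)
```

Calling `.plot(kind="bar")` on either result would turn the table into a bar chart if matplotlib is installed.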
Categorical data were encoded to make them suitable for machine learning models.
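The README does not say which encoding was used. As one possible approach, the sketch below one-hot encodes the two categorical columns with `pandas.get_dummies` and converts the boolean columns to 0/1. The sample values are illustrative only, and the choice of one-hot encoding is an assumption.

```python
import pandas as pd

# Made-up rows that mimic the categorical and boolean columns of the dataset.
sample = pd.DataFrame({
    "Month": ["Nov", "May", "Dec"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
    "Weekend": [False, True, False],
    "Revenue": [True, False, False],
})

# One-hot encode the categorical columns (assumed choice, not confirmed by the source).
encoded = pd.get_dummies(sample, columns=["Month", "VisitorType"], dtype=int)

# Booleans become 0/1 integers so that every column is numeric.
encoded["Weekend"] = encoded["Weekend"].astype(int)
encoded["Revenue"] = encoded["Revenue"].astype(int)

print(encoded.columns.tolist())
```

Label encoding (mapping each category to an integer) would be an alternative, although it imposes an arbitrary order on the categories.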
Feature scaling was applied using the StandardScaler to standardize numerical features.
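`StandardScaler` rescales each column to zero mean and unit variance. Here is a small sketch on made-up numbers, assuming scikit-learn and NumPy are installed. The array `X` is illustrative and not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up numeric features on very different scales.
X = np.array([[0.0, 10.0],
              [2.0, 30.0],
              [4.0, 50.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learns per-column mean/std, then applies them

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

A common practice is to call `fit` on the training portion only and reuse the fitted scaler with `transform` on the test portion, so that test statistics do not leak into training. Whether this project does so is not stated here.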
The dataset was split into a training set (70%) and a test set (30%) for model evaluation.
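A 70/30 split of this kind can be sketched with scikit-learn's `train_test_split`. The placeholder arrays below only reproduce the dataset's shape (12,330 rows, 17 features), and `random_state=42` is an arbitrary value chosen for reproducibility, not a setting taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows and features as the dataset.
X = np.zeros((12330, 17))
y = np.zeros(12330, dtype=int)

# test_size=0.3 holds out 30% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))
```

The resulting test size matches the support total of 3,699 reported in the classification report.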
Several classification models were applied to predict online shoppers' purchase intentions:
- Training Accuracy: 0.89
- Testing Accuracy: 0.87
- Training Accuracy: 0.81
- Testing Accuracy: 0.79
- Training Accuracy: 0.89
- Testing Accuracy: 0.87
- Training Accuracy: 0.87
- Testing Accuracy: 0.86
- Training Accuracy: 0.89
- Testing Accuracy: 0.87
- Training Accuracy: 0.90
- Testing Accuracy: 0.88
- Training Accuracy: 0.90
- Testing Accuracy: 0.89
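Each training/testing pair above comes from fitting a model on the training set and scoring it on both sets. The sketch below shows that pattern with `RandomForestClassifier`, the one model this README names explicitly. It runs on synthetic data from `make_classification`, and the hyperparameters are assumptions, so its numbers will not match the ones listed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (17 features, minority positive class).
X, y = make_classification(
    n_samples=600, n_features=17, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameters are illustrative, not the project's settings.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Training Accuracy: {train_acc:.2f}")
print(f"Testing Accuracy: {test_acc:.2f}")
```

A large gap between the two scores suggests overfitting. The pairs reported above differ by at most 0.02.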
The models were evaluated using both training and test data to assess their performance. The Random Forest classification model emerged as the best-performing model with the following evaluation metrics on the test set:
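- Confusion Matrix (rows: actual class, columns: predicted class):

  | | Predicted 0 | Predicted 1 |
  |---|---|---|
  | Actual 0 | 2957 | 120 |
  | Actual 1 | 266 | 356 |

- Accuracy Score: 0.8956
- Classification Report:

  | | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | 0 | 0.92 | 0.96 | 0.94 | 3077 |
  | 1 | 0.75 | 0.57 | 0.65 | 622 |
  | accuracy | | | 0.90 | 3699 |
  | macro avg | 0.83 | 0.77 | 0.79 | 3699 |
  | weighted avg | 0.89 | 0.90 | 0.89 | 3699 |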
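The Random Forest model achieved the highest test accuracy (0.8956), making it the preferred model for predicting online shoppers' purchase intentions. Its precision and recall are strong for class 0 (0.92 and 0.96) but weaker for class 1, the minority class (0.75 and 0.57), so a sizable share of class 1 sessions is still misclassified.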
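The reported test-set figures can be re-derived from the confusion matrix alone. The sketch below does the arithmetic in plain Python, assuming the scikit-learn convention that rows are actual classes and columns are predicted classes. The `class_metrics` helper is a name introduced here for illustration.

```python
# Confusion matrix from the Random Forest test-set evaluation.
confusion = [[2957, 120],
             [266, 356]]

def class_metrics(matrix, cls):
    """Return (precision, recall, f1) for class index `cls`."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum: predicted as cls
    actual = sum(matrix[cls])                    # row sum: actually cls
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

print(f"accuracy: {accuracy:.4f}")
for cls in (0, 1):
    p, r, f = class_metrics(confusion, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The results agree with the accuracy score and the per-class rows of the classification report.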
Feel free to explore the code and the dataset to gain a deeper understanding of the analysis and model implementation.