- Where do I spend most of my money?
- How much have I spent at each venue overall?
- When do I spend most of my money?
- What do I spend most of my money on?
- What products do I buy most often?
- How much do I spend per week, per month, per year?
- How many items do I buy per week, per month, per year?
- etc.
Having gathered several years' worth of data, I wanted to apply machine learning to it.
The project involves several tables from the database:
- Receipt table - This contains summary receipt data: total price, total number of items, receipt date, receipt time and shopping venue
- Payment table - This contains payment information, e.g. payment type (cash, card, plan) and Card_Source (Contactless, Pin, 0, DB, DD, Transfer)
- Item table - This contains item/product information, e.g. item name and item price
I need a set of tools that guide my expenditure, so that I feel more in control of
my spending while reducing the time I spend shopping.
I have a small fridge, so I don't tend to bulk buy and consume over a longer period.
Instead, I buy small quantities; as a result, I make many shopping trips each week, which consumes
a very important resource: time.
In addition, I go through phases where I buy lots of things in a short period of time,
whether online or in store. This reflects negatively on my budget.
With respect to the actual expenditure, I don't have a consistent spending pattern,
i.e. there is significant variance in expenditure between similar shopping trips.
I want to smooth the shopping experience: reduce the time spent shopping and the number of
trips I make per week, and reduce the variability in expenditure using a planning tool that
provides a good estimate of expenditure given a shopping list.
These tools will be used in combination to optimize the shopping experience: reduce time spent shopping and stabilize expenditure.
Michael Burry did not have a time series/forecasting model that predicted an imminent
stock market crash in 2008. Instead, he stumbled onto data that indicated
a serious problem in subprime lending; the rest was common sense.
Back in 2006-2007, no financial forecasting models successfully predicted the 2008 crash:
they didn't have the data that indicated a serious problem in the underlying assets, subprime loans.
Forecasting models do well when there is data supporting the trend:
you can see a wave building by the seaside and follow it until it collapses, but you can't
predict where and when it will collapse with certainty, nor where and when the next one will arise,
unless you have multiple detectors beneath the water.
But you know in advance that one will arise, eventually.
Cross validation is a powerful model validation technique: the data is split into k folds and each fold is used once as the test set.
This means that earlier folds (relative past data) end up validating models trained on later folds (relative future data), i.e. data that wouldn't
otherwise have been available at the time is used for model training. Once again, this is a form of data leakage.
Developing time series models with these techniques is inconsistent with the intuition behind the train/test split for time series.
Given the much-discussed success of these models, time series considerations when splitting the data into train and test sets should, by that logic,
be ignored in general, since the consideration is already violated in training the models.
So the question is: is time series forecasting real data science?
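To make the leakage concrete, here is a minimal sketch (assuming scikit-learn and a toy, date-ordered array where index position stands in for time) contrasting standard KFold, whose training folds can contain future rows, with TimeSeriesSplit, which only ever trains on the past:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy date-ordered data: the index position stands in for time.
X = np.arange(100).reshape(-1, 1)

# Standard k-fold: the training folds can contain indices *after* the test fold,
# i.e. the model is trained on "future" data relative to what it is tested on.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print("KFold   train max:", train_idx.max(), "test min:", test_idx.min())

# Time-series split: every training index precedes every test index,
# so the model never sees data that would not have been available at the time.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("TSSplit train max:", train_idx.max(), "test min:", test_idx.min())
```

With KFold, the printed training indices routinely exceed the test indices; with TimeSeriesSplit they never do.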
Inside the Data Extraction folder the following tasks are accomplished (a sketch follows the list):
- Data extraction from the database
- New feature creation (preliminary feature engineering)
- Classifier target/label definition
- Exporting the raw data to CSV format
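A minimal sketch of this step, assuming a SQLite database and hypothetical table names, column names, target definition and file paths (the real schema may differ):

```python
import sqlite3
import pandas as pd

# Hypothetical database path and query; adjust to the real schema.
conn = sqlite3.connect("shopping.db")
query = """
    SELECT r.receipt_date, r.receipt_time, r.venue, r.total_price, r.total_items,
           p.payment_type, p.card_source
    FROM Receipt r
    JOIN Payment p ON p.receipt_id = r.id
"""
df = pd.read_sql(query, conn, parse_dates=["receipt_date"])

# Preliminary feature engineering: calendar features derived from the receipt date.
df["weekday"] = df["receipt_date"].dt.day_name()
df["month"] = df["receipt_date"].dt.month

# Example classifier target: flag above-median spend (illustrative definition only).
df["high_spend"] = (df["total_price"] > df["total_price"].median()).astype(int)

# Export the raw data to CSV for the downstream notebooks.
df.to_csv("data/receipts_raw.csv", index=False)
```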
Inside the data folder you will find all of the datasets used in the project.
Individual notebooks will read in or create these datasets.
Exploring the distributions of the features.
In here the following tasks are accomplished (a sketch of the classifier preprocessing follows the list):
- Classifier feature engineering
  - Creating dummy variables with pd.get_dummies
  - Splitting the data into training and testing sets
  - Scaling the values using StandardScaler
  - Feature selection: removing low variance features
  - Feature selection: removing correlated features
  - Illustrating the features' capacity to distinguish between the target classes
  - Exporting the data for modelling
- Regressor feature engineering
  - Creating dummy variables with pd.get_dummies
  - Splitting the data into training and testing sets
  - Scaling the values using StandardScaler
  - Feature selection: removing low variance features
  - Feature selection: removing correlated features
  - Exporting the data for modelling
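A minimal sketch of the classifier preprocessing, continuing from the hypothetical CSV above (column names and thresholds are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("data/receipts_raw.csv")  # hypothetical file from the extraction step

# Dummy variables for the categorical columns.
X = pd.get_dummies(
    df.drop(columns=["receipt_date", "receipt_time", "high_spend"]),
    columns=["venue", "weekday", "payment_type", "card_source"],
)
y = df["high_spend"]

# Train/test split, stratified on the classifier target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the values; fit the scaler on the training set only.
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# Remove (near-)constant features.
vt = VarianceThreshold(threshold=0.01)
keep = X_train.columns[vt.fit(X_train).get_support()]
X_train, X_test = X_train[keep], X_test[keep]

# Remove one feature from each highly correlated pair.
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_train, X_test = X_train.drop(columns=to_drop), X_test.drop(columns=to_drop)
```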
In here is the classification and regression model training process (a sketch of the balancing and tuning steps follows the list). This includes:
- Classifier model training
  - Balancing the data with imblearn
  - Hyperparameter tuning with GridSearchCV
  - Model comparison: imbalanced vs balanced
  - Final feature selection using SelectFromModel
  - Final model evaluation with a confusion matrix
- Regressor model training
  - Hyperparameter tuning with GridSearchCV
  - Retrieving the best parameters for the top 4 models from GridSearchCV
  - Comparing VotingRegressor with the best model using cross validation
  - Working with the best model
  - Feature importance of the best model
  - Exporting the best model for evaluation
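A minimal sketch of the balancing and tuning steps, continuing from the preprocessing sketch and assuming imbalanced-learn with a random forest classifier (the actual model set and parameter grid may differ):

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Balance the training data with SMOTE (one of several imblearn options).
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Hyperparameter tuning with GridSearchCV (illustrative grid).
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_res, y_res)
print(grid.best_params_, grid.best_score_)
```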
In here are the classifier and regressor model evaluations (a sketch of the ROC and threshold analysis follows the list). This includes:
- Classifier model evaluation
  - Evaluating the CatBoost model
  - Comparing the CatBoost model with the Random Forest model
  - Visualizing predicted probabilities by class attribute
  - Visualizing the distribution of the predicted probability of class 1
  - Sensitivity threshold tuning
  - Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC)
  - Selecting sensitivity and specificity from the ROC curve using a function
  - Visualizing sensitivity vs specificity across threshold ranges
- Regressor model evaluation
  - Final feature selection using SelectFromModel on the best model
  - Final model training
  - Validating the final model
  - Feature importance of the top-features model
  - Exporting the evaluated best model
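A minimal sketch of the ROC and threshold analysis, assuming a fitted classifier `clf` with predict_proba (for example the tuned model above):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probability of class 1 on the test set.
proba = clf.predict_proba(X_test)[:, 1]

# ROC curve and AUC.
fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))

# Sensitivity/specificity at each candidate threshold.
sensitivity = tpr
specificity = 1 - fpr

# Pick the threshold that balances the two (one simple rule among many: Youden's J).
best = np.argmax(sensitivity + specificity - 1)
print("threshold:", thresholds[best], "sens:", sensitivity[best], "spec:", specificity[best])
```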
In here the models are explained from a global and a local perspective (a sketch follows the list). This includes:
- Classifier model explanation
  - Global fidelity: an explanation of the positive and negative relationships between the features and the target from a holistic, whole-model point of view
  - Local fidelity: an explanation of how the model behaves for a single prediction, i.e. the feature-by-feature contribution to the prediction
- Regressor model explanation
  - Global fidelity: an explanation of the positive and negative relationships between the features and the target from a holistic, whole-model point of view
  - Local fidelity: an explanation of how the model behaves for a single prediction, i.e. the feature-by-feature contribution to the prediction
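A minimal sketch of the global and local explanations, assuming SHAP and a fitted tree-based classifier `clf` whose shap_values come back as a single array (as with a binary CatBoost model); the plot choices are illustrative:

```python
import shap

# TreeExplainer works for tree-based models such as CatBoost or random forests.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

# Global view: overall positive/negative relationships between features and target.
shap.summary_plot(shap_values, X_test)

# Local view: feature-by-feature contribution to a single prediction.
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0], matplotlib=True)
```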
In here all the classifier, regressor and StandardScaler models are stored, ready for use.
Inside are all the modules required to generate new classifier and regressor predictions and to explain those predictions (a generic sketch follows the list).
This includes:
- Import the pipeline module output_pipeline to preprocess the data and generate the predictions and SHAP values.
- Import the fidelity module local_fidelity to generate the local explanation plots for the classifier and regressor models.
- Display the retrieved data using print()
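The interfaces of output_pipeline and local_fidelity are project-specific, so the sketch below shows the generic equivalent instead: loading the stored models with joblib and producing a prediction plus SHAP values for a single new observation (file names and feature values are hypothetical):

```python
import joblib
import pandas as pd
import shap

# Hypothetical file names; the stored models live in the models folder.
scaler = joblib.load("models/standard_scaler.pkl")
clf = joblib.load("models/classifier.pkl")

# A single new observation; in practice the columns must match those used to fit the scaler.
new_row = pd.DataFrame([{"total_items": 7, "month": 5, "venue_SuperMart": 1, "weekday_Friday": 1}])

X_new = scaler.transform(new_row)
print("predicted class:", clf.predict(X_new))
print("probability of class 1:", clf.predict_proba(X_new)[:, 1])

# SHAP values for the local explanation of this single prediction.
explainer = shap.TreeExplainer(clf)
print("feature contributions:", explainer.shap_values(X_new))
```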
This is where mock monitoring scenarios are developed to mimic real post-deployment situations;
two monitoring concerns are considered: no data drift and data drift (a sketch of a drift report follows the list).
This includes:
- Generating classifier datasets
  - Create the classifier reference dataset and append the classifier predictions to it.
  - Create the classifier current dataset without drift and append the classifier predictions to it.
  - Create the classifier current dataset with drift and append the classifier predictions to it.
  - Run various Evidently preset reports and export them to the Reports folder
- Generating regressor datasets
  - Create the regressor reference dataset and append the regressor predictions to it.
  - Create the regressor current dataset without drift and append the regressor predictions to it.
  - Create the regressor current dataset with drift and append the regressor predictions to it.
  - Run various Evidently preset reports and export them to the Reports folder
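A minimal sketch of one preset report, assuming the evidently Report/DataDriftPreset API (this API has changed between Evidently versions, so treat it as illustrative) and hypothetical file paths:

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical reference/current datasets created in the steps above.
reference = pd.read_csv("data/classifier_reference.csv")
current = pd.read_csv("data/classifier_current_drift.csv")

# Build and run a data drift preset report, then export it to the Reports folder.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("Reports/classifier_data_drift.html")
```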
In here the Evidently AI monitoring reports are stored. This includes:
- Data quality reports
- Data drift reports
- Model performance reports
- Rebuild the models to include important features that were left out by the SelectFromModel step
- Consider regularization parameters for the models; none have been specified so far, so the defaults were applied
- Add a data dictionary
- Develop time series models for comparison
- Build deep learning versions of the models for comparison
- Fully deploy the models with AWS