- Where do I spend most of my money?
- How much have I spent at each venue overall?
- When do I spend most of my money?
- What do I spend most of my money on?
- What products do I buy most often?
- How much do I spend per week, per month, per year?
- How many items do I buy per week, per month, per year?
- etc.
Having gathered several years' worth of data, I wanted to apply machine learning to it.
The project involves several tables from the database:
- Receipt table - This contains summary receipt data: total price, total number of items, receipt date, receipt time and shopping venue
- Payment table - This contains payment information, e.g. payment type (cash, card, plan) and Card_Source (Contactless, Pin, 0, DB, DD, Transfer)
- Item table - This contains item/product information, e.g. item name and item price
I need a set of tools that guide my expenditure, so that I feel more in control of
my spending while reducing the time I spend shopping.
I have a small fridge, so I don't tend to bulk buy and consume over a longer period.
Instead, I buy small quantities; as a result, I make many shopping trips each week, which consumes
a very important resource: time.
In addition, I go through phases where I buy lots of things in a short period of time,
whether online or in store. This reflects negatively on my budget.
With respect to the actual expenditure, I don't have a consistent spending pattern,
i.e. there is significant variance in expenditure between similar shopping trips.
I want to smooth the shopping experience: reduce the time spent shopping and the number of
trips I make per week, and reduce the variability in expenditure using a planning tool that
provides a good estimate of expenditure given a shopping list.
These tools will be used in combination to optimize the shopping experience: reduce time spent shopping and stabilize expenditure.
Michael Burry did not have a time series/forecasting model that predicted an imminent
stock market crash in 2008. Instead, he stumbled onto data that indicated
a serious problem in subprime lending; the rest was common sense.
Back in 2006-2007, no financial forecasting models successfully predicted the 2008 crash:
they didn't have the data that indicated a serious problem in the underlying assets, subprime loans.
Forecasting models do well when there is data supporting the trend:
you can see a wave building by the seaside and follow it until it collapses, but you can't
predict where and when it will collapse with certainty, nor where and when the next one will arise,
unless you have multiple detectors beneath the water.
But you know in advance that one will arise, eventually.
Cross validation is a powerful model validation technique: the data is split into k folds and each fold is used once as the test set.
This means that earlier folds (relative past data) end up validating models trained on later folds (relative future data), i.e. data that wouldn't
otherwise have been available at the time is used for model training. Once again, this is a form of data leakage.
Developing time series models with these techniques is inconsistent with the intuition behind the train/test split for time series.
Given the much-discussed success of these models, time series considerations when splitting the data into train and test sets should, by that logic,
be ignored in general, since the consideration is already violated in training the models.
So the question is: is time series forecasting real data science?
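To make the leakage concrete, here is a minimal sketch (assuming scikit-learn and a toy, date-ordered array where index position stands in for time) contrasting standard KFold, whose training folds can contain future rows, with TimeSeriesSplit, which only ever trains on the past:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Toy date-ordered data: the index position stands in for time.
X = np.arange(100).reshape(-1, 1)

# Standard k-fold: the training folds can contain indices *after* the test fold,
# i.e. the model is trained on "future" data relative to what it is tested on.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print("KFold   train max:", train_idx.max(), "test min:", test_idx.min())

# Time-series split: every training index precedes every test index,
# so the model never sees data that would not have been available at the time.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("TSSplit train max:", train_idx.max(), "test min:", test_idx.min())
```

With KFold, the printed training indices routinely exceed the test indices; with TimeSeriesSplit they never do.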
Inside the Data Extraction folder the following tasks are accomplished (a sketch follows the list):
- Data extraction from the database
- New feature creation (preliminary feature engineering)
- Classifier target/label definition
- Exporting the raw data to CSV format
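A minimal sketch of this step, assuming a SQLite database and hypothetical table names, column names, target definition and file paths (the real schema may differ):

```python
import sqlite3
import pandas as pd

# Hypothetical database path and query; adjust to the real schema.
conn = sqlite3.connect("shopping.db")
query = """
    SELECT r.receipt_date, r.receipt_time, r.venue, r.total_price, r.total_items,
           p.payment_type, p.card_source
    FROM Receipt r
    JOIN Payment p ON p.receipt_id = r.id
"""
df = pd.read_sql(query, conn, parse_dates=["receipt_date"])

# Preliminary feature engineering: calendar features derived from the receipt date.
df["weekday"] = df["receipt_date"].dt.day_name()
df["month"] = df["receipt_date"].dt.month

# Example classifier target: flag above-median spend (illustrative definition only).
df["high_spend"] = (df["total_price"] > df["total_price"].median()).astype(int)

# Export the raw data to CSV for the downstream notebooks.
df.to_csv("data/receipts_raw.csv", index=False)
```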
Inside the data folder you will find all of the datasets used in the project.
Individual notebooks will read in or create these datasets.
Exploring the distributions of the features.
In here the following tasks are accomplished (a sketch of the classifier preprocessing follows the list):
- Classifier feature engineering
  - Creating dummy variables with pd.get_dummies
  - Splitting the data into training and testing sets
  - Scaling the values using StandardScaler
  - Feature selection: removing low variance features
  - Feature selection: removing correlated features
  - Illustrating the features' capacity to distinguish between the target classes
  - Exporting the data for modelling
- Regressor feature engineering
  - Creating dummy variables with pd.get_dummies
  - Splitting the data into training and testing sets
  - Scaling the values using StandardScaler
  - Feature selection: removing low variance features
  - Feature selection: removing correlated features
  - Exporting the data for modelling
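A minimal sketch of the classifier preprocessing, continuing from the hypothetical CSV above (column names and thresholds are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("data/receipts_raw.csv")  # hypothetical file from the extraction step

# Dummy variables for the categorical columns.
X = pd.get_dummies(
    df.drop(columns=["receipt_date", "receipt_time", "high_spend"]),
    columns=["venue", "weekday", "payment_type", "card_source"],
)
y = df["high_spend"]

# Train/test split, stratified on the classifier target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the values; fit the scaler on the training set only.
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# Remove (near-)constant features.
vt = VarianceThreshold(threshold=0.01)
keep = X_train.columns[vt.fit(X_train).get_support()]
X_train, X_test = X_train[keep], X_test[keep]

# Remove one feature from each highly correlated pair.
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_train, X_test = X_train.drop(columns=to_drop), X_test.drop(columns=to_drop)
```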
In here is the classification and regression model training process (a sketch of the balancing and tuning steps follows the list). This includes:
- Classifier model training
  - Balancing the data with imblearn
  - Hyperparameter tuning with GridSearchCV
  - Model comparison: imbalanced vs balanced
  - Final feature selection using SelectFromModel
  - Final model evaluation with a confusion matrix
- Regressor model training
  - Hyperparameter tuning with GridSearchCV
  - Retrieving the best parameters for the top 4 models from GridSearchCV
  - Comparing VotingRegressor with the best model using cross validation
  - Working with the best model
  - Feature importance of the best model
  - Exporting the best model for evaluation
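A minimal sketch of the balancing and tuning steps, continuing from the preprocessing sketch and assuming imbalanced-learn with a random forest classifier (the actual model set and parameter grid may differ):

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Balance the training data with SMOTE (one of several imblearn options).
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Hyperparameter tuning with GridSearchCV (illustrative grid).
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_res, y_res)
print(grid.best_params_, grid.best_score_)
```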
In here are the classifier and regressor model evaluations (a sketch of the ROC and threshold analysis follows the list). This includes:
- Classifier model evaluation
  - Evaluating the CatBoost model
  - Comparing the CatBoost model with the Random Forest model
  - Visualizing predicted probabilities by class attribute
  - Visualizing the distribution of the predicted probability of class 1
  - Sensitivity threshold tuning
  - Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC)
  - Selecting sensitivity and specificity from the ROC curve using a function
  - Visualizing sensitivity vs specificity across threshold ranges
- Regressor model evaluation
  - Final feature selection using SelectFromModel on the best model
  - Final model training
  - Validating the final model
  - Feature importance of the top-features model
  - Exporting the evaluated best model
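A minimal sketch of the ROC and threshold analysis, assuming a fitted classifier `clf` with predict_proba (for example the tuned model above):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probability of class 1 on the test set.
proba = clf.predict_proba(X_test)[:, 1]

# ROC curve and AUC.
fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))

# Sensitivity/specificity at each candidate threshold.
sensitivity = tpr
specificity = 1 - fpr

# Pick the threshold that balances the two (one simple rule among many: Youden's J).
best = np.argmax(sensitivity + specificity - 1)
print("threshold:", thresholds[best], "sens:", sensitivity[best], "spec:", specificity[best])
```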
In here the models are explained from a global and a local perspective (a sketch follows the list). This includes:
- Classifier model explanation
  - Global fidelity: an explanation of the positive and negative relationships between the features and the target from a holistic, whole-model point of view
  - Local fidelity: an explanation of how the model behaves for a single prediction, i.e. the feature-by-feature contribution to the prediction
- Regressor model explanation
  - Global fidelity: an explanation of the positive and negative relationships between the features and the target from a holistic, whole-model point of view
  - Local fidelity: an explanation of how the model behaves for a single prediction, i.e. the feature-by-feature contribution to the prediction
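A minimal sketch of the global and local explanations, assuming SHAP and a fitted tree-based classifier `clf` whose shap_values come back as a single array (as with a binary CatBoost model); the plot choices are illustrative:

```python
import shap

# TreeExplainer works for tree-based models such as CatBoost or random forests.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

# Global view: overall positive/negative relationships between features and target.
shap.summary_plot(shap_values, X_test)

# Local view: feature-by-feature contribution to a single prediction.
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0], matplotlib=True)
```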
In here all the classifier, regressor and StandardScaler models are stored, ready for use.
Inside are all the modules required to generate new classifier and regressor predictions and to explain those predictions (a generic sketch follows the list).
This includes:
- Import the pipeline module output_pipeline to preprocess the data and generate the predictions and SHAP values.
- Import the fidelity module local_fidelity to generate the local explanation plots for the classifier and regressor models.
- Display the retrieved data using print()
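The interfaces of output_pipeline and local_fidelity are project-specific, so the sketch below shows the generic equivalent instead: loading the stored models with joblib and producing a prediction plus SHAP values for a single new observation (file names and feature values are hypothetical):

```python
import joblib
import pandas as pd
import shap

# Hypothetical file names; the stored models live in the models folder.
scaler = joblib.load("models/standard_scaler.pkl")
clf = joblib.load("models/classifier.pkl")

# A single new observation; in practice the columns must match those used to fit the scaler.
new_row = pd.DataFrame([{"total_items": 7, "month": 5, "venue_SuperMart": 1, "weekday_Friday": 1}])

X_new = scaler.transform(new_row)
print("predicted class:", clf.predict(X_new))
print("probability of class 1:", clf.predict_proba(X_new)[:, 1])

# SHAP values for the local explanation of this single prediction.
explainer = shap.TreeExplainer(clf)
print("feature contributions:", explainer.shap_values(X_new))
```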
This is where mock monitoring scenarios are developed to mimic real post-deployment situations;
two monitoring concerns are considered: no data drift and data drift (a sketch of a drift report follows the list).
This includes:
- Generating classifier datasets
  - Create the classifier reference dataset and append the classifier predictions to it.
  - Create the classifier current dataset without drift and append the classifier predictions to it.
  - Create the classifier current dataset with drift and append the classifier predictions to it.
  - Run various Evidently preset reports and export them to the Reports folder
- Generating regressor datasets
  - Create the regressor reference dataset and append the regressor predictions to it.
  - Create the regressor current dataset without drift and append the regressor predictions to it.
  - Create the regressor current dataset with drift and append the regressor predictions to it.
  - Run various Evidently preset reports and export them to the Reports folder
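A minimal sketch of one preset report, assuming the evidently Report/DataDriftPreset API (this API has changed between Evidently versions, so treat it as illustrative) and hypothetical file paths:

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical reference/current datasets created in the steps above.
reference = pd.read_csv("data/classifier_reference.csv")
current = pd.read_csv("data/classifier_current_drift.csv")

# Build and run a data drift preset report, then export it to the Reports folder.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("Reports/classifier_data_drift.html")
```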
In here the Evidently AI monitoring reports are stored. This includes:
- Data quality reports
- Data drift reports
- Model performance reports
- Rebuild the models to include important features that were left out by the SelectFromModel step
- Consider regularization parameters for the models; none have been specified so far, so the defaults were applied
- Add a data dictionary
- Develop time series models for comparison
- Build deep learning versions of the models for comparison
- Fully deploy the models with AWS