Language: Python
Introduction: This project is centred on a Kaggle competition and uses Machine Learning techniques to build a recommender system. In this competition, H&M Group invited contributors to make product recommendations based on data from previous transactions and on customer and product metadata. The task is to predict, for each customer, twelve article_ids that the customer might purchase during the seven-day period immediately following the two years of transaction data provided.
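To make the scoring of twelve ranked predictions per customer concrete, here is a minimal sketch of mean average precision at 12 (MAP@12), assuming that is the metric the competition's evaluation page describes. The function names and the dictionary-based inputs are illustrative assumptions and are not taken from the project code.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer.

    actual: collection of article_ids the customer really bought.
    predicted: ranked list of recommended article_ids (best first).
    """
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, article_id in enumerate(predicted[:k], start=1):
        # Count each correct article once, at the rank where it first appears.
        if article_id in actual_set and article_id not in seen:
            hits += 1
            score += hits / rank
        seen.add(article_id)
    return score / min(len(actual_set), k)


def mean_average_precision_at_k(actual_by_customer, predicted_by_customer, k=12):
    """Mean of AP@k over the customers present in actual_by_customer.

    Assumption: customers with no purchases in the target week are left out
    of actual_by_customer rather than scored as zero.
    """
    scores = [
        average_precision_at_k(actual, predicted_by_customer.get(customer_id, []), k)
        for customer_id, actual in actual_by_customer.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because hits are weighted by the rank at which they occur, placing a likely purchase first is worth more than placing it twelfth, which is why reordering the same twelve articles per customer can change the score.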
I decided to split the historical transaction data into two timeframes. The earlier data is used to generate information about each customer and to build an analytical base table, while the most popular articles purchased by customers in the later data serve as target labels for model training.
In terms of Machine Learning, I developed a twin approach. The first approach set out to predict later article_id purchases from previous purchases at the same granularity, article_id, as this retains the most information. This strategy proved too computationally expensive to run on Kaggle's public tier or Google Colab, so I instead sought to predict article_id purchases from previous purchases measured at a coarser grain. The pipeline used for the coarse-grain data can easily be adapted to the finer-grain data given access to more powerful hardware. The second approach aggregated information about each customer from their previous purchases, engineering features such as favourite colour and favourite pattern.
Unfortunately, neither model had much predictive power on its own. Neither was able to beat a non-personalised baseline that gives every customer the same prediction: the twelve most popular articles from the final week of data, which scored 0.0056 on Kaggle. I therefore combined the two approaches into a hybrid model. Happily, the hybrid model works! It beats the baseline score by assigning better individual recommendations to each customer from among the most popular articles.
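The time-based split and the non-personalised baseline described above can be sketched roughly as follows. This is a minimal illustration using pandas, not the project's actual pipeline: the column names t_dat, customer_id and article_id are assumed to match transactions_train.csv, the cutoff date is hypothetical, and the tiny DataFrame is made-up data.

```python
import pandas as pd


def split_by_date(transactions, cutoff):
    """Split transactions into an earlier and a later timeframe at `cutoff`."""
    cutoff = pd.Timestamp(cutoff)
    earlier = transactions[transactions["t_dat"] < cutoff]
    later = transactions[transactions["t_dat"] >= cutoff]
    return earlier, later


def most_popular_articles(transactions, n=12):
    """Return up to `n` article_ids ordered by purchase count, most popular first."""
    return transactions["article_id"].value_counts().head(n).index.tolist()


# Made-up transactions, for illustration only.
transactions = pd.DataFrame(
    {
        "t_dat": pd.to_datetime(
            [
                "2020-09-01", "2020-09-10", "2020-09-16", "2020-09-17",
                "2020-09-18", "2020-09-19", "2020-09-20", "2020-09-21",
            ]
        ),
        "customer_id": ["c1", "c2", "c1", "c2", "c3", "c1", "c2", "c3"],
        "article_id": ["0001", "0004", "0002", "0002", "0002", "0001", "0001", "0003"],
    }
)

# Hypothetical cutoff: everything before it feeds the analytical base table,
# everything from it onwards supplies the target labels.
earlier, later = split_by_date(transactions, "2020-09-16")

# Non-personalised baseline: the same most popular articles for every customer.
baseline = most_popular_articles(later, n=12)
print(baseline)
```

In a real run, the baseline list would be repeated for every customer_id in the submission, and a personalised model would only need to reorder or replace items within it to improve on that score.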
Data Files: There are three primary datasets to contend with: articles.csv, customers.csv, and transactions_train.csv. The articles.csv file contains information on the articles available for sale at H&M, the customers.csv file contains information on each customer in the database, and transactions_train.csv contains all purchases over a two-year period from September 2018 to September 2020. The grain of the purchasing data is per customer per article. A fourth file, sample_submission.csv, contains a sample submission. These files are too large to include here, but can be found at the link below.
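As a rough sketch of how the transaction file might be loaded with pandas, the helper below reads the identifiers as strings and parses the date column. It assumes the columns are named t_dat, customer_id and article_id, and that article_id values can carry leading zeros that would be lost if read as integers. The helper name load_transactions and the one-row sample are illustrative only.

```python
import io

import pandas as pd


def load_transactions(source):
    """Read a transactions CSV from a path or file-like object.

    Identifiers are kept as strings so that any leading zeros survive,
    and t_dat is parsed into a datetime column.
    """
    return pd.read_csv(
        source,
        dtype={"customer_id": str, "article_id": str},
        parse_dates=["t_dat"],
    )


# Stand-in for transactions_train.csv: a single made-up row with the assumed header.
sample_csv = io.StringIO(
    "t_dat,customer_id,article_id,price,sales_channel_id\n"
    "2018-09-20,abc123,0108775015,0.05,2\n"
)
sample = load_transactions(sample_csv)
print(sample.dtypes)
```

With the real data, the same function would be called with the path to the downloaded transactions_train.csv instead of the in-memory sample.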
Kaggle Competition: https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/overview