Low Light Photography of Books by Suzy Hazelwood | Pexels Licensed
The goal of this project is to develop a recommendation system that provides a list of 10 books similar to a book that a customer has read. This project implements a collaborative filtering method via scikit-learn's K-Nearest Neighbours algorithm using the Amazon books dataset. The books dataset contains the book titles, ISBNs, authors, publishers and years of publication. The users dataset contains the user IDs, locations and ages. The ratings dataset contains user IDs, ISBNs and book rating scores. All datasets are a subset of books available on Amazon.
All datasets have been sourced via Kaggle's Books Dataset.
Ecommerce businesses serve online customers through their product and service offerings, but in the world of big data they need to provide a signal amongst the noise. Efficient filtering that extracts and surfaces useful value is critical to the success of an ecommerce business. This is where a recommendation system steps in.
Recommendation systems drive conversions, increase sales and revenue, and elevate the overall customer experience, promoting customer acquisition and satisfaction.
There are two popular primary recommendation system models.
- Collaborative Filtering - Recommends items based on similarity measures between users and/or items leveraging the use of a user-item matrix.
- Content-Based Filtering - Uses supervised machine learning to induce a classifier that discriminates between interesting and uninteresting items for the user.
This project implements the collaborative filtering recommendation system.
This model has a few core features that should be acknowledged when reviewing this project:
- The model's assumption is that people generally tend to like similar things
- Predictions are made based on item preferences of similar users
- User-Item matrix is used to generate recommendations
- Direct User Ratings are obtained through explicit feedback via rating scores
- Indirect User Behavior can be obtained through implicit feedback (such as listening, watching, purchasing, etc.)
The available dataset does not capture indirect user behavior, so implicit feedback is excluded from this project.
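To make the user-item matrix concrete, here is a toy sketch with made-up data (not the project dataset); the project's real matrix is built later from the Amazon ratings:

```python
import pandas as pd

# Toy ratings (made-up data, not the project dataset).
ratings = pd.DataFrame({
    "User-ID":     [1, 1, 2, 2, 3],
    "Book-Title":  ["Dune", "Emma", "Dune", "Hamlet", "Emma"],
    "Book-Rating": [9, 7, 8, 6, 10],
})

# Rows = users, columns = books, values = explicit rating scores.
# Missing entries (NaN) are books the user has not rated.
user_item = ratings.pivot_table(index="User-ID", columns="Book-Title", values="Book-Rating")
print(user_item)
```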
- For the data cleanup, refer to `cleanup.ipynb`.
- For exploratory analysis and recommendation system, refer to `recommendations.ipynb`.
- Raw and cleaned datasets are stored in the `Resources` folder.
In order to run this project, you will need the following libraries:
- pandas
- pathlib
- numpy
- re
- seaborn
- scipy
- sklearn
To initiate the cleanup process, a few key checks and actions were completed on all 3 datasets as required (a short sketch follows this list):
- Check nulls
- Check duplicates
- Manage nulls/duplicates
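A minimal sketch of these checks, assuming the raw CSVs are named `Books.csv`, `Users.csv` and `Ratings.csv` (the exact file names used in `cleanup.ipynb` may differ):

```python
import pandas as pd
from pathlib import Path

# Assumed file names; the raw CSVs are stored in the Resources folder.
books = pd.read_csv(Path("Resources/Books.csv"))
users = pd.read_csv(Path("Resources/Users.csv"))
ratings = pd.read_csv(Path("Resources/Ratings.csv"))

# Check nulls and duplicates for each dataset before deciding how to manage them.
for name, df in {"books": books, "users": users, "ratings": ratings}.items():
    print(f"{name}: {df.duplicated().sum()} duplicate rows")
    print(df.isnull().sum(), "\n")
```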
Following this standard cleanup, each dataframe was explored for its unique qualities to determine what other cleanup decisions were required to optimize the recommendation system for performance.
- Null Values - Some null values were identified for the author and publisher fields, so correct values were added to the books dataset by researching and cross-referencing ISBNs via Amazon and BookFinder.
- Year of Publication - I discovered that some observations contained publication years of 0, 2024, 2026, 2030, 2037, 2038 and 2050. The year 0 doesn't make sense, and neither do years in the future, so all observations with these values were dropped. After this operation, the oldest publication year is 1376 and the most recent is 2021.
- ISBNs and Book Titles - It should be noted that there are duplicate book titles due to certain books having multiple publishers or different years of publication. For example, The Left Hand of Darkness by Ursula K. Le Guin was published in 1984 by Penguin Putnam-Mass and again in 1999 by Sagebrush Bound. At this time, these duplications have not been managed, but there is a future opportunity to consolidate them to further optimize the recommendation system.
- Age - Age values that were less than 5 or greater than 90 were set to null. This was done since I believe it is unlikely that a person younger than 5 or older than 90 would be submitting ratings for books purchased via Amazon. Null values were then imputed with the dataset's average age of 35.
- Location - Some location values were not null, but were actually strings of 'n/a, n/a, n/a'. Observations with this value were dropped from the dataset.
- Book Rating - '0' - Of the 1,149,780 total observations, 716,109 had a book rating score of 0. A 0 rating provides no value to the recommendation system, so all observations with a 0 rating were removed from the dataset (a sketch of these cleanup steps follows this list).
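An illustrative sketch of these dataset-specific decisions, assuming the Kaggle column names (`Year-Of-Publication`, `Age`, `Location`, `Book-Rating`); the authoritative steps live in `cleanup.ipynb`:

```python
import numpy as np

# Year of Publication: drop year 0 and years in the future.
invalid_years = [0, 2024, 2026, 2030, 2037, 2038, 2050]
books["Year-Of-Publication"] = pd.to_numeric(books["Year-Of-Publication"], errors="coerce")
books = books[~books["Year-Of-Publication"].isin(invalid_years)]

# Age: treat implausible ages as missing, then impute the mean (~35).
users.loc[(users["Age"] < 5) | (users["Age"] > 90), "Age"] = np.nan
users["Age"] = users["Age"].fillna(round(users["Age"].mean()))

# Location: drop placeholder 'n/a, n/a, n/a' strings.
users = users[users["Location"].str.strip() != "n/a, n/a, n/a"]

# Book Rating: remove the 716,109 implicit 0 ratings.
ratings = ratings[ratings["Book-Rating"] > 0]
```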
The next step is to merge the 3 datasets into a single DataFrame. Exploring and understanding the data is important since we want to be sure we know what data and features are being fed into our machine learning model. Below I highlight some key statistics and visualizations.
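A minimal sketch of the merge, assuming the cleaned DataFrames are named `books`, `users` and `ratings` and share the Kaggle key columns `ISBN` and `User-ID`:

```python
# Combine ratings with book metadata and user information on their shared keys.
merged_df = (
    ratings
    .merge(books, on="ISBN", how="inner")
    .merge(users, on="User-ID", how="inner")
)
print(merged_df.shape)
```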
The books with the highest ratings count and mean include:
Below are the top 10 'super' raters:
Most users don't rate heavily, as shown by the average ratings per user above, though some 'super' raters do exist, as shown in the top 10 user ratings count above.
The rating distribution shows major peaks between 5 and 10. Books with the most ratings are largely scored in the 5-10 zone, with a heavy concentration between 7 and 9.
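These statistics and plots can be reproduced with pandas aggregations and a seaborn histogram; the sketch below assumes the merged DataFrame is named `merged_df` with the Kaggle column names:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Books with the highest ratings count and mean rating.
book_stats = (
    merged_df.groupby("Book-Title")["Book-Rating"]
    .agg(["count", "mean"])
    .sort_values("count", ascending=False)
)
print(book_stats.head(10))

# Top 10 'super' raters by number of ratings submitted.
print(merged_df["User-ID"].value_counts().head(10))

# Distribution of rating scores (concentrated in the 5-10 range).
sns.histplot(merged_df["Book-Rating"], bins=10)
plt.show()
```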
In order to feed the data into the machine learning model, the alphanumeric ISBN values had to be assigned unique integer IDs. This process was executed in the following steps:
- Use the `.ravel()` method to create an array of unique ISBN values and store it in the `book_ids` variable.
- Cast the `book_ids` array to a pandas Series.
- Convert `book_ids` to a pandas DataFrame.
- Reset the index of `book_ids` and rename the columns to ISBN and Book-ID.
- Merge the `book_ids` DataFrame with the larger merged dataset (a sketch of these steps follows this list).
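A sketch of these steps, assuming the combined DataFrame is named `merged_df`:

```python
# 1. Unique ISBNs as a flat array, stored in book_ids.
book_ids = merged_df["ISBN"].unique().ravel()

# 2-3. Cast to a pandas Series, then convert to a DataFrame.
book_ids = pd.Series(book_ids).to_frame()

# 4. Reset the index so each ISBN gets a unique integer, then rename the columns.
book_ids = book_ids.reset_index()
book_ids.columns = ["Book-ID", "ISBN"]

# 5. Merge the Book-IDs back onto the larger merged dataset.
merged_df = merged_df.merge(book_ids, on="ISBN", how="left")
```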
Leveraging the scipy library, I created a `create_matrix` function (see `recommendations.ipynb`).
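A minimal sketch of how such a function can be built on scipy's `csr_matrix`; the column names, return values and exact signature here are assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def create_matrix(df):
    """Build a sparse item-user matrix plus dictionaries mapping IDs to matrix indices."""
    n_users = df["User-ID"].nunique()
    n_books = df["Book-ID"].nunique()

    # Map real IDs to contiguous matrix indices, and back again.
    user_mapper = dict(zip(np.unique(df["User-ID"]), range(n_users)))
    book_mapper = dict(zip(np.unique(df["Book-ID"]), range(n_books)))
    user_inv_mapper = dict(zip(range(n_users), np.unique(df["User-ID"])))
    book_inv_mapper = dict(zip(range(n_books), np.unique(df["Book-ID"])))

    user_index = [user_mapper[i] for i in df["User-ID"]]
    book_index = [book_mapper[i] for i in df["Book-ID"]]

    # Rows are books, columns are users, values are explicit rating scores.
    X = csr_matrix((df["Book-Rating"], (book_index, user_index)), shape=(n_books, n_users))
    return X, user_mapper, book_mapper, user_inv_mapper, book_inv_mapper
```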
Then, I feed the mapping values to X in preparation for sklearn's K-Nearest Neighbours:
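For example, reusing the `create_matrix` sketch above, the matrix and mapper dictionaries can be unpacked in one call:

```python
# Build the sparse matrix X and keep the mappers for ID <-> index lookups.
X, user_mapper, book_mapper, user_inv_mapper, book_inv_mapper = create_matrix(merged_df)
```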
Next, I create a `find_similar_books` function to feed the data through the K-Nearest Neighbours machine learning model.
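The notebook holds the actual implementation; the following is a minimal sketch using sklearn's `NearestNeighbors`, reusing the mapper dictionaries from the `create_matrix` sketch (the signature and defaults are assumptions):

```python
from sklearn.neighbors import NearestNeighbors

def find_similar_books(book_id, X, k, metric="cosine"):
    """Return the Book-IDs of the k books most similar to book_id."""
    # book_mapper / book_inv_mapper come from the create_matrix sketch above.
    book_index = book_mapper[book_id]
    book_vector = X[book_index]

    # Ask for k + 1 neighbours because the closest match is the book itself.
    knn = NearestNeighbors(n_neighbors=k + 1, algorithm="brute", metric=metric)
    knn.fit(X)
    neighbour_indices = knn.kneighbors(book_vector, return_distance=False)[0]

    # Drop the queried book and map matrix rows back to Book-IDs.
    similar = [book_inv_mapper[i] for i in neighbour_indices]
    return [b for b in similar if b != book_id][:k]
```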
Finally, I assign books to a dictionary to feed to the `find_similar_books` function.
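A hedged sketch of the dictionary and a call that prints 10 recommendations (the Book-ID value below is a placeholder):

```python
# Map each Book-ID to its title so recommendations can be printed by name.
book_titles = dict(zip(merged_df["Book-ID"], merged_df["Book-Title"]))

book_id = 1  # placeholder: the Book-ID of a book the customer has read
similar_ids = find_similar_books(book_id, X, k=10)

print(f"Because you read {book_titles[book_id]}, you may also like:")
for similar_id in similar_ids:
    print(book_titles[similar_id])
```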
In order to find a recommendation, you will need to obtain the Book-ID from the ISBN, since the `find_similar_books` function requires the Book-ID to provide recommendations.
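For example, a lookup against the `book_ids` DataFrame built earlier recovers the Book-ID (the ISBN below is a placeholder):

```python
isbn = "0000000000"  # placeholder ISBN for a book the customer has read
book_id = book_ids.loc[book_ids["ISBN"] == isbn, "Book-ID"].iloc[0]
similar_ids = find_similar_books(book_id, X, k=10)
```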
- Book Title Cleanup - Remove book title duplicates with unique ISBNs
- Performance evaluation
- Tuning and exploring other machine learning clustering algorithms for best performance
This is an ongoing project and will be updated until the best performing recommendation system is developed.
- Amazon - Cross-referencing null values via Amazon.com
- BookFinder - Cross-referencing null values via BookFinder.com
- Geeks For Geeks - Find location of an element in pandas dataframe in python
- Geeks for Geeks - How to check string is alphanumeric or not using regular expressions
- Geeks for Geeks - Recommendation system in Python
- Kaggle - Books Dataset
- Kaggle - Recommender System for Books
- Nick McCullum - Recommendations Systems Python
- Stack Overflow - Assign Unique ID to columns pandas dataframe
- Scikit-learn - NearestNeighbors
- Scikit-learn - Sparse CSR Matrix
- Towards Data Science - Handling Sparse Matrix - Concept Behind Compressed Sparse Row (CSR) Matrix