Books Recommendation System

Low Light Photography of Books by Suzy Hazelwood | Pexels Licensed

Description & Objective

The goal of this project is to develop a recommendation system that provides a list of 10 books similar to a book that a customer has read. This project implements a collaborative filtering method using scikit-learn's K-Nearest Neighbours algorithm on the Amazon books dataset. The books dataset contains the book titles, ISBNs, authors, publishers and years of publication. The users dataset contains the user IDs, locations and ages. The ratings dataset contains user IDs, ISBNs and book rating scores. All datasets are a subset of books available on Amazon.

All datasets have been sourced via Kaggle's Books Dataset.

Why Build a Recommendation System?

Ecommerce serves online customers through its product and service offerings, but in the world of big data, ecommerce businesses need to provide a signal amongst the noise. Efficient filtering to extract and provide useful value is critical to the success of an ecommerce business. This is where a recommendation system steps in.

Recommendation systems drive conversions and increase sales and revenue while improving the overall customer experience, which in turn supports customer acquisition and satisfaction.

There are two popular types of recommendation system models:

Collaborative Filtering - Recommends items based on similarity measures between users and/or items, using a user-item matrix.
Content-Based Filtering - Uses supervised machine learning to induce a classifier that discriminates between items that are interesting and uninteresting to the user.

This project implements the collaborative filtering recommendation system.

Collaborative Filtering

This model has a few core features that should be acknowledged when reviewing this project:

The model's assumption is that people generally tend to like similar things
Predictions are made based on item preferences of similar users
User-Item matrix is used to generate recommendations
Direct User Ratings are obtained through explicit feedback via rating scores
Indirect User Behavior can be obtained through implicit feedback (such as listening, watching, purchasing, etc.)

The available dataset does not allow indirect user behavior to be incorporated, so it is excluded from this project.

Project Contents

For the data cleanup, refer to cleanup.ipynb.
For exploratory analysis and recommendation system, refer to recommendations.ipynb.
Raw and cleaned datasets are stored in the Resources folder.

Libraries

In order to run this project, you will need the following libraries:

pandas
pathlib (Python standard library)
numpy
re (Python standard library)
seaborn
scipy
sklearn (installed as scikit-learn)

Data Cleanup

To initiate the cleanup process a few key checks and actions were completed on all 3 datasets as required:

Check nulls
Check duplicates
Manage nulls/duplicates

Following this standard cleanup, each dataframe was explored for its unique qualities to determine what other cleanup decisions were required to optimize the recommendation system for performance.

Books Data Cleanup

Null Values - Some null values were identified in the author and publisher columns. The correct values were added to the books dataset after researching and cross-referencing ISBNs via Amazon and BookFinder.
Year of Publication - Some records had a publication year of 0, 2024, 2026, 2030, 2037, 2038 or 2050. A year of 0 is not valid, and neither are years in the future, so all observations with these values were dropped. After this operation, the oldest publication year is 1376 and the most recent is 2021.
ISBNs and Book Titles - It should be noted that there are duplicate book titles due to certain books having multiple publishers or different years of publication. For example, The Left Hand of Darkness by Ursula K. Le Guin was published in 1984 by Penguin Putnam-Mass and again in 1999 by Sagebrush Bound. At this time, these duplications have not been managed, but there is a future opportunity to consolidate these duplications to further optimize the recommendation system.
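The year filter described above can be sketched as follows. This is a minimal illustration on toy data, not the code from cleanup.ipynb; the column name `Year-Of-Publication` and the variable names are assumptions.

```python
import pandas as pd

# Toy stand-in for the books data; the column names are assumptions.
books = pd.DataFrame({
    "ISBN": ["0000000001", "0000000002", "0000000003", "0000000004"],
    "Year-Of-Publication": [1984, 0, 2037, 1999],
})

# Publication years the write-up identifies as invalid.
invalid_years = [0, 2024, 2026, 2030, 2037, 2038, 2050]

# Keep only the rows whose year is not in the invalid list.
books_clean = books[~books["Year-Of-Publication"].isin(invalid_years)]
books_clean = books_clean.reset_index(drop=True)

print(books_clean)
```

Filtering against an explicit list mirrors the description above. A range check (for example, keeping years between 1 and the current year) would be an alternative design.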

Users Data Cleanup

Age - Age values less than 5 or greater than 90 were set to null, since it is unlikely that a person younger than 5 or older than 90 would be submitting ratings for books purchased via Amazon. Null values were then imputed with the dataset's average age of 35.
Location - Some location values were not null, but were actually strings of 'n/a, n/a, n/a'. Observations with this value were dropped from the dataset.
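A minimal sketch of the two users cleanup steps, assuming columns named `User-ID`, `Location` and `Age`. The toy values are invented for illustration, and the mean is computed from this toy data rather than from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the users data; column names and values are assumptions.
users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Location": [
        "toronto, ontario, canada",
        "n/a, n/a, n/a",
        "austin, texas, usa",
        "london, england, united kingdom",
        "porto, porto, portugal",
    ],
    "Age": [3.0, 40.0, 120.0, np.nan, 30.0],
})

# Ages outside the 5 to 90 range become null (NaN).
users["Age"] = users["Age"].where(users["Age"].between(5, 90))

# Impute all null ages with the rounded mean of the remaining ages.
users["Age"] = users["Age"].fillna(round(users["Age"].mean()))

# Drop the rows whose location is the placeholder string.
users = users[users["Location"] != "n/a, n/a, n/a"].reset_index(drop=True)

print(users)
```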

Ratings Data Cleanup

Book Rating - '0' - Of the 1,149,780 total observations, 716,109 (about 62%) had a book rating score of 0. The 0 rating provides no value to the recommendation system, so all observations with a 0 rating were removed from the dataset.
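As a rough sketch, removing the 0 ratings is a one-line boolean filter. The column name `Book-Rating` and the toy rows are assumptions. The last lines redo the arithmetic on the counts reported above.

```python
import pandas as pd

# Toy stand-in for the ratings data; column names are assumptions.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 5, 0],
})

# Keep only the observations with a non-zero rating.
ratings_nonzero = ratings[ratings["Book-Rating"] != 0].reset_index(drop=True)

# Share of 0 ratings and remaining rows, using the counts from the write-up.
zero_share = 716_109 / 1_149_780
remaining = 1_149_780 - 716_109

print(f"{zero_share:.1%} of ratings are 0; {remaining:,} observations remain")
```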

Exploratory Analysis

The next step is to merge the 3 datasets into a single DataFrame. Exploring and understanding the data is important since we want to be sure we know what data and features are being fed into our machine learning model. Below I highlight some key statistics and visualizations.
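One way the merge could look is sketched below. The join keys (`ISBN` and `User-ID`) and the inner join type are assumptions, and the notebook may merge differently.

```python
import pandas as pd

# Toy stand-ins for the three cleaned datasets; column names are assumptions.
books = pd.DataFrame({"ISBN": ["A", "B"], "Book-Title": ["Title A", "Title B"]})
users = pd.DataFrame({"User-ID": [1, 2], "Age": [35, 40]})
ratings = pd.DataFrame({
    "User-ID": [1, 2, 2],
    "ISBN": ["A", "A", "C"],
    "Book-Rating": [8, 5, 7],
})

# Attach book details to each rating, then attach user details.
# An inner join drops ratings whose ISBN or user is missing from the other tables.
merged = (
    ratings
    .merge(books, on="ISBN", how="inner")
    .merge(users, on="User-ID", how="inner")
)

print(merged)
```

In this toy data the rating for ISBN "C" is dropped because no matching book exists, which is the behaviour to check for when choosing between an inner and a left join.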

Key Merged Dataset Statistics

Top 10 Books with Highest Ratings Count

The books with the highest ratings count and mean include:

Top 10 User Ratings Count

Below are the top 10 'super' raters:
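Rankings like these two can be computed with a groupby and a value count. The sketch below uses toy data and assumed column names; it does not reproduce the project's actual figures.

```python
import pandas as pd

# Toy stand-in for the merged dataset; column names are assumptions.
merged = pd.DataFrame({
    "User-ID": [1, 2, 3, 1, 2, 1],
    "Book-Title": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [8, 6, 10, 7, 9, 5],
})

# Ratings count and mean rating per book, sorted by count.
book_stats = (
    merged.groupby("Book-Title")["Book-Rating"]
    .agg(ratings_count="count", ratings_mean="mean")
    .sort_values("ratings_count", ascending=False)
)
top_books = book_stats.head(10)

# Number of ratings submitted per user, largest first.
top_raters = merged["User-ID"].value_counts().head(10)

print(top_books)
print(top_raters)
```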

Histogram - Ratings Count

Most users rate only a few books, as the ratings-per-user histogram shows. However, some 'super' raters do exist, as shown in the top 10 user ratings count above.

Histogram - Average Rating

There are major peaks where books are rated between 5-10.

Histogram - Ratings Average and Count Joint Plot

Books with the most ratings are largely scored in the 5-10 zone with heavy concentration in the 7-9 zone.

Recommendation System

In order to feed the data into the machine learning model, the alphanumeric ISBN values had to be assigned unique integer IDs. This process was executed in the following steps:

Use .ravel() method to create array of unique ISBN values and store in book_ids variable.
Cast book_ids array to pandas series.
Convert book_ids to pandas DataFrame
Reset index of book_ids, rename columns to ISBN and Book-ID
Merge book_ids DataFrame with larger merged dataset
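The five steps above can be sketched as follows. This is an illustration on toy data rather than the notebook's code. The `merged` DataFrame and its values are assumptions; `book_ids`, `ISBN` and `Book-ID` follow the names used in the steps.

```python
import pandas as pd

# Toy stand-in for the merged dataset; the values are invented.
merged = pd.DataFrame({
    "User-ID": [1, 2, 2, 3],
    "ISBN": ["034540", "044021", "034540", "038550"],
    "Book-Rating": [8, 5, 7, 9],
})

# 1. Flatten the ISBN values and keep each one once, in order of first appearance.
book_ids = pd.unique(merged["ISBN"].to_numpy().ravel())

# 2. Cast the array to a pandas Series.
book_ids = pd.Series(book_ids)

# 3. Convert the Series to a DataFrame.
book_ids = pd.DataFrame(book_ids)

# 4. Reset the index so the positional index becomes a column, then rename.
book_ids = book_ids.reset_index()
book_ids.columns = ["Book-ID", "ISBN"]

# 5. Merge the integer IDs back into the larger dataset.
merged = merged.merge(book_ids, on="ISBN", how="left")

print(merged)
```

A shorter alternative with the same effect would be `pd.factorize(merged["ISBN"])`, which returns the integer codes directly.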

Compressed Sparse Row Matrix

Leveraging the scipy library, I created a create_matrix function captured below:

Then, I feed the mapping values to X in preparation for sklearn's K-Nearest Neighbours:
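The function itself is not reproduced in this text. The following is a hedged sketch of what a create_matrix function of this kind might look like, assuming a book-by-user layout and columns named `User-ID`, `Book-ID` and `Book-Rating`. The returned mappers and their names are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def create_matrix(df):
    """Build a book-by-user CSR matrix plus index mappings (illustrative sketch)."""
    user_ids = np.unique(df["User-ID"]).tolist()
    book_ids = np.unique(df["Book-ID"]).tolist()

    # Map each raw ID to a contiguous row/column position, and back again.
    user_mapper = {uid: idx for idx, uid in enumerate(user_ids)}
    book_mapper = {bid: idx for idx, bid in enumerate(book_ids)}
    book_inv_mapper = {idx: bid for bid, idx in book_mapper.items()}

    row_idx = [book_mapper[b] for b in df["Book-ID"]]
    col_idx = [user_mapper[u] for u in df["User-ID"]]

    # Rows are books, columns are users, values are ratings.
    # Note: duplicate (book, user) pairs would be summed by csr_matrix.
    X = csr_matrix(
        (df["Book-Rating"].to_numpy(), (row_idx, col_idx)),
        shape=(len(book_ids), len(user_ids)),
    )
    return X, user_mapper, book_mapper, book_inv_mapper


# Toy ratings; the values are invented for illustration.
ratings = pd.DataFrame({
    "User-ID": [10, 20, 20, 30],
    "Book-ID": [0, 1, 0, 2],
    "Book-Rating": [8, 5, 7, 9],
})

X, user_mapper, book_mapper, book_inv_mapper = create_matrix(ratings)
print(X.toarray())
```

Storing books as rows means each row is a book's rating vector across users, which is the shape a nearest-neighbours search over books needs.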

Scikit-Learn's NearestNeighbors

Next, I create a find_similar_books function to feed the data through the K-Nearest Neighbours machine learning model:

Finally, I assign books to a dictionary to feed to the find_similar_books function.
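The notebook's function is not shown in this text, so the following is only a sketch of how a find_similar_books function could be built on scikit-learn's `NearestNeighbors`. The signature, the cosine metric, the brute-force algorithm, the toy matrix and the placeholder titles are all assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def find_similar_books(book_id, X, book_mapper, book_inv_mapper, k, metric="cosine"):
    """Return the Book-IDs of the k nearest books to book_id (illustrative sketch)."""
    query_row = book_mapper[book_id]
    book_vec = X[query_row]

    # Ask for k + 1 neighbours because the query book is its own nearest neighbour.
    knn = NearestNeighbors(n_neighbors=k + 1, algorithm="brute", metric=metric)
    knn.fit(X)
    neighbour_rows = knn.kneighbors(book_vec, return_distance=False)[0]

    # Drop the query book itself and translate row positions back to Book-IDs.
    similar = [book_inv_mapper[int(r)] for r in neighbour_rows if int(r) != query_row]
    return similar[:k]


# Toy book-by-user rating matrix (4 books, 3 users); the values are invented.
X = csr_matrix(np.array([
    [5, 4, 0],
    [5, 5, 0],
    [0, 0, 5],
    [4, 5, 1],
]))
book_mapper = {0: 0, 1: 1, 2: 2, 3: 3}
book_inv_mapper = {0: 0, 1: 1, 2: 2, 3: 3}

# Dictionary from Book-ID to title, with placeholder titles.
book_titles = {0: "Book A", 1: "Book B", 2: "Book C", 3: "Book D"}

similar_ids = find_similar_books(0, X, book_mapper, book_inv_mapper, k=2)
print(f"Since you read {book_titles[0]}:")
for book_id in similar_ids:
    print(book_titles[book_id])
```

Cosine distance compares the direction of rating vectors rather than their magnitude, so books rated by the same users in similar proportions end up close together. Other metrics, such as Euclidean, are possible and may rank neighbours differently.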

How to Find Recommendations

To find a recommendation, you will need to obtain the Book-ID from the ISBN, since the find_similar_books function requires the Book-ID to provide recommendations.
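A lookup along these lines could be used to go from an ISBN to its Book-ID. The helper name `get_book_id` and the toy rows are hypothetical. The `ISBN` and `Book-ID` columns follow the names used in this README.

```python
import pandas as pd

# Toy stand-in for the merged dataset; the ISBNs and titles are placeholders.
merged = pd.DataFrame({
    "ISBN": ["0000000001", "0000000002", "0000000001"],
    "Book-ID": [0, 1, 0],
    "Book-Title": ["Book A", "Book B", "Book A"],
})


def get_book_id(df, isbn):
    """Return the Book-ID for an ISBN, or raise KeyError if it is not present."""
    matches = df.loc[df["ISBN"] == isbn, "Book-ID"]
    if matches.empty:
        raise KeyError(f"ISBN {isbn!r} not found")
    return int(matches.iloc[0])


book_id = get_book_id(merged, "0000000002")
print(book_id)
```

The returned integer can then be passed to find_similar_books.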

Recommendation Samples

Since you read Brave New World:

Since you read The Da Vinci Code:

Next Steps

Book Title Cleanup - Remove book title duplicates with unique ISBNs
Performance evaluation
Tuning and exploring other machine learning clustering algorithms for best performance

This is an ongoing project and will be updated until the best performing recommendation system is developed.

Resources

Amazon - Cross-referencing null values with BookFinder.com
BookFinder - Cross-referencing null values with Amazon.com
Geeks for Geeks - Find location of an element in pandas dataframe in python
Geeks for Geeks - How to check string is alphanumeric or not using regular expressions
Geeks for Geeks - Recommendation system in Python
Kaggle - Books Dataset
Kaggle - Recommender System for Books
Nick McCullum - Recommendations Systems Python
Stack Overflow - Assign Unique ID to columns pandas dataframe
Scikit-learn - NearestNeighbors
Scikit-learn - Sparse CSR Matrix
Towards Data Science - Handling Sparse Matrix - Concept Behind Compressed Sparse Row (CSR) Matrix
