Movie Recommendation System using model-based collaborative filtering, based on over 100,000 ratings applied to more than 9,000 movies.

waldysetio/movie-recommender

Movie Recommendation Engine

Author: Waldy Setiono (waldysetiono@gmail.com)

Introduction

Recommendation systems are widely used in daily life and, to some extent, play a significant role in shaping the decisions we make. Almost everything we buy, watch, consume, use, or even do is influenced by some form of recommendation, be it from friends, family, a Google search, a shaman, a preacher, a political leader, an advisor, a lawyer, a doctor, a scholar, online reviews, an app algorithm, and so on. Big companies gain substantial revenue growth by implementing recommender engines on their platforms.

Recommendation systems can be built using:

  1. Content-based Filtering,
  2. Collaborative Filtering, or
  3. Combination of both (hybrid)

Content-based filtering tries to guess what a user may like from that user's own activity, while collaborative filtering predicts what a user might like from the preferences of other, similar users. Collaborative filtering can be memory-based or model-based.
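To make the collaborative idea concrete, here is a minimal memory-based sketch on a toy matrix. The ratings, the `cosine_similarity` helper, and the single-nearest-neighbour rule are illustrative assumptions, not part of this project, which uses the model-based approach instead.

```python
import numpy as np

# Toy utility matrix (hypothetical data): rows are users, columns are movies, 0 = not rated.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0, the target user
    [4.0, 5.0, 5.0, 1.0],   # user 1
    [1.0, 1.0, 2.0, 5.0],   # user 2
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 0
others = [u for u in range(len(toy_ratings)) if u != target]

# Compare the target user with every other user.
similarities = {u: cosine_similarity(toy_ratings[target], toy_ratings[u]) for u in others}
nearest = max(similarities, key=similarities.get)

# Suggest movies the nearest neighbour rated that the target user has not rated.
unseen = np.flatnonzero((toy_ratings[target] == 0) & (toy_ratings[nearest] > 0))
print("Most similar user:", nearest)
print("Movie columns to suggest:", unseen.tolist())
```

A memory-based method like this works directly on the stored ratings at query time. A model-based method first fits a compact model of the ratings and then works on that model.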

This project aims to develop an end-to-end recommendation system that suggests movies a user might like, using model-based collaborative filtering.

Data Preparation

# Import packages.
import pandas as pd
import numpy as np
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import os
import platform
import pprint
from typing import Dict, Text
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD

Data: The data used in this project is from GroupLens, a research lab at the University of Minnesota. The dataset contains over 100,000 ratings of more than 9,000 movies by about 600 users.

# Load dataset.
zipurl = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
with urlopen(zipurl) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zfile:
        zfile.extractall("/tmp/movielens")

Titles and Genres

# Check movie titles.
movies = pd.read_csv('/tmp/movielens/ml-latest-small/movies.csv')
movies
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
... ... ... ...
9737 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy
9738 193583 No Game No Life: Zero (2017) Animation|Comedy|Fantasy
9739 193585 Flint (2017) Drama
9740 193587 Bungo Stray Dogs: Dead Apple (2018) Action|Animation
9741 193609 Andrew Dice Clay: Dice Rules (1991) Comedy

9742 rows × 3 columns

# Print how many unique values of each column. 
print("There are ", movies.movieId.nunique(), "unique values in movieID.")
print("There are ", movies.title.nunique(), "unique values in title.")
print("There are ", movies.genres.nunique(), "unique values in genres.")
There are  9742 unique values in movieID.
There are  9737 unique values in title.
There are  951 unique values in genres.
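The counts above show 9742 movie IDs but only 9737 titles, so a few titles are shared by more than one ID. This matters later, because the utility matrix is indexed by title. The sketch below shows one way to list such rows. It uses a small hypothetical frame (`toy_movies`) in place of `movies` so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for movies.csv (hypothetical rows): two different movieIds share one title.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Alpha (2000)", "Beta (2001)", "Alpha (2000)", "Gamma (2002)"],
})

# keep=False marks every row whose title occurs more than once, not just the later copies.
duplicate_titles = toy_movies[toy_movies.duplicated("title", keep=False)]
print(duplicate_titles.sort_values("title"))
```

Running the same expression on `movies` should list the titles behind the gap between the two counts.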

Ratings

# Make a dataframe of ratings.
ratings = pd.read_csv('/tmp/movielens/ml-latest-small/ratings.csv')
ratings
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
... ... ... ... ...
100831 610 166534 4.0 1493848402
100832 610 168248 5.0 1493850091
100833 610 168250 5.0 1494273047
100834 610 168252 5.0 1493846352
100835 610 170875 3.0 1493846415

100836 rows × 4 columns

# Drop timestamp from the dataframe.
ratings = ratings.drop(columns=["timestamp"])
ratings
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
... ... ... ...
100831 610 166534 4.0
100832 610 168248 5.0
100833 610 168250 5.0
100834 610 168252 5.0
100835 610 170875 3.0

100836 rows × 3 columns

# Print how many unique values of each column. 
print("There are ", ratings.userId.nunique(), "unique values in userID.")
print("There are ", ratings.movieId.nunique(), "unique values in movieID.")
print("There are ", ratings.rating.nunique(), "unique values in rating.")
There are  610 unique values in userID.
There are  9724 unique values in movieID.
There are  10 unique values in rating.
# Check missing values.
null_data = ratings[ratings.isnull().any(axis=1)]
null_data
userId movieId rating

The result is empty, so the dataframe has no missing values.

Movies and Ratings

# Merge movies and ratings.
movies_ratings = pd.merge(ratings, movies, on='movieId')
movies_ratings = movies_ratings.drop(columns=["movieId", "genres"])
movies_ratings
userId rating title
0 1 4.0 Toy Story (1995)
1 5 4.0 Toy Story (1995)
2 7 4.5 Toy Story (1995)
3 15 2.5 Toy Story (1995)
4 17 4.5 Toy Story (1995)
... ... ... ...
100831 610 2.5 Bloodmoon (1997)
100832 610 4.5 Sympathy for the Underdog (1971)
100833 610 3.0 Hazard (2005)
100834 610 3.5 Blair Witch (2016)
100835 610 3.5 31 (2016)

100836 rows × 3 columns

Popularity-based Recommender

One of the simplest movie recommenders is a popularity-based one. For example, it can suggest the 10 movies with the most ratings.

# Recommend movies based on rating counts.
rating_count = pd.DataFrame(movies_ratings.groupby("title")["rating"].count())
rating_count.sort_values("rating", ascending=False).head(10)
rating
title
Forrest Gump (1994) 329
Shawshank Redemption, The (1994) 317
Pulp Fiction (1994) 307
Silence of the Lambs, The (1991) 279
Matrix, The (1999) 278
Star Wars: Episode IV - A New Hope (1977) 251
Jurassic Park (1993) 238
Braveheart (1995) 237
Terminator 2: Judgment Day (1991) 224
Schindler's List (1993) 220
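Rating counts measure how often a movie was rated, not how well. A possible refinement, sketched here on hypothetical data, is to rank by mean rating while requiring a minimum number of ratings. The `min_count` threshold is an assumed parameter, not a value from this project.

```python
import pandas as pd

# Toy ratings (hypothetical): movie "C" has a perfect score but only one rating.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 5.0, 4.0, 5.0],
})

# Count and mean of the ratings for each title.
stats = toy.groupby("title")["rating"].agg(["count", "mean"])

min_count = 2  # assumed threshold; tune it to the size of the dataset
popular = (
    stats[stats["count"] >= min_count]
    .sort_values(["mean", "count"], ascending=False)
)
print(popular)
```

The threshold keeps a movie with a single enthusiastic rating from topping the list.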

Utility Matrix

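To build a recommender based on collaborative filtering, we first create a utility matrix with a pivot table. Each row is a movie title, each column is a user ID, and each cell holds that user's rating of that movie. Movies a user has not rated are filled with 0. Because the rows are indexed by title, ratings of movies that share a title are averaged, which is the default aggregation of pivot_table.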

# Create utility matrix using pivot table.
X = movies_ratings.pivot_table(values='rating', index='title', columns='userId').fillna(0)
X
userId 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ... 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610
title
'71 (2014) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0
'Hellboy': The Seeds of Creation (2004) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
'Round Midnight (1986) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
'Salem's Lot (2004) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
'Til There Was You (1997) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
eXistenZ (1999) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.5 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 4.5 0.0 0.0
xXx (2002) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.5 0.0 2.0
xXx: State of the Union (2005) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.5
¡Three Amigos! (1986) 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
À nous la liberté (Freedom for Us) (1931) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

9719 rows × 610 columns
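With 100836 ratings spread over 9719 × 610 cells, at most about 1.7% of the matrix holds a rating, and the rest is filled with zeros. The sketch below builds the same kind of pivot on a hypothetical toy frame and measures its sparsity. The variable names are illustrative only.

```python
import pandas as pd

# Toy long-format ratings (hypothetical), shaped like movies_ratings.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["A", "B", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# Rows are titles, columns are users; missing ratings become 0.
toy_matrix = toy.pivot_table(values="rating", index="title", columns="userId").fillna(0)

# Fraction of cells without a rating (valid here because real ratings are never 0).
sparsity = float((toy_matrix == 0).to_numpy().mean())
print(toy_matrix)
print(f"Sparsity: {sparsity:.2%}")
```

Note that filling with 0 treats "not rated" as a very low rating, which is a simplification that the decomposition step inherits.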

Matrix Decomposition

# Decompose utility matrix using Truncated SVD
svd = TruncatedSVD(n_components=12, random_state=17)
decomposed_matrix = svd.fit_transform(X)

# Check the resultant matrix shape
decomposed_matrix.shape
(9719, 12)
print(svd.explained_variance_ratio_)
[0.17452408 0.04189715 0.02633773 0.02137632 0.0185918  0.01612086
 0.0143732  0.01178569 0.01147853 0.0099213  0.00934755 0.00905726]
print(svd.explained_variance_ratio_.sum())
0.3648114678094881
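The 12 components above account for about 36% of the variance. One possible way to choose the number of components is to probe a larger number and read off the cumulative explained variance. The sketch below does this on a synthetic low-rank matrix. The data, the probe size of 20, and the 90% target are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(17)

# Hypothetical stand-in for the utility matrix: 200 items x 50 users,
# built from 5 latent factors plus a little noise.
toy_matrix = rng.random((200, 5)) @ rng.random((5, 50)) + 0.01 * rng.random((200, 50))

# Fit with more components than we expect to need.
svd_probe = TruncatedSVD(n_components=20, random_state=17)
svd_probe.fit(toy_matrix)

# Running total of the explained variance ratio, one entry per component.
cumulative = np.cumsum(svd_probe.explained_variance_ratio_)

target = 0.90  # assumed threshold
# First position where the running total reaches the target, as a component count.
# If the target is never reached, this exceeds the number of probed components.
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(f"{n_needed} components reach {target:.0%} of the variance")
```

On the real utility matrix the curve will rise far more slowly, so the trade-off between the number of components and the retained variance has to be judged case by case.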

Recommender

# Generate correlation matrix
corr_matrix = np.corrcoef(decomposed_matrix)
print(corr_matrix.shape)
corr_matrix
(9719, 9719)





array([[ 1.        ,  0.20967451,  0.30277437, ...,  0.79074266,
        -0.09266651, -0.11632059],
       [ 0.20967451,  1.        ,  0.93621217, ...,  0.11127732,
         0.03997583, -0.24647969],
       [ 0.30277437,  0.93621217,  1.        , ...,  0.10717506,
         0.19895528,  0.01216579],
       ...,
       [ 0.79074266,  0.11127732,  0.10717506, ...,  1.        ,
        -0.11547412, -0.11670845],
       [-0.09266651,  0.03997583,  0.19895528, ..., -0.11547412,
         1.        ,  0.32751487],
       [-0.11632059, -0.24647969,  0.01216579, ..., -0.11670845,
         0.32751487,  1.        ]])
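`np.corrcoef` treats each row as one variable, so entry (i, j) is the Pearson correlation between the 12-component vectors of movies i and j. The sketch below checks this on hypothetical random factors by computing the same values as cosine similarities of mean-centered rows. The names `toy_factors` and `manual` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for decomposed_matrix: 4 movies x 12 latent components.
toy_factors = rng.normal(size=(4, 12))

# Rows are the variables, so the result is 4 x 4.
corr = np.corrcoef(toy_factors)

# Pearson correlation equals the cosine similarity of mean-centered vectors.
centered = toy_factors - toy_factors.mean(axis=1, keepdims=True)
unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
manual = unit @ unit.T

print(np.allclose(corr, manual))
```

For the full data the result is a dense 9719 × 9719 array of 64-bit floats, which takes roughly 0.75 GB of memory.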
# Create list of movies names
movies_names = X.index
movies_list = list(movies_names)

Suppose we want to recommend movies similar to Spider-Man.

# Find candidate titles on which to base the recommendation
basis_movie = movies_names.str.contains('Spider', regex=False)
for x in range(len(basis_movie)):
  if basis_movie[x]:
    print(movies_names[x])
Along Came a Spider (2001)
Amazing Spider-Man, The (2012)
Giant Spider Invasion, The (1975)
Horrors of Spider Island (Ein Toter Hing im Netz) (1960)
Kiss of the Spider Woman (1985)
Spider (2002)
Spider-Man (2002)
Spider-Man 2 (2004)
Spider-Man 3 (2007)
Spiderwick Chronicles, The (2008)
The Amazing Spider-Man 2 (2014)
Untitled Spider-Man Reboot (2017)
# Isolate basis movie from the correlation matrix
basis_index = movies_list.index('Spider-Man (2002)')
print(basis_index)
7921

Pearson Correlation Coefficient

# Calculate the correlation
corr_similar_movies = corr_matrix[basis_index]
corr_similar_movies
array([0.216504  , 0.56198288, 0.53920046, ..., 0.44844531, 0.50734146,
       0.0592397 ])

Recommend highly correlated movies

list(movies_names[(corr_similar_movies < 1) & (corr_similar_movies > 0.9)])
['A.I. Artificial Intelligence (2001)',
 'Armageddon (1998)',
 'Back to the Future Part II (1989)',
 'Back to the Future Part III (1990)',
 'Batman Begins (2005)',
 'Big Fish (2003)',
 'Bourne Identity, The (2002)',
 'Bourne Supremacy, The (2004)',
 'Cast Away (2000)',
 'Catch Me If You Can (2002)',
 "Charlie's Angels (2000)",
 'Chicken Run (2000)',
 'Chronicles of Narnia: The Lion, the Witch and the Wardrobe, The (2005)',
 'Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)',
 'Fifth Element, The (1997)',
 'Gladiator (2000)',
 'Hero (Ying xiong) (2002)',
 'House of Flying Daggers (Shi mian mai fu) (2004)',
 'Illusionist, The (2006)',
 'Incredibles, The (2004)',
 'Italian Job, The (2003)',
 'K-PAX (2001)',
 'Last Samurai, The (2003)',
 'Lord of the Rings: The Fellowship of the Ring, The (2001)',
 'Lord of the Rings: The Two Towers, The (2002)',
 'Mask of Zorro, The (1998)',
 'Matrix Reloaded, The (2003)',
 'Matrix Revolutions, The (2003)',
 'Minority Report (2002)',
 'Monsters, Inc. (2001)',
 "Ocean's Eleven (2001)",
 'Pirates of the Caribbean: The Curse of the Black Pearl (2003)',
 'Road to Perdition (2002)',
 'School of Rock (2003)',
 'Serenity (2005)',
 'Shrek (2001)',
 'Shrek 2 (2004)',
 'Signs (2002)',
 'Spider-Man (2002)',
 'Spider-Man 2 (2004)',
 'Star Wars: Episode I - The Phantom Menace (1999)',
 'Star Wars: Episode II - Attack of the Clones (2002)',
 'Star Wars: Episode III - Revenge of the Sith (2005)',
 'Truman Show, The (1998)',
 'Unbreakable (2000)',
 'WarGames (1983)',
 'X-Men (2000)',
 'X-Men: The Last Stand (2006)',
 'X2: X-Men United (2003)']
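The list above is in alphabetical order and still contains Spider-Man (2002) itself, presumably because floating-point rounding leaves its self-correlation slightly below 1. A possible alternative, sketched below, excludes the basis movie by position and returns the top N matches ranked by correlation. The helper `top_n_similar` and the toy values are hypothetical.

```python
import numpy as np

def top_n_similar(corr_row, names, basis_index, n=10):
    """Return the n (name, correlation) pairs most correlated with the basis movie."""
    scores = np.asarray(corr_row, dtype=float).copy()
    scores[basis_index] = -np.inf            # drop the basis movie by position, not by value
    order = np.argsort(scores)[::-1][:n]     # indices of the highest correlations first
    return [(names[i], float(scores[i])) for i in order]

# Toy data (hypothetical): the basis movie "B" correlates with itself at just under 1.
toy_names = ["A", "B", "C", "D"]
toy_row = np.array([0.95, 0.9999999999999998, 0.20, 0.97])

result = top_n_similar(toy_row, toy_names, basis_index=1, n=2)
print(result)
```

With the variables defined earlier, a call such as `top_n_similar(corr_similar_movies, movies_list, basis_index)` would give a ranked list instead of a thresholded one.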

References

  1. Rajaraman, A., Ullman, J. D. (2014). Mining of massive datasets. Cambridge: Cambridge University Press.

  2. Banik, R. (2018). Hands-on recommendation systems with python. Birmingham: Packt.

  3. Department of Computer Science and Engineering, University of Minnesota. (2021). GroupLens. https://grouplens.org/datasets/movielens

  4. Strang, G. (2016). Introduction to linear algebra. MA: Wellesley-Cambridge Press.

  5. Felfernig, A., et al. (2018). Group recommender systems. Springer.
