Movie Recommendation Engine

Author: Waldy Setiono (waldysetiono@gmail.com)

Introduction

Recommendation systems are widely used in our daily lives and play a significant role in shaping the decisions we make. Almost everything we buy, watch, consume, use, or even do is influenced by some form of recommendation, whether from friends, family, a Google search, a shaman, a preacher, a political leader, an advisor, a lawyer, a doctor, a scholar, online reviews, or an app algorithm. Big companies gain substantial revenue growth by building recommender engines into their platforms.

Recommendation systems can be built using:

Content-based Filtering,
Collaborative Filtering, or
Combination of both (hybrid)

Content-based filtering tries to guess what a user may like from that user's own activity, whereas collaborative filtering predicts what a user might like from the preferences of other users who are similar to the user in question. Collaborative filtering can be memory-based or model-based.
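To make the memory-based variant concrete, here is a minimal sketch of user-based collaborative filtering on a made-up 4 × 4 rating matrix. The toy ratings, the helper name `predict_rating`, and the choice of cosine similarity are illustrative assumptions and are not part of this project, which uses a model-based approach instead.

```python
import numpy as np

# Toy user-by-movie rating matrix (rows: users, columns: movies); 0 = not rated.
# These numbers are invented for illustration only.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])


def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict_rating(ratings, user, movie):
    """Hypothetical helper: similarity-weighted average of other users' ratings."""
    weights, values = [], []
    for other in range(ratings.shape[0]):
        # Skip the user themself and anyone who has not rated the movie.
        if other == user or ratings[other, movie] == 0:
            continue
        weights.append(cosine_similarity(ratings[user], ratings[other]))
        values.append(ratings[other, movie])
    return float(np.dot(weights, values) / np.sum(weights))


# User 0 has not rated movie 2; users with similar taste get a larger say.
predicted = predict_rating(toy_ratings, user=0, movie=2)
print(round(predicted, 2))
```

A memory-based method like this works directly on the stored ratings at prediction time, while a model-based method first learns a compact representation (here, latent factors from a matrix decomposition) and predicts from that.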

This project aims to develop an end-to-end recommendation system that suggests movies a user might like, using model-based collaborative filtering.

Data Preparation

# Import packages.
import pandas as pd
import numpy as np
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import os
import platform
import pprint
from typing import Dict, Text
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD

Data: The data used in this project come from GroupLens, a research lab at the University of Minnesota. The dataset contains over 100,000 ratings applied to more than 9,000 movies by about 600 users.

# Load dataset.
zipurl = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
with urlopen(zipurl) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zfile:
        zfile.extractall("/tmp/movielens")

Titles and Genres

# Check movie titles.
movies = pd.read_csv('/tmp/movielens/ml-latest-small/movies.csv')
movies

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy
...	...	...	...
9737	193581	Black Butler: Book of the Atlantic (2017)	Action|Animation|Comedy|Fantasy
9738	193583	No Game No Life: Zero (2017)	Animation|Comedy|Fantasy
9739	193585	Flint (2017)	Drama
9740	193587	Bungo Stray Dogs: Dead Apple (2018)	Action|Animation
9741	193609	Andrew Dice Clay: Dice Rules (1991)	Comedy

9742 rows × 3 columns

# Print the number of unique values in each column.
print("There are ", movies.movieId.nunique(), "unique values in movieID.")
print("There are ", movies.title.nunique(), "unique values in title.")
print("There are ", movies.genres.nunique(), "unique values in genres.")

There are  9742 unique values in movieID.
There are  9737 unique values in title.
There are  951 unique values in genres.
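The gap between 9742 movie IDs and 9737 unique titles suggests that a few titles are shared by more than one movieId. One way to inspect such cases is sketched below on a small made-up frame; the rows in `toy_movies` are invented, and on the real data the same call would be applied to `movies`.

```python
import pandas as pd

# Invented sample: two different movieIds share one title.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A (2000)", "Movie B (2001)", "Movie A (2000)", "Movie C (2002)"],
})

# keep=False marks every row whose title occurs more than once,
# not just the second and later occurrences.
duplicated_titles = toy_movies[toy_movies.duplicated("title", keep=False)]
print(duplicated_titles.sort_values("title"))
```

This matters because the utility matrix built later is indexed by title: ratings attached to different IDs with the same title end up in the same row, and `pivot_table` averages them by default.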

Ratings

# Make a dataframe of ratings.
ratings = pd.read_csv('/tmp/movielens/ml-latest-small/ratings.csv')
ratings

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931
...	...	...	...	...
100831	610	166534	4.0	1493848402
100832	610	168248	5.0	1493850091
100833	610	168250	5.0	1494273047
100834	610	168252	5.0	1493846352
100835	610	170875	3.0	1493846415

100836 rows × 4 columns

# Drop timestamp from the dataframe.
ratings = ratings.drop(columns=["timestamp"])
ratings

	userId	movieId	rating
0	1	1	4.0
1	1	3	4.0
2	1	6	4.0
3	1	47	5.0
4	1	50	5.0
...	...	...	...
100831	610	166534	4.0
100832	610	168248	5.0
100833	610	168250	5.0
100834	610	168252	5.0
100835	610	170875	3.0

100836 rows × 3 columns

# Print the number of unique values in each column.
print("There are ", ratings.userId.nunique(), "unique values in userID.")
print("There are ", ratings.movieId.nunique(), "unique values in movieID.")
print("There are ", ratings.rating.nunique(), "unique values in rating.")

There are  610 unique values in userID.
There are  9724 unique values in movieID.
There are  10 unique values in rating.

# Check missing values.
null_data = ratings[ratings.isnull().any(axis=1)]
null_data

	userId	movieId	rating

The result is empty, so there are no missing values in the dataframe.

Movies and Ratings

# Merge movies and ratings.
movies_ratings = pd.merge(ratings, movies, on='movieId')
movies_ratings = movies_ratings.drop(columns=["movieId", "genres"])
movies_ratings

	userId	rating	title
0	1	4.0	Toy Story (1995)
1	5	4.0	Toy Story (1995)
2	7	4.5	Toy Story (1995)
3	15	2.5	Toy Story (1995)
4	17	4.5	Toy Story (1995)
...	...	...	...
100831	610	2.5	Bloodmoon (1997)
100832	610	4.5	Sympathy for the Underdog (1971)
100833	610	3.0	Hazard (2005)
100834	610	3.5	Blair Witch (2016)
100835	610	3.5	31 (2016)

100836 rows × 3 columns

Popularity-based Recommender

One of the simplest movie recommender systems is a popularity-based recommender. For example, it can suggest the 10 movies that have received the most ratings.

# Recommend movies based on rating counts.
rating_count = pd.DataFrame(movies_ratings.groupby("title")["rating"].count())
rating_count.sort_values("rating", ascending=False).head(10)

	rating
title
Forrest Gump (1994)	329
Shawshank Redemption, The (1994)	317
Pulp Fiction (1994)	307
Silence of the Lambs, The (1991)	279
Matrix, The (1999)	278
Star Wars: Episode IV - A New Hope (1977)	251
Jurassic Park (1993)	238
Braveheart (1995)	237
Terminator 2: Judgment Day (1991)	224
Schindler's List (1993)	220
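Rating counts alone say nothing about how well a movie was rated. One possible refinement, sketched here on invented data, is to rank by mean rating among titles that have at least a minimum number of ratings. The threshold `min_ratings`, the column names `n_ratings` and `mean_rating`, and the toy frame are assumptions for illustration only.

```python
import pandas as pd

# Invented ratings for illustration only.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "B", "C"],
    "rating": [4.0, 5.0, 4.5, 3.0, 2.0, 2.5, 5.0],
})

min_ratings = 3  # hypothetical threshold; tune it to the size of the dataset

# Count and average the ratings of each title in one pass.
stats = toy.groupby("title")["rating"].agg(n_ratings="count", mean_rating="mean")

# Keep only titles with enough ratings, then rank by mean rating.
popular = stats[stats["n_ratings"] >= min_ratings].sort_values(
    "mean_rating", ascending=False
)
print(popular)
```

Title "C" has the highest mean but only a single rating, so the threshold keeps it out of the ranking.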

Utility Matrix

To build a recommendation system based on collaborative filtering, let's create a utility matrix with a pivot table: one row per movie title, one column per user ID, and each cell holding that user's rating of that movie. Cells for movies a user has not rated are filled with 0.

# Create utility matrix using pivot table.
X = movies_ratings.pivot_table(values='rating', index='title', columns='userId').fillna(0)
X

userId	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40	...	571	572	573	574	575	576	577	578	579	580	581	582	583	584	585	586	587	588	589	590	591	592	593	594	595	596	597	598	599	600	601	602	603	604	605	606	607	608	609	610
title
'71 (2014)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	4.0
'Hellboy': The Seeds of Creation (2004)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
'Round Midnight (1986)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
'Salem's Lot (2004)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
'Til There Was You (1997)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
eXistenZ (1999)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	4.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.5	0.0	0.0	0.0	5.0	0.0	0.0	0.0	0.0	4.5	0.0	0.0
xXx (2002)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	3.5	0.0	2.0
xXx: State of the Union (2005)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.5
¡Three Amigos! (1986)	4.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	5.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	3.0	0.0	2.5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
À nous la liberté (Freedom for Us) (1931)	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

9719 rows × 610 columns
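Most entries of the utility matrix are zeros that stand for "not rated" rather than an actual rating of zero. With 100,836 ratings spread over 9719 × 610 = 5,928,590 cells, at most about 1.7% of the matrix holds a real rating. The sketch below shows how this density could be measured; it uses a tiny invented frame so that it runs on its own, and the name `toy_X` is a stand-in for `X`.

```python
import pandas as pd

# Invented ratings for illustration only.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["A", "B", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 2.5],
})

toy_X = toy_ratings.pivot_table(
    values="rating", index="title", columns="userId"
).fillna(0)

# Fraction of cells that hold an actual rating.
# This relies on real ratings never being 0 (MovieLens ratings start at 0.5).
density = (toy_X.to_numpy() != 0).mean()
print(f"density: {density:.2%}")
```

Filling with 0 is a simplification: the decomposition that follows treats an unrated movie the same as a movie rated far below the minimum.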

Matrix Decomposition

# Decompose utility matrix using Truncated SVD
svd = TruncatedSVD(n_components=12, random_state=17)
decomposed_matrix = svd.fit_transform(X)

# Check the resultant matrix shape
decomposed_matrix.shape

(9719, 12)

print(svd.explained_variance_ratio_)

[0.17452408 0.04189715 0.02633773 0.02137632 0.0185918  0.01612086
 0.0143732  0.01178569 0.01147853 0.0099213  0.00934755 0.00905726]

print(svd.explained_variance_ratio_.sum())

0.3648114678094881
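The 12 components retain about 36% of the variance of the utility matrix. One way to explore the trade-off between compactness and retained variance is to look at the cumulative explained variance as components are added. The sketch below uses a random matrix as a stand-in for the utility matrix; the matrix, the seed, and the 0.5 target are placeholders, not values from this project.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(17)
toy_matrix = rng.random((50, 20))  # stand-in for the utility matrix

svd_toy = TruncatedSVD(n_components=10, random_state=17)
svd_toy.fit(toy_matrix)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(svd_toy.explained_variance_ratio_)

target = 0.5  # hypothetical target share of variance
reached = cumulative >= target
# Smallest number of components that reaches the target, or None if none does.
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print(cumulative.round(3), n_needed)
```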

Recommender

# Generate correlation matrix
corr_matrix = np.corrcoef(decomposed_matrix)
print(corr_matrix.shape)
corr_matrix

(9719, 9719)





array([[ 1.        ,  0.20967451,  0.30277437, ...,  0.79074266,
        -0.09266651, -0.11632059],
       [ 0.20967451,  1.        ,  0.93621217, ...,  0.11127732,
         0.03997583, -0.24647969],
       [ 0.30277437,  0.93621217,  1.        , ...,  0.10717506,
         0.19895528,  0.01216579],
       ...,
       [ 0.79074266,  0.11127732,  0.10717506, ...,  1.        ,
        -0.11547412, -0.11670845],
       [-0.09266651,  0.03997583,  0.19895528, ..., -0.11547412,
         1.        ,  0.32751487],
       [-0.11632059, -0.24647969,  0.01216579, ..., -0.11670845,
         0.32751487,  1.        ]])

# Create list of movies names
movies_names = X.index
movies_list = list(movies_names)

Suppose we want to recommend movies similar to Spider-Man.

# Find the movie on which the recommendation will be based.
basis_movie = movies_names.str.contains('Spider', regex=False)
for x in range(len(basis_movie)):
  if basis_movie[x]:
    print(movies_names[x])

Along Came a Spider (2001)
Amazing Spider-Man, The (2012)
Giant Spider Invasion, The (1975)
Horrors of Spider Island (Ein Toter Hing im Netz) (1960)
Kiss of the Spider Woman (1985)
Spider (2002)
Spider-Man (2002)
Spider-Man 2 (2004)
Spider-Man 3 (2007)
Spiderwick Chronicles, The (2008)
The Amazing Spider-Man 2 (2014)
Untitled Spider-Man Reboot (2017)

# Find the row index of the basis movie in the correlation matrix
basis_index = movies_list.index('Spider-Man (2002)')
print(basis_index)

Pearson Correlation Coefficient

# Select the correlations between the basis movie and every movie
corr_similar_movies = corr_matrix[basis_index]
corr_similar_movies

array([0.216504  , 0.56198288, 0.53920046, ..., 0.44844531, 0.50734146,
       0.0592397 ])
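`np.corrcoef` treats each row of its input as one variable, so entry (i, j) of `corr_matrix` is the Pearson correlation between the 12-dimensional latent-factor vectors of movies i and j. The check below compares one entry with the textbook formula on a small random array; the array, the seed, and the helper `pearson` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_factors = rng.normal(size=(5, 12))  # 5 "movies" x 12 latent factors

toy_corr = np.corrcoef(toy_factors)  # shape (5, 5): one row/column per movie


def pearson(a, b):
    """Pearson correlation: covariance of a and b over the product of their spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))


manual = pearson(toy_factors[0], toy_factors[1])
print(np.isclose(toy_corr[0, 1], manual))
```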

Recommend highly correlated movies

list(movies_names[(corr_similar_movies < 1) & (corr_similar_movies > 0.9)])

['A.I. Artificial Intelligence (2001)',
 'Armageddon (1998)',
 'Back to the Future Part II (1989)',
 'Back to the Future Part III (1990)',
 'Batman Begins (2005)',
 'Big Fish (2003)',
 'Bourne Identity, The (2002)',
 'Bourne Supremacy, The (2004)',
 'Cast Away (2000)',
 'Catch Me If You Can (2002)',
 "Charlie's Angels (2000)",
 'Chicken Run (2000)',
 'Chronicles of Narnia: The Lion, the Witch and the Wardrobe, The (2005)',
 'Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)',
 'Fifth Element, The (1997)',
 'Gladiator (2000)',
 'Hero (Ying xiong) (2002)',
 'House of Flying Daggers (Shi mian mai fu) (2004)',
 'Illusionist, The (2006)',
 'Incredibles, The (2004)',
 'Italian Job, The (2003)',
 'K-PAX (2001)',
 'Last Samurai, The (2003)',
 'Lord of the Rings: The Fellowship of the Ring, The (2001)',
 'Lord of the Rings: The Two Towers, The (2002)',
 'Mask of Zorro, The (1998)',
 'Matrix Reloaded, The (2003)',
 'Matrix Revolutions, The (2003)',
 'Minority Report (2002)',
 'Monsters, Inc. (2001)',
 "Ocean's Eleven (2001)",
 'Pirates of the Caribbean: The Curse of the Black Pearl (2003)',
 'Road to Perdition (2002)',
 'School of Rock (2003)',
 'Serenity (2005)',
 'Shrek (2001)',
 'Shrek 2 (2004)',
 'Signs (2002)',
 'Spider-Man (2002)',
 'Spider-Man 2 (2004)',
 'Star Wars: Episode I - The Phantom Menace (1999)',
 'Star Wars: Episode II - Attack of the Clones (2002)',
 'Star Wars: Episode III - Revenge of the Sith (2005)',
 'Truman Show, The (1998)',
 'Unbreakable (2000)',
 'WarGames (1983)',
 'X-Men (2000)',
 'X-Men: The Last Stand (2006)',
 'X2: X-Men United (2003)']
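The threshold filter returns an unordered list whose length depends on the cutoff, and the basis movie itself can slip through: 'Spider-Man (2002)' appears in the output above, presumably because its self-correlation is stored as a value marginally below 1 due to floating-point rounding. A possible alternative, sketched here with a hypothetical helper `top_n_similar`, is to sort by correlation, drop the basis movie by position, and keep the top n. The toy names and values are invented.

```python
import numpy as np


def top_n_similar(corr_row, names, basis_index, n=10):
    """Hypothetical helper: the n titles most correlated with the basis movie."""
    order = np.argsort(corr_row)[::-1]        # indices by descending correlation
    order = order[order != basis_index][:n]   # drop the basis movie by position
    return [(names[i], float(corr_row[i])) for i in order]


toy_names = ["A", "B", "C", "D"]
toy_row = np.array([0.95, 0.9999999, 0.20, 0.97])  # index 1 is the basis movie
result = top_n_similar(toy_row, toy_names, basis_index=1, n=2)
print(result)
```

With the variables defined earlier, the call would look like `top_n_similar(corr_similar_movies, movies_list, basis_index)`.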

References

Rajaraman, A., Ullman, J. D. (2014). Mining of massive datasets. Cambridge: Cambridge University Press.
Banik, R. (2018). Hands-on recommendation systems with python. Birmingham: Packt.
Department of Computer Science and Engineering, University of Minnesota. (2021). GroupLens. https://grouplens.org/datasets/movielens
Strang, G. (2016). Introduction to linear algebra. Wellesley, MA: Wellesley-Cambridge Press.
Felfernig, A., et al. (2018). Group recommender systems. Springer.
