A movie database and content based movie recommender system using cosine similarity

HrithikRai/AMDB--Awesome-Movie-Database-and-Recommender-System

Awesome-Movie-Database-and-Recommender-System

In this project we build a movie database and recommendation system. The recommendation section (page 4) is a content-based movie recommender that uses cosine similarity. The user can perform the following operations, each with a response time on the order of milliseconds:

  • Search for a movie by its full or partial title and get similar recommendations
  • View top-rated movies, genre-wise
  • Find users in the database with a similar taste in movies
  • Recommend movies to a given user from the database
  • Get recommendations based on your choice of movies
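As a rough illustration of the content-based idea, the sketch below ranks movies by the cosine similarity of their genre vectors. It is a minimal sketch, not the project's actual code: the movie titles, the four-genre feature layout, and the function names `cosine_similarity` and `recommend` are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as dissimilar
    return dot / (norm_a * norm_b)


def recommend(title, features, k=2):
    """Return the k movies most similar to `title`, best match first."""
    query = features[title]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in features.items()
        if other != title
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical one-hot genre vectors: [Action, Comedy, Drama, Sci-Fi]
features = {
    "Movie A": [1, 0, 0, 1],
    "Movie B": [1, 0, 0, 1],
    "Movie C": [0, 1, 1, 0],
    "Movie D": [1, 0, 1, 0],
}

top_matches = recommend("Movie A", features)
```

Here "Movie B" shares both genres with "Movie A" and ranks first, "Movie D" shares one genre and ranks second, and "Movie C" shares none. The same function could in principle compare users by their rating vectors instead of movies by genre.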

Dataset used - MovieLens 25M, with 25 million ratings (https://grouplens.org/datasets/movielens/25m/)

The dataset contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users.

List of genres: Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

Dataset Information :

Interesting facts -

From preliminary analysis, it was found that one person watched 32,202 movies and rated them on the MovieLens platform. That would probably take around 6-7 years. Hats off T_T.

The most-rated movie is Forrest Gump, with over 80k ratings.

The ratings are distributed as follows:

A word cloud created from the users' tags:

[Figure: word cloud]

Database Creation :

SQLAlchemy with SQLite

SQLAlchemy offers several benefits over the raw SQL approach, including:

  • Cleaner code: Having SQL code as Python strings gets messy pretty quickly.
  • More secure code: Using SQLAlchemy's ORM functionalities can help mitigate vulnerabilities such as SQL injection.
  • Simpler logic: SQLAlchemy allows us to abstract all of our database logic into Python objects. Instead of having to think on a table, row, and column level, we can consider everything on a class, instance, and attribute level.

SQLAlchemy can be used at three different layers of abstraction:

The lowest layer is using only SQLAlchemy's engine component to execute raw SQL.

The middle layer is using SQLAlchemy's expression language to build SQL statements in a more Pythonic way than using raw SQL strings.

The highest abstraction layer is using SQLAlchemy's full Object Relational Mapping (ORM) capabilities, which allow one to think in terms of Python classes and objects instead of database tables and connections.
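The three layers can be sketched side by side against an in-memory SQLite database. This is only an illustration, assuming SQLAlchemy 1.4 or later is installed: the `Movie` class, the `movies` table, its columns, and the sample rows are hypothetical and not taken from the project.

```python
from sqlalchemy import Column, Float, Integer, String, create_engine, select, text
from sqlalchemy.orm import Session, declarative_base

# In-memory SQLite database, so nothing is written to disk.
engine = create_engine("sqlite:///:memory:")
Base = declarative_base()


class Movie(Base):
    """Hypothetical ORM model used only for this example."""
    __tablename__ = "movies"
    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    avg_rating = Column(Float)


Base.metadata.create_all(engine)

# Layer 1: engine only, raw SQL with bound parameters (made-up sample row).
with engine.begin() as conn:
    conn.execute(
        text("INSERT INTO movies (title, avg_rating) VALUES (:title, :rating)"),
        {"title": "Movie A", "rating": 4.5},
    )

# Layer 2: expression language, the statement is built from Python objects.
movies = Movie.__table__
stmt = select(movies.c.title).where(movies.c.avg_rating >= 4.0)
with engine.connect() as conn:
    core_titles = [row[0] for row in conn.execute(stmt)]

# Layer 3: ORM, rows are read and written as Movie instances.
with Session(engine) as session:
    session.add(Movie(title="Movie B", avg_rating=3.5))
    session.commit()
    orm_titles = [
        m.title
        for m in session.execute(select(Movie).order_by(Movie.title)).scalars()
    ]
```

In all three layers the values travel as bound parameters rather than being pasted into SQL strings, which is the property behind the injection point made earlier.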

Why SQLite?

It offers a full-featured relational database management system (RDBMS) that works with a single file to maintain all the database functionality.

It also has the advantage of not requiring a separate database server to function. The database file format is cross-platform and accessible to any programming language that supports SQLite.
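The single-file, serverless behaviour can be seen with Python's built-in `sqlite3` module alone. In this minimal sketch the file name `movies.db`, the table layout, and the helper `roundtrip` are assumptions made for the example; the data is written, the connection is closed, and the same file is then reopened and read back.

```python
import os
import sqlite3
import tempfile


def roundtrip(db_path):
    """Write one row to a SQLite file, close it, reopen it, and read it back."""
    conn = sqlite3.connect(db_path)  # creates the file if it does not exist
    try:
        conn.execute(
            "CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
        )
        conn.execute("INSERT INTO movies (title) VALUES (?)", ("Example Movie",))
        conn.commit()
    finally:
        conn.close()

    # A fresh connection sees the data: everything lives in the one file.
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("SELECT title FROM movies")]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies.db")  # hypothetical file name
    titles = roundtrip(path)
    file_exists = os.path.isfile(path)
```

No database server is started at any point, and copying the one file is enough to move the database to another machine.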

An interesting preprocessing issue (data_preprocessing)

Finding the top 10 best-rated movies: many movies have a perfect rating of 5 but were rated by only one or a few users, while others have a rating like 4.5 but were rated by thousands of users. So when suggesting the best-rated movies, which option should we go for?

Solution - Laplace's rule of succession: it balances an absolute high rating against the confidence gained from more data.

Take a movie with a perfect rating of 5 from only 1 rater. We assume that 2 more users rated it, one giving a 5 and the other a 1. The average rating then falls to (5+5+1)/3 ≈ 3.67. This process is repeated for every movie, and the top-rated results are fetched according to the new ratings.
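The adjustment described above can be written in a few lines. This is a minimal sketch: the function name `smoothed_rating` and the `pseudo_ratings` parameter are illustrative choices, and the 1,000-rating movie is a made-up example rather than a row from the dataset.

```python
def smoothed_rating(ratings, pseudo_ratings=(5.0, 1.0)):
    """Average the real ratings together with two assumed ratings (a 5 and a 1)."""
    total = sum(ratings) + sum(pseudo_ratings)
    count = len(ratings) + len(pseudo_ratings)
    return total / count


# One perfect rating: (5 + 5 + 1) / 3
single_rater = smoothed_rating([5.0])

# Hypothetical movie with 1,000 ratings of 4.5: (4500 + 5 + 1) / 1002
many_raters = smoothed_rating([4.5] * 1000)
```

With only one rater the two assumed ratings dominate and pull the score down to about 3.67, while with 1,000 raters they barely move it from 4.5, so the widely rated movie ranks higher.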

Operations performed - genre-wise clustering, Laplace's rule of succession, cosine similarity to estimate distances, groupby, genome tag extraction for sentiment analysis, dataframe manipulations

NOTE - tested and deployment-ready (Heroku, Cloud Run, etc.)

How to run the project?

Install the packages listed in requirements.txt.

Download the MovieLens 25M dataset and extract the files into the movielens folder in the root directory.

Run the files listed in data_preprocessing; they generate all the files needed to run the system.

In the terminal, run the command - streamlit run main_page.py

Do let me know in case you face any issues running the code - contact me

Regards, Hrithik
