Movie Recommendation System

Reference Links

The following links point to reference material and documentation for the modules and models used.

TOPICS	LINKS
Kaggle	kaggle.com
Data Set [Kaggle]	TMDB 5000 Movie Dataset
Numpy	Numpy.org
Pandas	Pandas.org
Abstract Syntax Trees	Python/AST
Lemmatization	Nltk/Lemmatization
Vectorization	Scikit Learn/CountVectorizer
Cosine Similarity	Scikit Learn/Cosine Similarity

Project Overview

In this project, I've developed a movie recommendation system using machine learning techniques. Below, I'll describe the models utilized, the imports, their functions, and why they were chosen.

Dataset

I used the TMDB 5000 Movie Dataset from Kaggle. See the Reference Links section for its features and further details.

Imports and Data Loading

I start by importing the necessary libraries, including NumPy and Pandas for data manipulation. Then, I load the dataset containing movie information from CSV (Comma-separated values) files using Pandas.

There are two files:

tmdb_5000_movies.csv

tmdb_5000_credits.csv

Data Preprocessing

After loading the data, I perform several preprocessing steps:

Merging datasets: I merge two datasets containing movie information and credits.
Removing Null Values: I remove rows with missing values from the dataset, so that later operations do not produce NaN results or missing-data errors.
Extracting Features: I extract relevant features such as genres, keywords, cast, and director from the dataset, because I want to recommend movies based on their content and overview.
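The loading, merging, and null-removal steps above can be sketched as follows. This is a minimal illustration, not the project's exact code: the two in-memory CSV snippets stand in for `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`, the rows are made up, and merging on the `title` column is an assumption.

```python
import io

import pandas as pd

# Tiny stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv (made-up rows)
movies_csv = io.StringIO(
    "id,title,overview\n"
    "1,Avatar,A marine on an alien moon.\n"
    "2,Untitled,\n"  # missing overview, parsed as NaN
)
credits_csv = io.StringIO(
    "movie_id,title,cast\n"
    "1,Avatar,Sam Worthington\n"
    "2,Untitled,Unknown\n"
)

movies = pd.read_csv(movies_csv)
credits = pd.read_csv(credits_csv)

# Merge the two tables (assumption: 'title' is the join key)
merged = movies.merge(credits, on="title")

# Drop every row that still has a missing value
cleaned = merged.dropna()

print(cleaned[["title", "overview", "cast"]])
```

With the real files, the `io.StringIO` objects would be replaced by the CSV file paths passed to `pd.read_csv()`.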

Text Processing

Text data such as genres, keywords, cast, and director are processed using the techniques described below.

Abstract Syntax Trees

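Several columns in the dataset are stored as JSON-formatted strings. To convert them into Python objects (lists of dictionaries), I used ast.literal_eval() from Python's ast (Abstract Syntax Trees) module. This works for the fields used here because they hold only strings and numbers, which are written the same way in JSON and in Python; JSON values such as true, false, or null would need json.loads() instead.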

JSON (JavaScript Object Notation) is a lightweight data interchange format commonly used for representing structured data. In the context of this project, JSON is used to store information such as movie genres, keywords, cast, and direction.

When data is stored in JSON format, it's typically represented as a string.

The ast.literal_eval() function in Python is used to safely evaluate a string containing a Python literal or container display syntax (such as lists, dictionaries, tuples, etc.) and return the corresponding Python object.

This function is particularly useful when you want to evaluate strings that contain Python syntax but want to ensure that it is safe to do so, as it only evaluates strings containing literals and does not evaluate any arbitrary Python code.

Here's an example of how ast.literal_eval() can be used:

```python
import ast

# Define a string containing a Python list literal
list_string = "[1, 2, 3, 4, 5]"

# Use ast.literal_eval() to evaluate the string and convert it to a Python list
result_list = ast.literal_eval(list_string)

print(result_list)
```

Output:

```
[1, 2, 3, 4, 5]
```

In this example, ast.literal_eval() safely evaluates the string "[1, 2, 3, 4, 5]" and returns the corresponding Python list [1, 2, 3, 4, 5].

Similarly, you can use ast.literal_eval() with other Python literals and containers, such as dictionaries, tuples, and sets, as long as the string contains only literal values.
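Applied to this project, the same call can turn a genres-style field into a plain list of names. The sketch below assumes each entry is a dictionary with a `name` key, as in the TMDB dataset; the helper `extract_names` and the sample string are hypothetical.

```python
import ast


def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' of each entry."""
    return [item["name"] for item in ast.literal_eval(raw)]


# Sample value shaped like the 'genres' column (illustrative data)
genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

names = extract_names(genres)
print(names)  # ['Action', 'Adventure']
```

The same helper can be applied to other columns that share this structure, for example with `DataFrame.apply`.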

Data Normalization

Lemmatization and stemming are both techniques used in natural language processing (NLP) to reduce words to their base or root form. While they serve a similar purpose, they differ in their approach and the results they produce.

Stemming

Stemming removes affixes (mostly suffixes) from words to obtain a root form. It works by chopping off word endings according to rules or heuristics, without consulting a dictionary. This can lead to over-stemming (where words with different meanings are reduced to the same stem) or under-stemming (where related words are not reduced to a common stem).
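The suffix stripping described above can be tried with NLTK's `PorterStemmer`. This is only an illustration of stemming in general; Porter is one of several stemming algorithms, and the project itself uses lemmatization.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem a few sample words; the stemmer needs no dictionary or corpus download
words = ["running", "cars", "better"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```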

For example:

"running" -> "run"
"cars" -> "car"
"better" -> "better" (No change: a stemmer only strips endings, so it cannot relate "better" to "good")

Lemmatization

Lemmatization, on the other hand, aims to determine the lemma or base form of a word by considering its context and meaning. Unlike stemming, lemmatization takes into account the part of speech of the word and uses lexical knowledge databases (such as WordNet) to accurately identify the lemma.

For example:
- "running" -> "run"
- "cars" -> "car"
- "better" -> "good" (Correctly identified as the lemma of "better" when the word is tagged as an adjective)

Why Lemmatization is Preferred

In many cases, lemmatization is preferred over stemming because it produces more accurate and meaningful results. Since lemmatization considers the context and meaning of words, it can correctly identify the base form even when the word undergoes complex inflectional changes. This is especially important in tasks like text classification, sentiment analysis, and information retrieval, where the precise meaning of words matters.

In the context of the movie recommendation system, lemmatization is preferred because it ensures that similar words are treated as the same entity, leading to better matching and more accurate recommendations. For example, lemmatization can correctly identify that "running", "runs", and "ran" all share the same base form "run", ensuring that movies with similar themes or keywords are appropriately matched. This enhances the effectiveness of the recommendation system in capturing the semantic similarities between movies and improving the user experience.

Here's an example of lemmatization in Python using the NLTK (Natural Language Toolkit) library:

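```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data the lemmatizer relies on
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

word = "running"
# pos='v' tells the lemmatizer to treat the word as a verb
lemmatized_word = lemmatizer.lemmatize(word, pos='v')

print(lemmatized_word)
```

This will output:

```
run
```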

Vectorization

Vectorization is the process of converting text data into numerical vectors that machine learning models can understand. CountVectorizer is a commonly used technique for vectorizing text data.

CountVectorizer converts a collection of text documents into a matrix of token counts, where each row represents a document and each column represents a unique word (or token) in the corpus. The value at each position in the matrix represents the frequency of occurrence of that word in the document.

Here's an example of how CountVectorizer works in Python:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

This will output:

```
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
```

Each row in the output array corresponds to a document, and each column represents a word. The value at each position indicates the frequency of that word in the corresponding document.

Model Building

I use cosine similarity to calculate the similarity between movies based on their features.

Euclidean distance and cosine similarity are both metrics used to measure the similarity between vectors, but they approach the concept of similarity in different ways and have different applications.

Euclidean Distance

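Euclidean distance measures the straight-line distance between two points in Euclidean space. In the context of vectors, it calculates the geometric distance between two points represented by vectors. For two n-dimensional vectors $\mathbf{a} = (a_1, \dots, a_n)$ and $\mathbf{b} = (b_1, \dots, b_n)$, Euclidean distance is calculated using the formula:

$$d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$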

Euclidean distance considers both the magnitude and direction of vectors.
It is sensitive to differences in magnitudes between vectors.
It is useful when the magnitude of the vectors is important and you want to measure their spatial distance.

Cosine Similarity

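Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It indicates the similarity in direction between the vectors, regardless of their magnitudes. For two n-dimensional vectors $\mathbf{a}$ and $\mathbf{b}$ separated by an angle $\theta$, cosine similarity is calculated using the formula:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$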

Cosine similarity only considers the direction of vectors, ignoring their magnitudes.
It is particularly useful when the magnitude of vectors is not important, and you want to focus on their orientation or semantic similarity.
Cosine similarity is widely used in text analysis, information retrieval, and recommendation systems.
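The difference between the two metrics shows up when two vectors point in the same direction but differ in length. The NumPy sketch below uses made-up vectors and two small helper functions written for this illustration.

```python
import numpy as np


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 0.0])
b = 3 * a                      # same direction as a, three times the magnitude
c = np.array([0.0, 0.0, 1.0])  # orthogonal to a

print(euclidean_distance(a, b))  # greater than 0: the magnitudes differ
print(cosine_similarity(a, b))   # approximately 1.0: the directions match
print(cosine_similarity(a, c))   # 0.0: the vectors share no component
```

For word-count vectors, `a` and `b` could represent a short and a long description that use the same words in the same proportions: Euclidean distance treats them as far apart, while cosine similarity treats them as near-identical.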

Why Use Cosine Similarity in My Model?

In many natural language processing (NLP) tasks, such as text analysis and recommendation systems, the focus is on how similar two documents are in content. In these cases, the magnitude of the vectors (e.g., how many words a document contains) is often not relevant, while their direction (which words appear, and in what proportion) is crucial.

Cosine similarity is preferred in such scenarios because it compares documents by the angle between their vector representations. It is unaffected by the overall scale of the vectors, making it suitable for comparing text samples of varying lengths.

My model works with textual data describing each movie (genres, keywords, cast, director, and overview), so I chose cosine similarity to compare movies by content. By measuring the similarity in direction between movie vectors, it can identify movies with similar themes, genres, or content, which leads to more relevant recommendations.
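Putting vectorization and cosine similarity together, a recommendation lookup might look like the sketch below. The titles, the tag strings, and the `recommend` helper are made up for illustration and are not the project's actual code; only `CountVectorizer` and `cosine_similarity` are real scikit-learn APIs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative data: one combined tag string per movie
titles = ["Avatar", "Aliens", "Titanic", "The Notebook"]
tags = [
    "action adventure space alien",
    "action space alien horror",
    "romance drama ship",
    "romance drama love",
]

# Turn each tag string into a word-count vector, then compare all pairs
vectors = CountVectorizer().fit_transform(tags)
similarity = cosine_similarity(vectors)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title."""
    idx = titles.index(title)
    ranked = np.argsort(similarity[idx])[::-1]  # most similar first
    return [titles[i] for i in ranked if i != idx][:top_n]


print(recommend("Avatar", top_n=1))   # ['Aliens']
print(recommend("Titanic", top_n=1))  # ['The Notebook']
```

Because the full similarity matrix is computed once, each lookup only needs to sort a single row.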

Saving Results

Finally, I saved the movie list and recommendation data as CSV files for future reference. I also serialized the recommendation data and similarity matrix with pickle for later use, and saved the data as a JSON file for the app.
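A minimal sketch of the pickle and JSON steps is shown below. The file names, the temporary output directory, and the sample data are assumptions made for this example, not the project's actual paths or contents.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Illustrative stand-ins for the movie list and similarity matrix
movies = [{"movie_id": 1, "title": "Avatar"}, {"movie_id": 2, "title": "Aliens"}]
similarity = [[1.0, 0.75], [0.75, 1.0]]

# Hypothetical output location; a real project would use a fixed folder
out_dir = Path(tempfile.mkdtemp())

# Binary serialization for later use from Python
with open(out_dir / "similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)

# JSON export for consumers that cannot read pickle files, such as an app
with open(out_dir / "movies.json", "w", encoding="utf-8") as f:
    json.dump(movies, f)

# Load both files back to confirm the round trip
with open(out_dir / "similarity.pkl", "rb") as f:
    restored_similarity = pickle.load(f)
with open(out_dir / "movies.json", encoding="utf-8") as f:
    restored_movies = json.load(f)

print(restored_similarity == similarity, restored_movies == movies)  # True True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.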

This movie recommendation engine provides personalized movie recommendations based on user preferences and similarity between movies, making it a valuable tool for users seeking new movie suggestions.

Graphs and Observations

Genre distribution in the dataset:
Directors and the number of movies they directed:
Similarity plot between movies:
- Note: I did not generate this plot; it is included as a reference to visualize how the recommendations work.

Installation

React Native Deployment

Steps	Commands
Clone the repository	`git clone https://github.com/jagadeeshm007/Movie_Recommendation_System.git`
Navigate to the app directory	`cd ./Movie_Recommendation_Engine/GUI/Movie_Recommendation_Search_App`
Create a `.env` file	`echo EXPO_API_KEY=YOUR_API_KEY > .env`
Install dependencies	`npm install`
Start the application locally	`npm start`

Python Testing

Steps	Commands
Execute the Python script	`python Movie_Recommendation_System.py`

Environment Variables

To run this project, you will need to add the following environment variable to your .env file:

EXPO_API_KEY

Installation Instructions

[React Native] Deployment

To deploy this application, you need a TMDB API key, Node.js, and React Native. A free API key is available from the TMDB website. For setup, follow the official React Native documentation.

After obtaining the TMDB API key and installing Node.js and React Native, follow these installation steps:

First, clone the repository using the git clone command in the terminal:

```shell
git clone https://github.com/jagadeeshm007/Movie_Recommendation_System.git
```

Navigate to the folder GUI/Movie_Recommendation_Search_App using the following command in the terminal:

```shell
cd ./Movie_Recommendation_Engine/GUI/Movie_Recommendation_Search_App
```

Now, create a file named .env in that directory.

🛑🪧Note: Replace YOUR_API_KEY with your TMDB API Key.

```shell
echo EXPO_API_KEY=YOUR_API_KEY > .env
```

Run npm install to download required dependencies.

```shell
npm install
```

Use the npm start command to start the application locally.

```shell
npm start
```

⚡The application will run locally at localhost:8081. Use web or Expo Go to test the application.

⚠️NOTE:

For complete documentation of my React Native application, see its repository:

🔗Movie_Recommendation_Search_App

🔗Python

Additionally, you can test the Python code by running the Movie_Recommendation_System.py file in the GUI folder.

Conclusion

This movie recommendation system provides personalized movie recommendations based on user preferences and similarity between movies. It can be further optimized and integrated into various applications to enhance user experience.
