MULTILABEL MOVIE GENRE CLASSIFIER

Objective

The goal of this project was to build a multi-label text classification model that can predict movie genres from its description. The project includes data collection, cleaning, model training, deployment, and API integration.
The model can classify 21 different types of book genres
The keys of deployment\genre_types_encoded.json shows the book genres

Data Collection

Data was collected from IMDB official Website Listing.

Movie Details Scraping: movie title, description and genres are scraped using selenium library. The code can be found in scraper/scrape.py folder.

In total, 13,001 movie details were scraped till 1/8/24. The collection process is still in progress. the data are stored in data/movie_data.csv.
Update : 19500 movies collected(yet to be trained)

Data Preprocessing

Initially, duplicates and missing values were dropped. After that, two of the 23 genres(Film and Noir) were dropped as those genres are rare and training with them could potentially drop the accuracy of the model. The preprocessing can be found in notebooks/Project_NLP.ipynb. The cleaned dataset can be found in data folder.

Model Training

Finetuned a distilrobera-base model from HuggingFace Transformers using Fastai and Blurr. After five stages of training(with freeze and unfreeze), the best model was chosen to deploy(accuracy == 89.7). The model training notebook can be found in notebooks/Project_NLP.ipynb

Model Compression and ONNX Inference

The trained model has a memory of 300+MB. This model was compressed using ONNX quantization and brought under 80MB. The compression code can be found in notebooks/ONNX_Inference.ipynb and the model can be found in models folder.

Model Deployment

The compressed model is deployed to HuggingFace Spaces Gradio App. The implementation can be found in deployment folder.
Hugging face Space link here

Web Deployment

Deployed a Flask App built to take descprition and show the genres as output. Check the flask branch or click here. The website is live here

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
data		data
deployment		deployment
models		models
notebooks		notebooks
scraper		scraper
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MULTILABEL MOVIE GENRE CLASSIFIER

Objective

Data Collection

Data Preprocessing

Model Training

Model Compression and ONNX Inference

Model Deployment

Web Deployment

About

Releases

Packages

Languages

License

moinul-hossain-dhrubo/MultiLabel-Movie-Genre-Classifier

Folders and files

Latest commit

History

Repository files navigation

MULTILABEL MOVIE GENRE CLASSIFIER

Objective

Data Collection

Data Preprocessing

Model Training

Model Compression and ONNX Inference

Model Deployment

Web Deployment

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages