Multilabel Quote Classifier

An end-to-end text classification project covering data collection, model training, and deployment.
The model can assign any of 675 different quote tags to a quote.
The keys of deployment\tag_types_encoded.json list the quote tags.
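
As a rough sketch of how a tag-to-index mapping like this can be used, the snippet below turns a list of tags into a multi-hot target vector. It assumes the JSON file maps each tag name to an integer index, which the source does not confirm. The three-tag mapping and the `multi_hot` helper are made-up stand-ins, not the real 675 tags or the project's code.

```python
import json

# Hypothetical stand-in for deployment\tag_types_encoded.json.
# Assumption: the file maps each tag name to an integer index.
encoded_json = '{"Life": 0, "Love": 1, "Wisdom": 2}'
tag_to_index = json.loads(encoded_json)


def multi_hot(tags, tag_to_index):
    """Return a 0/1 vector with a 1 at the index of every known tag."""
    vector = [0] * len(tag_to_index)
    for tag in tags:
        index = tag_to_index.get(tag)
        if index is not None:  # skip tags outside the vocabulary
            vector[index] = 1
    return vector


print(sorted(tag_to_index))  # the tag names are the keys
print(multi_hot(["Love", "Wisdom"], tag_to_index))
```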

Data Collection

Data was collected from the BrainyQuote website listing.
The data collection process was divided into 3 steps:

Category & URL Scraping: The quote URLs were scraped with scraper\quote_url_scraper.py and stored, along with the quote titles, in scraper\quotes_urls.csv.
Quote Details Scraping: Using those URLs, the quotes, authors, categories and description URLs were scraped with scraper\detail_data_scraper.py and stored in scraper\quotes.csv.
Tags Scraping: The final step, tag scraping, was the most difficult. I split the data into mini-batches, scraped the tags with scraper\tags_scraper.py, and stored the results in data\quotes_data.csv.
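
The mini-batch approach in the last step can be sketched as follows. This is not the code in scraper\tags_scraper.py. The batch size, the example URLs, and the `batched` helper are illustrative assumptions.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical URLs; in practice these would be read from the scraped CSV.
urls = [f"https://example.com/quote/{i}" for i in range(7)]

for number, batch in enumerate(batched(urls, batch_size=3), start=1):
    # Each batch could be scraped and saved separately, so a failure
    # costs one batch rather than the whole run.
    print(f"batch {number}: {len(batch)} URLs")
```

Working in batches like this keeps partial progress on disk, which matters when a long scraping run can be interrupted.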

In total, I scraped 101,243 quotes with their authors' names, categories and relevant tags.

Data Preprocessing

Initially, there were 9,129 different tags in the dataset. Analysis showed that 8,454 of them were rare (probably custom tags added by users), so I removed them, which left 675 tags. I then dropped the duplicated rows, which left a total of 101,201 samples.
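
A minimal sketch of this kind of filtering, using only the standard library. The frequency threshold (`min_count`), the helper names, and the sample rows are assumptions for illustration. The source does not say how "rare" was defined or which columns defined a duplicate.

```python
from collections import Counter


def drop_rare_tags(rows, min_count):
    """Keep only tags that occur at least `min_count` times across all rows."""
    counts = Counter(tag for _, tags in rows for tag in tags)
    frequent = {tag for tag, n in counts.items() if n >= min_count}
    return [(quote, [t for t in tags if t in frequent]) for quote, tags in rows]


def drop_duplicates(rows):
    """Remove repeated (quote, tags) rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for quote, tags in rows:
        key = (quote, tuple(tags))
        if key not in seen:
            seen.add(key)
            unique.append((quote, tags))
    return unique


# Made-up sample data: (quote text, list of tags).
rows = [
    ("Quote A", ["Life", "Love"]),
    ("Quote B", ["Life", "my-custom-tag"]),
    ("Quote A", ["Life", "Love"]),  # exact duplicate of the first row
    ("Quote C", ["Love"]),
]

cleaned = drop_duplicates(drop_rare_tags(rows, min_count=2))
print(cleaned)
```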

Model Training

I fine-tuned the distilroberta-base and bert-base-uncased models from HuggingFace Transformers using Fastai and Blurr. The model training notebook can be viewed in notebooks\quotes_data_prep_and_model_implement.ipynb.

Result Comparison

| Model | Test Accuracy | F1 Score (Micro) | F1 Score (Macro) |
|---|---|---|---|
| Distil Roberta Base | - | 84.89% | 52.42% |
| Bert Base Uncased | - | 88.60% | 67.15% |
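
A gap between micro and macro F1 is common on multilabel data with uneven tag frequencies. Micro F1 pools every tag decision, so frequent tags dominate, while macro F1 averages the per-tag scores, so poorly predicted rare tags pull it down. The toy example below illustrates the two averages. The labels and predictions are made up and are not taken from this project.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute (micro, macro) F1 for multi-hot label rows."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        per_label.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(c[i] for c in per_label) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in per_label) / n_labels
    return micro, macro


# Toy data: label 0 is frequent and predicted well, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")
```

Here the single missed rare label barely moves the micro score (8/9) but halves the macro score (0.5).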

`Bert Base Uncased` was found to be the best model in this comparison.

## Model Compression and ONNX Inference

The trained bert-base-uncased model is over 422 MB in size. I compressed it using ONNX quantization, which brought it under 106 MB.
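
The size reduction is roughly what 8-bit quantization would be expected to give: storing weights as 8-bit integers instead of 32-bit floats shrinks them by up to 4x, and 422 MB / 4 is about 106 MB. A minimal sketch using ONNX Runtime's `quantize_dynamic` follows. It assumes the model has already been exported to ONNX and that dynamic 8-bit quantization was the method used, which the source does not state. The file names and helper functions are hypothetical.

```python
def quantize_model(fp32_path, int8_path):
    """Write an 8-bit dynamically quantized copy of an ONNX model."""
    # Imported here so the arithmetic below runs without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        weight_type=QuantType.QUInt8,
    )


def compression_ratio(original_mb, compressed_mb):
    """How many times smaller the compressed model is."""
    return original_mb / compressed_mb


# Approximate sizes reported above; the ratio is close to the 4x
# expected when going from 32-bit to 8-bit weights.
print(f"{compression_ratio(422, 106):.2f}x smaller")

# Hypothetical file names, shown for illustration only:
# quantize_model("model.onnx", "model-int8.onnx")
```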

Model Deployment

The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the deployment folder.
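
At inference time, a multilabel classifier typically produces one score per tag, and the app has to turn those scores into a list of tags. The sketch below shows one common way to do that: apply a sigmoid to each raw score and keep the tags above a threshold. The 0.5 threshold, the tag names, the scores, and the `predict_tags` helper are assumptions for illustration, not values or code taken from the deployment folder.

```python
import math


def predict_tags(logits, index_to_tag, threshold=0.5):
    """Map raw per-tag scores to {tag: probability} for tags above threshold."""
    result = {}
    for index, logit in enumerate(logits):
        probability = 1.0 / (1.0 + math.exp(-logit))  # sigmoid, one per tag
        if probability >= threshold:
            result[index_to_tag[index]] = probability
    return result


# Made-up tag names and model outputs.
index_to_tag = {0: "Life", 1: "Love", 2: "Wisdom"}
logits = [2.0, -1.5, 0.3]

for tag, probability in predict_tags(logits, index_to_tag).items():
    print(f"{tag}: {probability:.2f}")
```

Each tag is scored independently, so a quote can receive several tags or none, which is what distinguishes multilabel from multiclass prediction.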

Web Deployment

I also deployed a Flask app that takes descriptions as input and shows the predicted tags as output. See the flask branch. The website is live.
