An end-to-end text classification project covering data collection, model training, and deployment.
The model can classify 675 different quote tags. The keys of `deployment\tag_types_encoded.json` list these tags.
Data was collected from the BrainyQuote website listing.
The data collection process is divided into 3 steps:
- Category & URL Scraping: The quote URLs were scraped with `scraper\quote_url_scraper.py`, and the URLs are stored along with the quote titles in `scraper\quotes_urls.csv`.
- Quote Details Scraping: Using those URLs, the quotes, authors, categories, and description URLs were scraped with `scraper\detail_data_scraper.py` and stored in `scraper\quotes.csv`.
- Tags Scraping: The final step, tag scraping, was the most difficult. I split the data into mini-batches, scraped the tags with `scraper\tags_scraper.py`, and stored them in `data\quotes_data.csv`.
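
The mini-batch splitting used in the tag scraping step could look roughly like the sketch below. This is not the code in `scraper\tags_scraper.py`; the helper name `mini_batches`, the batch size, and the placeholder identifiers are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mini_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder identifiers standing in for the scraped quote URLs (hypothetical).
urls = [f"quote-{i}" for i in range(10)]
batches = list(mini_batches(urls, batch_size=4))
```

Processing and saving one batch at a time means a failed request only costs the current batch rather than the whole run.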
In total, I scraped 101,243 quotes with their authors' names, categories, and relevant tags.
Initially, there were 9,129 different tags in the dataset. After some analysis, I found that 8,454 of them were rare (probably custom tags added by users), so I removed them, leaving 675 tags. Dropping duplicate rows afterwards left a total of 101,201 samples.
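
A minimal sketch of the rare-tag filtering idea, using only the standard library. The frequency cutoff (`min_count`) is an assumption, since the actual threshold used in the notebook is not stated here, and the sample tags are made up.

```python
from collections import Counter


def drop_rare_tags(samples, min_count):
    """Keep only tags that occur at least `min_count` times across all samples."""
    counts = Counter(tag for tags in samples for tag in tags)
    frequent = {tag for tag, n in counts.items() if n >= min_count}
    filtered = [[tag for tag in tags if tag in frequent] for tags in samples]
    return filtered, frequent


# Toy data: "my-custom-tag" appears only once and is treated as rare.
samples = [
    ["love", "life"],
    ["love", "my-custom-tag"],
    ["life", "love"],
]
filtered, kept = drop_rare_tags(samples, min_count=2)
```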
Fine-tuned `distilroberta-base` and `bert-base-uncased` models from HuggingFace Transformers using Fastai and Blurr. The model training notebook can be viewed in `notebooks\quotes_data_prep_and_model_implement.ipynb`.
| Model | Test Accuracy | F1 Score (Micro) | F1 Score (Macro) |
|---|---|---|---|
| Distil Roberta Base | - | 84.89% | 52.42% |
| Bert Base Uncased | - | 88.60% | 67.15% |
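
The gap between the micro and macro F1 columns is typical of multi-label problems with many infrequent labels: micro F1 pools all label decisions, while macro F1 averages the per-label scores so that rare tags weigh as much as common ones. The sketch below illustrates the two averages on toy data; it is not the metric code from the training notebook, and labels with no positives are scored as 0 by assumption.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, n_labels):
    """Return (micro F1, macro F1) for binary multi-label indicator rows."""
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j in range(n_labels):
            if pred_row[j] and true_row[j]:
                tp[j] += 1
            elif pred_row[j]:
                fp[j] += 1
            elif true_row[j]:
                fn[j] += 1
    micro = f1(sum(tp), sum(fp), sum(fn))  # pool counts over all labels
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    return micro, macro


# Label 0 is common and always predicted correctly; label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred, n_labels=2)
```

Here the single missed rare label barely moves the micro score but halves the macro score.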
`Bert Base Uncased` was the best model in this comparison.

## Model Compression and ONNX Inference
The trained `bert-base-uncased` model takes up more than 422 MB. I compressed it using ONNX quantization, bringing it under 106 MB.
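
A minimal sketch of one way to quantize an ONNX model, using dynamic quantization from ONNX Runtime. It assumes the fine-tuned model has already been exported to ONNX; the file names are hypothetical, and whether the project used dynamic quantization specifically is an assumption.

```python
def quantize_onnx_model(fp32_path: str, int8_path: str) -> None:
    """Write a copy of the ONNX model with weights stored as 8-bit integers."""
    # Imported lazily so the helper below can be used without onnxruntime.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        weight_type=QuantType.QInt8,
    )


def compression_ratio(before_mb: float, after_mb: float) -> float:
    """How many times smaller the compressed model is."""
    return before_mb / after_mb


# Hypothetical file names; uncomment once an exported ONNX model exists.
# quantize_onnx_model("model.onnx", "model-quantized.onnx")

# Sizes reported in this README: about 422 MB before, about 106 MB after.
ratio = compression_ratio(422, 106)
```

The resulting ratio of roughly 4x is consistent with storing 32-bit float weights as 8-bit integers.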
The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the `deployment` folder.
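
At inference time, the model's per-tag probabilities have to be mapped back to tag names. The sketch below shows one way to do that, assuming `deployment\tag_types_encoded.json` maps each tag name to an integer index and that a fixed threshold of 0.5 is used; both are assumptions, and the tag names and probabilities are made up.

```python
import json


def load_tag_mapping(path):
    """Load a {tag name: index} mapping from a JSON file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def decode_tags(probabilities, tag_to_index, threshold=0.5):
    """Return the names of all tags whose probability reaches `threshold`."""
    index_to_tag = {index: tag for tag, index in tag_to_index.items()}
    return [index_to_tag[i] for i, p in enumerate(probabilities) if p >= threshold]


# Hypothetical three-tag mapping standing in for the real 675-tag file.
tag_to_index = {"inspirational": 0, "life": 1, "love": 2}
predicted = decode_tags([0.91, 0.12, 0.67], tag_to_index)
```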
I also deployed a Flask app that takes descriptions and shows the predicted tags as output. Check the `flask` branch. The website is live.