This repository contains the implementation of a fast, efficient user intent classification system built on an ensemble of logistic regression, SVM, and k-NN classifiers. The model uses text embeddings from the jinaai/jina-embeddings-v2-base-es model to provide high accuracy while being resource-efficient compared to large language models (LLMs).
Medium (free) story link (for more details)
One of the key advantages of this model is that it does not require a GPU. The ensemble of logistic regression, SVM, and k-NN performs well on a CPU, which makes it a viable and cost-effective alternative to running large language models or transformers in production, where computational requirements and costs are typically much higher. The model is therefore practical for real-time applications without specialized hardware.
This approach is also more practical than training a BERT or SBERT classifier, because neither the embeddings nor the ensemble require significant computational power. Pre-trained embeddings combined with an efficient ensemble of traditional machine learning models achieve high accuracy at low computational cost.
In this project, I address the challenge of user intent classification in conversational AI pipelines, particularly for Spanish-language retrieval-augmented generation (RAG) systems. By combining multiple machine learning algorithms and calibrating their probabilities, the ensemble achieves strong performance (a weighted F1-score of 0.99 in the report below). This approach is designed to be significantly faster and more cost-effective than LLMs, making it suitable for real-time applications.
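One way to picture the combination step is soft voting: each classifier emits a calibrated probability per intent, the probabilities are averaged, and the highest average wins. The sketch below is a minimal, dependency-free illustration under that assumption. This README does not state the exact combination rule, and the helper name `soft_vote` and all probability values are hypothetical. In scikit-learn, the same idea is available through `VotingClassifier(voting="soft")`, with `CalibratedClassifierCV` for calibrating models such as SVMs.

```python
from typing import Dict, List, Tuple


def soft_vote(per_model_probs: List[Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Average per-class probabilities across models and return the top class.

    Hypothetical helper: assumes every model reports calibrated probabilities
    over the same set of intent labels.
    """
    if not per_model_probs:
        raise ValueError("need at least one model's probabilities")
    labels = per_model_probs[0].keys()
    averaged = {
        label: sum(probs[label] for probs in per_model_probs) / len(per_model_probs)
        for label in labels
    }
    best = max(averaged, key=averaged.get)
    return best, averaged


# Made-up probabilities for a single utterance (illustrative values only).
logreg = {"Lead": 0.70, "Contacto": 0.20, "Otros": 0.10}
svm = {"Lead": 0.55, "Contacto": 0.35, "Otros": 0.10}
knn = {"Lead": 0.40, "Contacto": 0.50, "Otros": 0.10}

intent, scores = soft_vote([logreg, svm, knn])
print(intent, round(scores[intent], 2))  # Lead 0.55
```

Averaging only makes sense when the scores are comparable across models, which is why calibration matters here: k-NN disagrees in this example, but the two more confident models carry the vote.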
- Censorship
- Others
- Lead
- Contact
- Directions
- Meet
- Negation
- Affirmation
- Casual Chat
Intent | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
Afirmación | 1.00 | 1.00 | 1.00 | 14 |
Censura | 0.99 | 1.00 | 0.99 | 539 |
Charla | 1.00 | 0.67 | 0.80 | 15 |
Contacto | 0.97 | 1.00 | 0.99 | 38 |
Direcciones | 1.00 | 1.00 | 1.00 | 71 |
Lead | 0.99 | 0.99 | 0.99 | 140 |
Meet | 0.97 | 1.00 | 0.98 | 29 |
Negación | 1.00 | 0.94 | 0.97 | 18 |
Otros | 0.98 | 0.97 | 0.98 | 171 |
Micro Avg | 0.99 | 0.99 | 0.99 | 1035 |
Macro Avg | 0.99 | 0.95 | 0.97 | 1035 |
Weighted Avg | 0.99 | 0.99 | 0.99 | 1035 |
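
The macro and weighted rows differ only in how the per-class scores are averaged. As a quick check, the snippet below recomputes both averages for recall from the per-class values in the table. Because the inputs are already rounded to two decimals, the results can only be expected to match the table to two decimals.

```python
# (recall, support) per intent, copied from the table above.
rows = {
    "Afirmación": (1.00, 14),
    "Censura": (1.00, 539),
    "Charla": (0.67, 15),
    "Contacto": (1.00, 38),
    "Direcciones": (1.00, 71),
    "Lead": (0.99, 140),
    "Meet": (1.00, 29),
    "Negación": (0.94, 18),
    "Otros": (0.97, 171),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every intent counts equally, regardless of its size.
macro_recall = sum(recall for recall, _ in rows.values()) / len(rows)

# Weighted average: each intent counts in proportion to its support.
weighted_recall = sum(recall * support for recall, support in rows.values()) / total_support

print(total_support, round(macro_recall, 2), round(weighted_recall, 2))  # 1035 0.95 0.99
```

The macro average treats every intent equally, so the low recall of Charla (0.67 on 15 examples) pulls it down to 0.95, while the weighted average is dominated by Censura (539 of the 1035 examples).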
This project showcases a fast and efficient method for user intent classification using an ensemble of machine learning models and text embeddings. While the current model achieves high accuracy, there is always room for improvement, especially in enhancing the training dataset for better generalization.
I invite you to clone the repository, test the model, and contribute to improving the dataset and model performance. Your feedback and suggestions are highly appreciated.
This model was trained on a Spanish text corpus, and only on texts that represent questions or requests. It won't perform well with multi-turn chat history or long multiline texts; it is recommended for texts shorter than 128 tokens. The dataset is also not big enough to generalize well, but the intention of this repository is to lay the foundations for an ensemble of models that, with a larger dataset, should generalize very well for a real use case. DO NOT USE IN PRODUCTION ENVIRONMENTS.
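A simple guard can enforce the short-input recommendation before a text reaches the classifier. The sketch below is hypothetical: `MAX_TOKENS`, `is_supported_input`, and the whitespace tokenizer are stand-ins that are not part of this repository, and a real check should count tokens with the embedding model's own tokenizer.

```python
from typing import Callable, List

MAX_TOKENS = 128  # recommended upper bound from the note above


def whitespace_tokenize(text: str) -> List[str]:
    """Crude stand-in tokenizer; real subword counts will usually be higher."""
    return text.split()


def is_supported_input(
    text: str,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    max_tokens: int = MAX_TOKENS,
) -> bool:
    """Return True for a non-empty, single-line text under the token limit."""
    stripped = text.strip()
    if not stripped or "\n" in stripped:
        return False  # empty or multiline input is outside the intended scope
    return len(tokenize(stripped)) < max_tokens


print(is_supported_input("¿Dónde queda la oficina?"))  # True
print(is_supported_input("hola " * 200))  # False
```

Passing the tokenizer as a parameter keeps the guard independent of any particular library, so the stand-in can be swapped for the real tokenizer without changing the calling code.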
If you build a larger dataset with longer text sequences or multi-turn conversations and retrain the model, it should work pretty well; the Jina embedding model supports up to 8k tokens =)
If you found this project helpful, please give it a star and follow me for more insights on efficient machine learning techniques and natural language understanding innovations. Let's collaborate and push the boundaries of what's possible in user intent classification.
From Latam with ❤️