- Intent Classification Service
- Instructions
- API Documentation
- Data Preparation and Model Training
- Evaluation
- Future Improvements
Your task is to implement a neural network-based intent classifier that can be used to provide an inference service via an HTTP service. The boilerplate for the service is implemented in the file server.py, and you'll have to implement the API function for inference as per the API documentation provided below. The neural network interface has been defined in intent_classifer.py. You can add any methods and functionality to this class you deem necessary.
You may use any deep learning library (Tensorflow, Keras, PyTorch, ...) you wish and you can also use pre-existing components for building the network architecture if these would be useful in real-life production systems. Provide tooling and instructions for training the network from scratch.
Also provide a jupyter notebook for model development which trains and tests the model. The final output of this notebook should be trained models with respect to their tests. Evaluation metrics should include the following:
- Accuracy
- Precision
- Recall
- F1
- Any other metric that you think is suitable for the comparison
In addition, the same notebook provides a section to evaluate models in production. Assuming the following scenario:
You have both of the models in production and no labeled data is available to you. How would you compare them? Which metrics would you use for this kind of comparison? For example, you can use metrics based on confidence values or related ones.
In this notebook we train a multilingual intent classifier. For the purposes of this POC, I selected the following languages:
- English
- Hindi
- Spanish
The given ATIS dataset is provided in English; I created a parallel dataset for Hindi and Spanish using Google translation.
In the training dataset, we have the following distribution of data:
flight 3426
airfare 403
ground_service 235
airline 148
abbreviation 108
aircraft 78
flight_time 52
quantity 49
distance 20
city 18
airport 18
ground_fare 17
flight+airfare 17
capacity 16
flight_no 12
meal 6
restriction 5
airline+flight_no 2
ground_service+ground_fare 1
airfare+flight_time 1
cheapest 1
aircraft+flight+flight_no 1
From what I can observe, the data is quite unbalanced, and some of the classes seem to be combinations of others (for example, flight+airfare), which suggests that we can model this as a multi-label classification problem. A user query may also carry multiple intents, and a multi-label model is a good choice for handling that scenario.
For this reason, I have transformed the data into a multi-label dataset with the following 17 classes:
'ground_service', 'abbreviation', 'ground_fare', 'airline', 'city',
'aircraft', 'flight_no', 'airport', 'flight', 'quantity', 'meal',
'capacity', 'restriction', 'airfare', 'distance', 'flight_time', 'cheapest'
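The transformation described above can be sketched as follows. This is a minimal illustration, assuming that combined intents are split on the `+` separator seen in the class distribution; the helper names (`split_intent`, `fit_classes`, `to_multi_hot`) are hypothetical and not taken from the repository, which may use a fitted binarizer instead.

```python
def split_intent(label):
    """Split a combined ATIS label such as 'flight+airfare' into its parts."""
    return label.split("+")


def fit_classes(labels):
    """Collect the sorted set of atomic intents seen in the raw labels."""
    return sorted({part for label in labels for part in split_intent(label)})


def to_multi_hot(label, classes):
    """Encode one raw label as a 0/1 vector aligned with `classes`."""
    parts = set(split_intent(label))
    return [1 if cls in parts else 0 for cls in classes]


# Illustrative raw labels taken from the class distribution above.
raw_labels = ["flight", "flight+airfare", "airline+flight_no"]
classes = fit_classes(raw_labels)
vector = to_multi_hot("flight+airfare", classes)
print(classes)
print(vector)
```

Applied to the full training set, the same splitting step is what would turn the 22 raw classes into the 17 atomic classes listed above.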
data
- The provided ATIS dataset - download link:
https://drive.google.com/drive/folders/1I2cALZXOIaz9WnmdtubpVavflkUvFRHG?usp=share_link
data_mlabel
- The multilabel version of the provided ATIS dataset - download link:
https://drive.google.com/drive/folders/1-0VzuUa16j3nEcqywHVKlJONMPg0F40S?usp=share_link
multilingual_data
- Parallel multilingual translated data of the provided ATIS dataset - download link:
https://drive.google.com/drive/folders/1A-t73esVP27KbC9eAEMBv6klu8jpQSd-?usp=share_link
The model architecture is a simple one, which I believe is a strong baseline for the task and can be used for handling multi-lingual queries.
- Encoder: As I am trying to train a multilingual model, the first step in the NLP pipeline is bert-base-multilingual-cased from huggingface-transformers. This BERT variant is pretrained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective. The approach can easily be extended to handle queries in all 104 languages; however, performance might differ between languages depending on the amount of data each had in multilingual BERT pretraining.
- Decoder (classifier): The decoder is a single linear layer mapping the encoder output to our 17 output classes.
- Loss Function: Binary Cross Entropy is a suitable loss function for multi-label modeling in this scenario.
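To make the loss concrete, here is a small pure-Python sketch of binary cross entropy computed directly on logits, using the usual numerically stable form. It is an illustration of the formula only, not the repository's training code; the function name `bce_with_logits` is hypothetical, and mean reduction over the labels is an assumption.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross entropy over independent labels.

    Uses the stable identity
        loss(z, y) = max(z, 0) - z * y + log(1 + exp(-|z|)),
    which equals -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
    without overflowing for large |z|.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# One query with three labels: the first two are active, the third is not.
loss = bce_with_logits([2.0, 0.0, -3.0], [1.0, 1.0, 0.0])
print(round(loss, 4))
```

Because each label contributes its own sigmoid term, several intents can be active for the same query, which is what distinguishes this setup from a softmax over mutually exclusive classes.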
IntentClassifier(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(119547, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0-11): 12 x BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(classifier): Linear(in_features=768, out_features=17, bias=True)
(criterion): BCEWithLogitsLoss()
)
| Name | Type | Params
-------------------------------------------------
0 | bert | BertModel | 177 M
1 | classifier | Linear | 13.1 K
2 | criterion | BCEWithLogitsLoss | 0
-------------------------------------------------
177 M Trainable params
0 Non-trainable params
177 M Total params
711.466 Total estimated model params size (MB)
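The figures in the summary can be checked by hand from the layer shapes printed above. The sketch below redoes that arithmetic; the only assumption beyond the printed shapes is that each parameter is stored as a 4-byte float32 when estimating the size in MB.

```python
def linear(n_in, n_out):
    """Parameters of a Linear layer with bias."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Parameters of a LayerNorm with elementwise affine (weight + bias)."""
    return 2 * dim


hidden, ffn = 768, 3072
vocab, positions, token_types = 119547, 512, 2
num_layers, num_classes = 12, 17

embeddings = (vocab + positions + token_types) * hidden + layer_norm(hidden)
bert_layer = (
    4 * linear(hidden, hidden)  # query, key, value, attention output dense
    + layer_norm(hidden)        # attention output LayerNorm
    + linear(hidden, ffn)       # intermediate dense
    + linear(ffn, hidden)       # output dense
    + layer_norm(hidden)        # output LayerNorm
)
pooler = linear(hidden, hidden)

bert_params = embeddings + num_layers * bert_layer + pooler
classifier_params = linear(hidden, num_classes)
total_params = bert_params + classifier_params
size_mb = total_params * 4 / 1e6  # assumes float32 weights

print(bert_params, classifier_params, total_params, round(size_mb, 3))
```

The classifier comes out at 13,073 parameters (the 13.1 K in the table), and the total of 177,866,513 parameters at 4 bytes each gives the reported 711.466 MB.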
- Create a conda environment using:
conda create -n "intent-clf-env" python=3.10.11
- Activate the conda environment:
conda activate intent-clf-env
- Pylint setup:
conda install -c conda-forge pre_commit
pre-commit install
pre-commit autoupdate
- Note: these dependencies are not kept in requirements.txt, as they are only required for development purposes.
- Install dependencies:
pip install -r requirements.txt
- In the repository you will find a .env.bkp file. Create a copy of it:
cp .env.bkp .env
- Set up the environment variables:
PORT=8080
CHECKPOINT_PATH="./model/best-checkpoint-v1.ckpt"
ML_BINARIZER_PATH="./model/ml_binarizer.pkl"
- Download the model files into the models directory:
best-checkpoint-v1.ckpt - https://drive.google.com/file/d/1-P-mIf9ChF04LzZ63EZiitUiGthVfYzm/view?usp=share_link
ml_binarizer.pkl - https://drive.google.com/file/d/1-Q5671xZmR54XSChXq9yv1yFN41-tZVJ/view?usp=share_link
- Running the Flask server:
python server.py
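For orientation, the three environment variables above could be consumed along these lines. This is only a sketch: how server.py actually reads its configuration is not shown here, and the `load_settings` helper and the returned dictionary keys are hypothetical.

```python
import os


def load_settings(env=None):
    """Read the service settings from a mapping (defaults to os.environ).

    Hypothetical helper: fails early with a clear message when a required
    variable from the .env file is missing.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("CHECKPOINT_PATH", "ML_BINARIZER_PATH") if k not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "port": int(env.get("PORT", "8080")),
        "checkpoint_path": env["CHECKPOINT_PATH"],
        "ml_binarizer_path": env["ML_BINARIZER_PATH"],
    }


# Example with an explicit mapping instead of the real process environment.
settings = load_settings({
    "PORT": "8080",
    "CHECKPOINT_PATH": "./model/best-checkpoint-v1.ckpt",
    "ML_BINARIZER_PATH": "./model/ml_binarizer.pkl",
})
print(settings)
```

Failing at startup when a path is missing tends to be easier to debug than a failure on the first inference request.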
- The code has a Dockerfile pre-setup.
- In the repository you will find a .env.bkp file. Create a copy of it:
cp .env.bkp .env
- Set up the environment variables:
PORT=8080
CHECKPOINT_PATH="./model/best-checkpoint-v1.ckpt"
ML_BINARIZER_PATH="./model/ml_binarizer.pkl"
- Download the model files into the models directory:
best-checkpoint-v1.ckpt - https://drive.google.com/file/d/1-P-mIf9ChF04LzZ63EZiitUiGthVfYzm/view?usp=share_link
ml_binarizer.pkl - https://drive.google.com/file/d/1-Q5671xZmR54XSChXq9yv1yFN41-tZVJ/view?usp=share_link
- Build docker images:
sudo docker build . -t lordzuko/intent-clf-service:v1.0.0
- Running docker container
sudo docker run -p 8080:8080 lordzuko/intent-clf-service:v1.0.0
- DockerHub
docker pull lordzuko/intent-clf-service:v1.0.0
- The Dockerfile currently does not support GPU usage even though the inference code works with GPU. This is because the chosen base image does not have NVIDIA-CUDA drivers installed.
The documentation shows how to use the API with python, cURL, etc.
- POSTMAN documentation for the service can be found here.
- The process for model training and evaluation is described in notebook:
notebooks/multi_lingual_multilabel_intent_clf.ipynb
- The process for model evaluation in a production scenario is described in notebook:
notebooks/Production_Evaluation.ipynb
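As a rough illustration of label-free comparison, the sketch below computes a few confidence-based statistics from per-label probabilities of two models on the same unlabeled queries. It is not taken from the notebook; the function names, the 0.5 threshold, and the example probabilities are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a single sigmoid output; 0 when fully certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def confidence_summary(prob_rows):
    """Average top confidence and average per-label entropy over queries."""
    n = len(prob_rows)
    mean_top = sum(max(row) for row in prob_rows) / n
    mean_entropy = sum(
        sum(binary_entropy(p) for p in row) / len(row) for row in prob_rows
    ) / n
    return {"mean_top_confidence": mean_top, "mean_label_entropy": mean_entropy}


def agreement_rate(rows_a, rows_b, threshold=0.5):
    """Fraction of queries where both models predict the same label set."""
    same = sum(
        [p >= threshold for p in a] == [p >= threshold for p in b]
        for a, b in zip(rows_a, rows_b)
    )
    return same / len(rows_a)


# Hypothetical probabilities from two models for two queries and two labels.
model_a = [[0.9, 0.1], [0.9, 0.1]]
model_b = [[0.8, 0.2], [0.2, 0.8]]
print(confidence_summary(model_a))
print(confidence_summary(model_b))
print(agreement_rate(model_a, model_b))
```

Higher confidence and lower entropy do not by themselves mean a model is more accurate, so statistics like these are best read alongside the disagreement cases, which are natural candidates for manual labeling.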
Best Threshold: 0.30
Train Accuracy: 0.999
Val Accuracy: 0.998
Test Accuracy: 0.995
Best Threshold: 0.20
Train AUROC: 0.874
Val AUROC: 0.833
Test AUROC: 0.805
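A decision threshold like the ones reported above is applied to the sigmoid of each logit to decide which intents are returned. The sketch below shows that step in isolation; the `predict_labels` helper and the example logits are hypothetical, and how the best threshold itself was selected is described in the notebook rather than here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_labels(logits, classes, threshold=0.30):
    """Return every class whose sigmoid probability reaches the threshold."""
    return [cls for cls, z in zip(classes, logits) if sigmoid(z) >= threshold]


# Hypothetical logits for three of the 17 classes.
classes = ["flight", "airfare", "meal"]
logits = [2.0, -0.5, -4.0]
print(predict_labels(logits, classes, threshold=0.30))
```

Lowering the threshold trades precision for recall: more labels clear the bar, including more false positives for rare classes.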
Label | Validation AUROC | Test AUROC |
---|---|---|
abbreviation | 1.000000 | 0.999959 |
aircraft | 0.998834 | 0.997295 |
airfare | 0.999978 | 0.989687 |
airline | 0.999845 | 0.992285 |
airport | 1.000000 | 0.999959 |
capacity | 1.000000 | 0.999260 |
cheapest | 0.000000 | 0.000000 |
city | 1.000000 | 0.936331 |
distance | 1.000000 | 1.000000 |
flight | 0.999852 | 0.989004 |
flight_no | 0.992795 | 1.000000 |
flight_time | 0.992006 | 1.000000 |
ground_fare | 1.000000 | 0.997722 |
ground_service | 0.998485 | 0.999989 |
meal | 0.992806 | 0.926189 |
quantity | 0.999711 | 0.996064 |
restriction | 0.000000 | 0.000000 |
Label | Train Support | Validation Support | Test Support | Validation Precision | Test Precision | Validation Recall | Test Recall | Validation F1-Score | Test F1-Score |
---|---|---|---|---|---|---|---|---|---|
abbreviation | 309.0 | 15.0 | 78.0 | 1.00 | 0.97 | 1.00 | 0.99 | 1.00 | 0.98 |
aircraft | 227.0 | 10.0 | 24.0 | 0.83 | 0.88 | 1.00 | 0.88 | 0.91 | 0.88 |
airfare | 1190.0 | 73.0 | 183.0 | 0.97 | 0.97 | 1.00 | 0.93 | 0.99 | 0.95 |
airline | 421.0 | 29.0 | 90.0 | 1.00 | 0.96 | 0.97 | 0.96 | 0.98 | 0.96 |
airport | 53.0 | 1.0 | 39.0 | 1.00 | 0.95 | 1.00 | 0.97 | 1.00 | 0.96 |
capacity | 46.0 | 2.0 | 63.0 | 1.00 | 1.00 | 1.00 | 0.94 | 1.00 | 0.97 |
cheapest | 3.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
city | 51.0 | 3.0 | 15.0 | 1.00 | 1.00 | 1.00 | 0.40 | 1.00 | 0.57 |
distance | 58.0 | 2.0 | 30.0 | 1.00 | 0.39 | 1.00 | 1.00 | 1.00 | 0.57 |
flight | 9822.0 | 510.0 | 1881.0 | 1.00 | 0.99 | 0.99 | 0.98 | 1.00 | 0.99 |
flight_no | 43.0 | 2.0 | 27.0 | 1.00 | 1.00 | 0.50 | 1.00 | 0.67 | 1.00 |
flight_time | 151.0 | 8.0 | 3.0 | 0.78 | 0.27 | 0.88 | 1.00 | 0.82 | 0.43 |
ground_fare | 51.0 | 3.0 | 21.0 | 1.00 | 0.29 | 1.00 | 0.95 | 1.00 | 0.44 |
ground_service | 672.0 | 36.0 | 108.0 | 0.97 | 0.92 | 1.00 | 1.00 | 0.99 | 0.96 |
meal | 17.0 | 1.0 | 18.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
quantity | 142.0 | 5.0 | 9.0 | 0.83 | 0.21 | 1.00 | 1.00 | 0.91 | 0.35 |
restriction | 15.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Summary | Train Support | Validation Support | Test Support | Validation Precision | Test Precision | Validation Recall | Test Recall | Validation F1-Score | Test F1-Score |
---|---|---|---|---|---|---|---|---|---|
micro avg | 13271.0 | 700.0 | 2589.0 | 0.99 | 0.93 | 0.99 | 0.97 | 0.99 | 0.95 |
macro avg | 13271.0 | 700.0 | 2589.0 | 0.79 | 0.64 | 0.78 | 0.76 | 0.78 | 0.65 |
weighted avg | 13271.0 | 700.0 | 2589.0 | 0.99 | 0.96 | 0.99 | 0.97 | 0.99 | 0.96 |
samples avg | 13271.0 | 700.0 | 2589.0 | 0.99 | 0.95 | 0.99 | 0.97 | 0.99 | 0.96 |
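The gap between the micro and macro averages follows from how the two are computed: micro pools the counts of all labels, so frequent classes such as flight dominate, while macro averages per-label scores, so labels with no correct predictions pull it down. A minimal sketch with made-up counts (not the counts behind the tables above) shows the effect:

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, false negatives (0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps label -> (tp, fp, fn); returns (micro_f1, macro_f1)."""
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Illustrative only: one frequent, well-predicted label and one label
# that is never predicted and has no support.
toy_counts = {"frequent": (10, 0, 0), "rare": (0, 0, 0)}
micro, macro = micro_macro_f1(toy_counts)
print(micro, macro)
```

In this toy case the micro average stays at 1.0 while the macro average drops to 0.5, which mirrors how zero-score labels such as cheapest, restriction, and meal lower the macro row in the summary table.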
- The model is currently loaded from a checkpoint that also contains optimizer state and gradient placeholders, which inflates the file size. A better way would be to save the model's state dict and load the model from it. This would reduce the model file size and, ultimately, the size of the resulting Docker image.
- Evaluation is not done at the language level, which can be important for making language-specific improvements and updates.
- The /intent API endpoint does not provide an api_version field, which could be important for keeping track of the version of the intent classification service that produced the output. Downstream applications might use it to handle business logic or update themselves according to api_version.
  - It can also be important in A/B testing.
  - Data management could also take advantage of this field.
- The /intent API does not check the language of the query. This could be problematic, as our model currently supports only 3 languages, whereas the tokenizer we are using supports 104 languages. Not detecting unsupported languages could lead to unforeseen model performance issues.
- Data imbalance among intent classes is currently a bottleneck, and upsampling needs to be done for low-data classes. We can use translation and paraphrasing to tackle the data imbalance.
- Update the Dockerfile to support GPU usage.
- The NannyML installation has a numpy version mismatch with the rest of the code, so it might create issues when building the Docker image; this has not been tested yet. However, given that it is not required during inference, I suggest installing it in a separate environment for testing for now.
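Two of the improvements above, the api_version field and the language check, could look roughly like the following. This is a sketch under stated assumptions: the response fields, the `build_intent_response` helper, the version string (borrowed from the Docker image tag), and the ISO language codes are all hypothetical, and language detection itself is left to the caller.

```python
API_VERSION = "v1.0.0"  # hypothetical; mirrors the Docker image tag
SUPPORTED_LANGUAGES = {"en", "hi", "es"}  # English, Hindi, Spanish


def build_intent_response(intents, language):
    """Return (body, http_status) for a hypothetical /intent response.

    `language` is assumed to come from an upstream language detector.
    Unsupported languages are rejected instead of being scored silently.
    """
    if language not in SUPPORTED_LANGUAGES:
        body = {
            "api_version": API_VERSION,
            "error": "unsupported language: " + language,
        }
        return body, 400
    body = {"api_version": API_VERSION, "language": language, "intents": intents}
    return body, 200


ok_body, ok_status = build_intent_response(["flight", "airfare"], "en")
bad_body, bad_status = build_intent_response([], "fr")
print(ok_status, ok_body)
print(bad_status, bad_body)
```

Including the version in error responses as well keeps downstream logging and A/B analysis consistent across both outcomes.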