HARRY: Ron, please play 'Tere Naina'...
RON: Ummm.. Okay.. Playing 'Tere Naina'...
The question is: how did Ron complete the task here? Roughly, he first identified the intent of the request, i.e., what needs to be done. Here, the intent was 'PlayMusic', so Ron started a music app and played the requested song.
What if we could automatically detect the user's intent and respond accordingly as early as possible, keeping the user engaged with the app? A win-win for both the user and the app creator.
Intent classification is the task of classifying a piece of text into one of several predefined intents (multiclass classification).
The aim of this project is to perform multiclass intent classification (with topic modelling as a subtask).
The data file contains approximately 15,000 data points, split into training, validation, and test sets stored in three separate CSV files. Data analysis is performed on the training set. The data consists of two columns: text and intent, where "intent" is the target column containing 7 classes with an approximately equal distribution. The text length ranges from roughly 10 to 125 characters, and from about 2 to 25 words. The most frequently observed NER tag across the documents is "Cardinal"; "Date", "Person", and "GPE" tags are also common, and the dominant NER tags per class were identified as well. Topic modelling is performed using LDA, and the results are visualized with pyLDAvis (a sketch follows below).
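A minimal sketch of the LDA + pyLDAvis step using Gensim, assuming the training texts have already been tokenized; the toy corpus, `num_topics=7`, and output file name are illustrative assumptions, not the repository's exact code.

```python
import gensim
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models  # `pyLDAvis.gensim` in older pyLDAvis versions

# Toy tokenized corpus standing in for the preprocessed training texts.
texts = [["play", "some", "music", "please"],
         ["book", "a", "table", "for", "two"],
         ["what", "is", "the", "weather", "today"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# One topic per intent class is a natural starting point (7 classes).
lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                             num_topics=7, passes=10, random_state=42)

# Interactive topic visualization, saved as a standalone HTML page.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```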
Intent classification is carried out using MLP, LSTM, and BERT models (along with various embedding schemes), and a comparative analysis is performed. In the architecture shorthand below, FC = fully connected layer, DO = dropout, and GAP1D/GMP1D = GlobalAveragePooling1D/GlobalMaxPooling1D. Summarizing the results:
- Model 1.1 :
INPUT => EMBEDDING (int encoding) => FLATTEN => FC => RELU => DO => FC => SOFTMAX
- Model 1.2 :
INPUT => EMBEDDING (int encoding) => GAP1D => FC => RELU => DO => FC => SOFTMAX
Model 1.1's Flatten layer results in more trainable parameters than model 1.2, which uses GlobalAveragePooling1D. Model 1.2 achieves better performance (a Keras sketch follows below).
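A minimal Keras sketch of Model 1.2 (the GlobalAveragePooling1D variant); the vocabulary size, embedding dimension, hidden width, and dropout rate are illustrative assumptions, not the repository's exact values.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 10000, 16, 7  # assumed sizes

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # integer-encoded tokens in, dense vectors out
    layers.GlobalAveragePooling1D(),          # Model 1.1 uses Flatten here instead
    layers.Dense(64, activation="relu"),      # FC => RELU
    layers.Dropout(0.3),                      # DO
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Averaging over the sequence dimension makes the classifier head independent of sequence length, which is why Model 1.2 ends up with fewer trainable parameters than the Flatten-based Model 1.1.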
- Model 2.1 :
INPUT => EMBEDDING (Glove Embedding) => GMP1D => FC => RELU => DO => FC => RELU => DO => FC => SOFTMAX
- Model 2.2 :
INPUT => EMBEDDING (Glove Embedding) => GAP1D => FC => RELU => DO => FC => RELU => DO => FC => SOFTMAX
Model 2.2 (GlobalAveragePooling1D) outperforms model 2.1 (GlobalMaxPooling1D) while needing fewer epochs and less wall time; a sketch follows below.
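A hedged sketch of Model 2.2 with a frozen GloVe embedding layer; the embedding matrix here is a zero placeholder that would normally be filled from a GloVe file (e.g. glove.6B.50d.txt), and all sizes are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, GLOVE_DIM, NUM_CLASSES = 10000, 50, 7    # assumed sizes
embedding_matrix = np.zeros((VOCAB_SIZE, GLOVE_DIM)) # filled from GloVe vectors in practice

model = tf.keras.Sequential([
    layers.Embedding(
        VOCAB_SIZE, GLOVE_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                  # frozen pretrained embeddings
    layers.GlobalAveragePooling1D(),       # Model 2.1 uses GlobalMaxPooling1D here
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```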
- Model 3.1 (Vanilla LSTM) :
INPUT => EMBEDDING (int encoding) => LSTM => FC => RELU => DO => FC => SOFTMAX
- Model 3.2 (BidirectionalLSTM) :
INPUT => EMBEDDING (int encoding) => BiLSTM => FC => RELU => DO => FC => SOFTMAX
- Model 3.3 (Stacked LSTM) :
INPUT => EMBEDDING (int encoding) => LSTM => LSTM => FC => RELU => DO => FC => SOFTMAX
Model 3.3 (stacked LSTM) achieves similar or even better performance than model 3.1 (vanilla LSTM) with fewer parameters, but needs more training epochs to get there. Model 3.2 (bidirectional LSTM) performs best of the three (see the sketch below).
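Illustrative Keras versions of the three LSTM variants; unit counts and the dropout rate are assumptions for the sketch, not the repository's settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 10000, 16, 7  # assumed sizes

def lstm_block(variant):
    """Return the recurrent part of Models 3.1-3.3."""
    if variant == "vanilla":        # Model 3.1
        return [layers.LSTM(32)]
    if variant == "bidirectional":  # Model 3.2
        return [layers.Bidirectional(layers.LSTM(32))]
    if variant == "stacked":        # Model 3.3
        return [layers.LSTM(32, return_sequences=True), layers.LSTM(16)]
    raise ValueError(f"unknown variant: {variant}")

model = tf.keras.Sequential(
    [layers.Embedding(VOCAB_SIZE, EMBED_DIM)]
    + lstm_block("bidirectional")
    + [layers.Dense(64, activation="relu"),
       layers.Dropout(0.3),
       layers.Dense(NUM_CLASSES, activation="softmax")]
)
```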
- Model 4 (small BERT) :
The model quickly overfits the data, most likely because the dataset isn't difficult or sophisticated enough for a BERT-like model to learn. A TF-Hub sketch follows below.
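A hedged sketch of a small-BERT classifier built from TensorFlow Hub, in the spirit of the `6_smallBert` notebook; the hub handles are common public ones from the small-BERT collection and may differ from the ones the repository actually uses.

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops the preprocessor needs

PREPROCESS = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"

text_in = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
encoder_inputs = hub.KerasLayer(PREPROCESS, name="preprocessing")(text_in)
outputs = hub.KerasLayer(ENCODER, trainable=True, name="encoder")(encoder_inputs)
pooled = outputs["pooled_output"]                 # whole-sentence embedding
x = tf.keras.layers.Dropout(0.1)(pooled)
probs = tf.keras.layers.Dense(7, activation="softmax")(x)  # 7 intent classes
model = tf.keras.Model(text_in, probs)
```

Given how quickly it overfits, a small epoch budget (5 epochs in the results below) or early stopping keeps training in check.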
The following table summarizes the results:
Model No. | Trainable Params | Epochs | Train Acc | Train Loss | Valid Acc | Valid Loss | Test Acc | Test Loss | Wall Time |
---|---|---|---|---|---|---|---|---|---|
1.1 | 83,975 | 10 | 0.9690 | 0.1017 | 0.9675 | 0.1363 | 0.9671 | 0.1276 | 9.2s |
1.2 | 76,039 | 15 | 0.9885 | 0.0437 | 0.9712 | 0.1214 | 0.9710 | 0.1055 | 12.9s |
2.1 | 8,775 | 50 | 0.9604 | 0.1161 | 0.9420 | 0.1941 | 0.9406 | 0.2205 | 36.8s |
2.2 | 8,775 | 30 | 0.9661 | 0.1063 | 0.9564 | 0.1215 | 0.9562 | 0.1206 | 22.1s |
3.1 | 76,695 | 20 | 0.9677 | 0.0901 | 0.9658 | 0.3097 | 0.9617 | 0.3599 | 5m31s |
3.2 | 77,799 | 20 | 0.9654 | 0.1004 | 0.9672 | 0.1602 | 0.9632 | 0.2153 | 7m29s |
3.3 | 76,159 | 25 | 0.9549 | 0.1240 | 0.9447 | 0.3509 | 0.9499 | 0.3423 | 11m29s |
4 | - | 5 | 0.9885 | 0.0437 | 0.9712 | 0.1214 | 0.9757 | 0.1063 | 2m33s |
Overall, model 4 (the BERT model) delivers the best performance, but it also has by far the heaviest resource requirements. Model 1.2 achieves nearly identical performance with far fewer parameters.
```
├── Data Files            # Data files
│   └── ...
├── Models                # Saved models
│   └── ...
├── Other Files           # Miscellaneous files
│   └── ...
├── 1_data_analysis       # Data analysis notebook
├── 2_topic_modelling     # Topic modelling notebook
├── 3_mlp_intEncoding     # MLP classifier (int encoding) notebook
├── 4_mlp_gloveEmbedding  # MLP classifier (GloVe embedding) notebook
├── 5_lstm                # LSTM classifier (vanilla, stacked, bidirectional) notebook
└── 6_smallBert           # smallBert classifier notebook
```
- Language: Python
- Libraries: TensorFlow, Keras, TensorFlow Hub, scikit-learn, Gensim, NLTK, spaCy, pyLDAvis, re, wordcloud, Matplotlib, Seaborn, NumPy, Pandas.
Notebooks can be run directly on Google Colab (make sure to upload the required .py files to the working directory if needed).
The codebase is thoroughly documented with docstrings and comments; please review these annotations, as they explain how the code works.