This project is to build a sentiment classifier to classify restaurant reviews. Data is in reviews.csv.
It has two approaches. One is using Bag of Words in Sklearn, the other is using the ULMfit in fastai.
- Tokenization: change the sentence into words.
- Lemmatization: standardize the words, like change went and goes to go. Examples in here.
- Remove stop words: Remove the stop words like the, a, which might influence the model.
- TF - IDF transform: Count the term frequency of the word, and calculate term-frequency times inverse document-frequency. Details in here.
- Build the model using SGD classifier
Fastai.text provides a pretrained NLP model basing on WikiText-103 dataset. All you need to do is to fine-tune the pre-trained model on your dataset and make prediction.
ULMFiT achieves good results by relying on techniques like:
- Discriminative fine-tuning (layer-specific learning rates)
- Slanted triangular learning rates (increasing and then decreasing learning rates over epochs)
- Gradual unfreezing (gradually unfreeze layers, starting from the last)
Runtime
This is deep-learning-NLP, and the harware matters. Colab provides both GPUs (graphics processing units) and TPU (tensor processing units). And if you have a Nvidia GPU, Nvidia cuda will help in speed up the process.