FrancescoGradi/DocumentClassificationwithHANandBERT

Document Classification with HAN, LSTM and BERT

Thanks to Deep Learning, Natural Language Processing (NLP) has advanced considerably over the past few years. This project deals with some of the latest techniques for Document Classification, an important NLP task that consists of assigning a document to one category. When the category is a sentiment (a numeric rating of the text), the task is called Sentiment Analysis, as is the case for the datasets used in this implementation.

This Project

In this project, we replicate some experiments from the literature and compare different approaches:

  • HAN is a model based on Recurrent Neural Networks, in particular GRUs (a gated variant designed to remember long-term dependencies), combined with hierarchical attention mechanisms that treat words and sentences separately. The idea is to let the model pay more or less attention to individual words and sentences when building the document representation required for classification [1].

  • BERT has an architecture based on Transformers: layers built around multi-head attention, together with other techniques such as positional encoding and residual connections, and without RNNs. The Base version has 12 Transformer encoder layers, while Large has 24 [2].

  • LSTM is an RNN architecture based on a single bidirectional LSTM layer (like GRU, LSTM uses gates to mitigate the vanishing gradient problem), with appropriate regularization and without attention mechanisms [3].

  • KD-LSTM: the authors of the previous model also proposed a Knowledge Distillation version of their LSTM, with BERT as the teacher. The main idea is to use a large teacher model (BERT, in this case) to distill information into a smaller, faster student network (LSTM) so that the student achieves better results [4].
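To make the distillation idea more concrete, here is a minimal sketch of a temperature-softened distillation loss for a single example. It is written in plain Python so that it runs without dependencies. The function names, the `temperature` and `alpha` parameters, and the choice of a KL-divergence term are assumptions made for illustration; the loss actually used in this repository or in [4] may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Standard classification loss of the student against the true label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of the hard-label loss and the soft teacher-matching loss.

    `alpha` and `temperature` are illustrative hyperparameters, not values
    taken from this project.
    """
    hard = cross_entropy(student_logits, label)
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    soft = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard-label loss remains.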

Datasets

To test the models, three Sentiment Analysis datasets were chosen:

  • IMDB Small: a short version of IMDB (about 25 000 film reviews) with only two sentiments, positive or negative. This dataset can be retrieved from TensorFlow Datasets.

  • IMDB: the large version, with about 135 000 reviews and 10 sentiments (stars from 1 to 10). It was found here.

  • Yelp 2014: a dataset of restaurant reviews with sentiments from 1 to 5. The exact 2014 version was hard to retrieve, so we downloaded the complete dataset and selected only the 2014 reviews ourselves (Python script in utils.py). The result has about 900 000 reviews versus 1 million in the original dataset (perhaps the original includes some 2013 or 2015 data). The results are similar in any case.
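The year selection described for Yelp 2014 can be sketched as follows. This is not the script from utils.py: the function name is hypothetical, and the sketch assumes that the Yelp dump stores one JSON object per line with a `date` field starting with the year (for example `2014-03-01`), alongside `stars` and `text` fields.

```python
import json


def filter_reviews_by_year(lines, year, date_field="date"):
    """Keep only the reviews whose date starts with the given year.

    `lines` is any iterable of JSON strings (for example an open file).
    The `date` field name is an assumption about the Yelp dump layout.
    """
    prefix = f"{year}-"
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        if str(review.get(date_field, "")).startswith(prefix):
            selected.append(review)
    return selected


# Small in-memory example standing in for the real review file.
sample_lines = [
    '{"date": "2014-03-01", "stars": 5, "text": "Great tacos."}',
    '{"date": "2013-12-31", "stars": 1, "text": "Cold food."}',
    '{"date": "2014-11-20 10:00:00", "stars": 3, "text": "Average."}',
]
reviews_2014 = filter_reviews_by_year(sample_lines, 2014)
```

Reading the file line by line keeps memory usage low, which matters for a dump of this size.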

Results

We report test-set accuracy (in %) for every dataset and model. BERT achieves the best accuracy on Yelp 2014 and ties for the best on IMDB Small, but it is also the heaviest network; KD-LSTM slightly outperforms it on IMDB. LSTM also achieves good results without attention mechanisms. Almost all results are close to those reported in the cited papers.

| Model       | IMDB Small | IMDB | Yelp 2014 |
|-------------|------------|------|-----------|
| HAN         | 86.6       | 46.4 | 69.0      |
| BERT_base   | 94.6       | 57.7 | 77.4      |
| LSTM_reg    | 94.2       | 52.7 | 71.1      |
| KD-LSTM_reg | 94.6       | 58.5 | 71.7      |

Visualization of Attention in HAN

This code can visualize attention in the HAN model (with the hanPredict function), because it is relatively easy to extract partial model weights and reconstruct the most attended words and sentences. Below are two reviews from Yelp: blue marks the most important sentences and red the most relevant words.

HAN PREDICTION: 5, TARGET: 5. Here the word 'bad' has been interpreted correctly based on its context.

HAN PREDICTION: 1, TARGET: 1. The first and last sentences received the most attention; the word 'recommend' has a different sense here because the context is different.
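The attention weights behind this kind of visualization follow the word-level attention of [1]: each hidden state is projected through a tanh layer, scored against a learned context vector, and the scores are normalized with a softmax. The sketch below shows that computation in plain Python. It is not the hanPredict implementation; `attention_pool`, `top_words` and their parameters are hypothetical names used only for illustration.

```python
import math


def attention_pool(hidden_states, weight, bias, context):
    """Return (attention weights, weighted sum of hidden states).

    hidden_states: list of vectors, one per word (or per sentence).
    weight, bias:  parameters of the projection u = tanh(W h + b).
    context:       learned context vector used to score each projection.
    """
    scores = []
    for h in hidden_states:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weight, bias)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled


def top_words(words, weights, k=3):
    """Rank words by attention weight, highest first."""
    ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy example with 2-dimensional hidden states and an identity projection.
toy_states = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
toy_weights, toy_pooled = attention_pool(
    toy_states,
    weight=[[1.0, 0.0], [0.0, 1.0]],
    bias=[0.0, 0.0],
    context=[1.0, 0.0],
)
most_attended = top_words(["great", "the", "awful"], toy_weights, k=1)
```

The same function applies at the sentence level, with sentence vectors in place of word hidden states, which is how blue (sentence) and red (word) highlights can both be derived.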

Reproducing Experiments

Dependencies

This project uses PyTorch and TensorFlow 2 (the latter only for the HAN model). A GPU is needed for training. The code was developed and tested with these main dependencies:

  • Python 3.7.7
  • numpy 1.18.1
  • nltk 3.4.5
  • pandas 1.0.3
  • pytorch 1.4.0
  • tensorboard 2.1.0
  • tensorflow 2.1.0
  • transformers 2.10.0

After cloning this repository, all dependencies can be installed from the command line:

$ pip install -r requirements.txt

HAN preprocessing also requires glove.6B.100d.txt (the 100-dimensional version was chosen for this project), which can be downloaded from the GloVe site.
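GloVe text files store one word per line, followed by its vector components separated by spaces. As a rough illustration of how such a file can be read, here is a minimal sketch in plain Python; `parse_glove` and `load_glove` are hypothetical helpers, not functions from this repository, whose preprocessing may load the file differently.

```python
def parse_glove(lines, dim=100):
    """Build a {word: vector} dictionary from GloVe-formatted lines.

    Each line is expected to look like: word v1 v2 ... v_dim
    Lines with an unexpected number of fields are skipped.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


def load_glove(path, dim=100):
    """Read a GloVe file from disk, for example glove.6B.100d.txt."""
    with open(path, encoding="utf-8") as handle:
        return parse_glove(handle, dim=dim)


# Tiny in-memory example with 3-dimensional vectors.
toy_vectors = parse_glove(["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3"], dim=3)
```

With the real file, `load_glove("glove.6B.100d.txt")` would return 100-dimensional vectors that can then be copied into the embedding matrix of the model.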

How to make it work

The pipeline consists of loading the dataset as a pandas DataFrame (some utility functions are provided; the code expects datasets in the local datasets/ directory), preprocessing (which automatically splits the data into train, validation and test sets), training and evaluating. Here is an example main.py:

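from preprocessing import bertPreprocessing
from train import lstmTrain
from utils import readIMDB

# Load the dataset as a pandas DataFrame, with its name and number of classes
dataset_name, n_classes, data_df = readIMDB()

# Preprocess the data (this also splits it into train, validation and test sets)
bertPreprocessing(dataset_name, data_df, MAX_LEN=128)
# Train the LSTM model on the preprocessed data
lstmTrain(dataset_name, n_classes, TRAIN_BATCH_SIZE=64, EPOCHS=20, LEARNING_RATE=1e-03)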
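The preprocessing step splits the data automatically. For readers who want to see what such a split involves, here is a minimal sketch in plain Python. The function name, the 80/10/10 proportions and the fixed seed are assumptions chosen for illustration; the proportions used by this repository are not stated here.

```python
import random


def split_dataset(rows, valid_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the rows and split them into train, validation and test sets.

    The fractions and the seed are illustrative defaults, not the values
    used by the preprocessing functions of this project.
    """
    rows = list(rows)  # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_valid = int(len(rows) * valid_fraction)
    test = rows[:n_test]
    valid = rows[n_test:n_test + n_valid]
    train = rows[n_test + n_valid:]
    return train, valid, test


# Example with 100 dummy review ids.
train_ids, valid_ids, test_ids = split_dataset(range(100))
```

A fixed seed makes the split reproducible, so that test accuracy is always measured on the same held-out reviews.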

Logs are continuously saved and updated during training. TensorBoard is a good tool for tracking and visualizing metrics while the model trains:

$ tensorboard --logdir logs/IMDB_lstm

At the end of training, the model is saved in models/model_IMDB_lstm/. To evaluate it on the test set, pass the model path to this function:

from predict import lstmEvaluate

lstmEvaluate('IMDB', 10, model_path='models/model_IMDB_lstm/20200618-133908')

Report

A copy of the report (in Italian) can be found here.

References

[1] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

[3] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4046–4051.

[4] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for Document Classification. arXiv.

About

Comparison between RNNs and Attention in Document Classification
