Modeling Complex Interactions in Long Documents for Aspect-Based Sentiment Analysis


This work introduces a hierarchical Transformer-based architecture called DART (Document-level Aspect-based Representation from Transformers), which effectively encodes information at different levels of granularity and uses attention aggregation mechanisms to learn local and global aspect-specific document representations.

Table of Contents

  • Project Structure
  • Quickstart
  • Introduction
  • Datasets
  • FAQ
  • Citation
  • Contact information

Project Structure

The directory structure of this project is:

├── configs            <- Hydra configuration files
│   ├── logdir            <- Logger configs
│   ├── data              <- Datamodule configs
│   ├── model             <- Modelmodule configs
│   ├── experiment        <- Experiment configs
│   └── cfg.yaml          <- Main config for training
│
├── dataset            <- Project data
├── datamodules        <- Datamodules (TripAdvisor, BeerAdvocate, SocialNews)
├── models             <- Models (DART, Longformer, BigBird)
├── logs               <- Logs generated by hydra and lightning loggers
├── outputs            <- Save generated data
├── utils              <- Utility scripts
│
├── run.py             <- Run training and evaluation
└── README.md

Quickstart 🚀

Installation

Step 0. Download and install Miniconda from the official website.
Step 1. Install DART and dependencies.
Step 2. Specify root_dir in configs/cfg.yaml

# clone project
git clone https://github.com/YanZehong/dart
cd dart

# [OPTIONAL] create conda environment and activate it
conda create -n dart -y python=3.10 pip
conda activate dart

# install pytorch according to instructions
# https://pytorch.org/get-started/
# conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

# install requirements
pip install -r requirements.txt

# IMPORTANT: modify the project path in configs/cfg.yaml
# root_dir: ''

Note: Before installing the requirements with pip install -r requirements.txt, please ensure that you have installed a PyTorch version that matches your hardware and CUDA setup (see the PyTorch installation instructions linked above).
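For example, if you cloned the repository into your home directory, the root_dir entry in configs/cfg.yaml could look like root_dir: '/home/<username>/dart' (a hypothetical path; adjust it to your actual clone location).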

Fine-tuning with DART

Train the model with a chosen experiment configuration from configs/experiment/. For the different datasets, please use the recommended experiment settings (tripadvisor-dart, beeradvocate-dart, and socialnews-dart).

python run.py experiment=socialnews-dart gpu=1

Train the model with the default configuration:

# train on 1 GPU
python run.py gpu=1

# train with DDP (Distributed Data Parallel) (3 GPUs)
python run.py gpu=[0,1,2]

Warning: There are currently known problems with DDP mode; read this issue to learn more.

You can override any parameter from the command line like this:

python run.py gpu=3 train.num_epochs=10 train.batch_size=32

Note: When you specify only some parameters, the remaining values fall back to the defaults in configs/cfg.yaml.
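For example, you can combine an experiment configuration with such overrides (the values here are purely illustrative):

python run.py experiment=tripadvisor-dart gpu=1 train.batch_size=16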

Evaluate

# Specify the corresponding dataset 
# by setting DATA_NAME as trip_advisor/beer_advocate/social_news

# evaluate on 1 GPU
python eval.py gpu=1 data=DATA_NAME ckpt_path='/path/to/ckpt/name.ckpt'

# evaluate on cpu
python eval.py ckpt_path='/path/to/ckpt/name.ckpt'

Download Checkpoints

You can find fine-tuned checkpoints here.
We recommend using the following checkpoints:

Data Fine-tuned Checkpoint Size Accuracy
trip_advisor epoch=3-step=8694.ckpt 1962MB 86.36%
beer_advocate epoch=3-step=5936.ckpt 1962MB 88.13%
social_news epoch=4-step=840.ckpt 1962MB 83.81%

# Optionally, you can download them using wget or gdown as follows,
# then unzip them into the directory specified by `ckpt_path`.
pip install gdown
gdown --folder https://drive.google.com/drive/folders/1OAJw4dLMSe5ySM2QUy2lBtNPgFd74k1c

Note: If you get the error mismatched input '=' expecting <EOF>, escape the = characters in the checkpoint name with \=, or set the value of ckpt_path directly in configs/cfg.yaml. See the gdown documentation for more download options.
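For example, an evaluation call with the escaped checkpoint name (using a hypothetical path) might look like:

python eval.py gpu=1 data=trip_advisor ckpt_path='/path/to/ckpt/epoch\=3-step\=8694.ckpt'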

Use Miniconda for GPU environments

Use Miniconda for your Python environments (a full Anaconda installation is usually unnecessary; Miniconda is enough). It makes it easier to install dependencies such as cudatoolkit for GPU support, and it lets you manage your environments globally.

Example installation:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Create new conda environment:

conda create -n dart python=3.10
conda activate dart

Use torchmetrics

Use the official torchmetrics library to ensure that metrics are calculated correctly. This is especially important for multi-GPU training!

For example, instead of calculating accuracy yourself, you should use the provided Accuracy class like this:

from pytorch_lightning import LightningModule
from torchmetrics import Accuracy

class ModelModule(LightningModule):
    def __init__(self):
        super().__init__()
        # Note: torchmetrics >= 0.11 requires a task argument,
        # e.g. Accuracy(task="multiclass", num_classes=num_classes).
        self.train_acc = Accuracy()
        self.val_acc = Accuracy()

    def training_step(self, batch, batch_idx):
        ...
        acc = self.train_acc(predictions, targets)
        self.log("train/acc", acc)
        ...

    def validation_step(self, batch, batch_idx):
        ...
        acc = self.val_acc(predictions, targets)
        self.log("val/acc", acc)
        ...

Make sure to use a different metric instance for each step (training, validation, test) to ensure proper value reduction across all GPU processes.

Torchmetrics provides metrics for most use cases, such as the F1 score or the confusion matrix. Read the documentation for more.
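For instance, here is a minimal sketch using the multiclass F1 score and confusion matrix; it assumes a recent torchmetrics version and, purely for illustration, three sentiment classes:

import torch
from torchmetrics.classification import MulticlassConfusionMatrix, MulticlassF1Score

# Hypothetical 3-class sentiment setup (e.g. negative / neutral / positive).
f1 = MulticlassF1Score(num_classes=3)            # macro-averaged F1 by default
confmat = MulticlassConfusionMatrix(num_classes=3)

preds = torch.tensor([0, 2, 1, 2])
targets = torch.tensor([0, 1, 1, 2])

print(f1(preds, targets))        # scalar tensor with the F1 score
print(confmat(preds, targets))   # 3x3 confusion matrix tensor

Inside a LightningModule, these would be created in __init__ and updated in the step methods, just like the Accuracy example above.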

Follow PyTorch Lightning style guide

The style guide is available here.

Introduction

DART Architecture

Figure 3: Overview of the DART architecture.

Overview of the model: DART overcomes the 512-token restriction by splitting a long document into sentences or chunks of fewer than 512 tokens, processing each sentence/chunk, and then aggregating the results. The proposed DART framework takes as input a document $d$ and an aspect $a_j$, and outputs the document representation $\hat{d}_j$ with respect to $a_j$. There are four key blocks in DART (see the sketch after this list):

  • Sentence Encoding Block.
  • Global Context Interaction Block.
  • Aspect Aggregation Block.
  • Sentiment Classification Block.
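The sketch below illustrates how these four blocks might compose in a forward pass. It is a minimal illustration only: the module and parameter names are hypothetical and do not mirror the actual implementation in models/.

import torch
import torch.nn as nn

class DARTSketch(nn.Module):
    """Illustrative composition of the four DART blocks (names are hypothetical)."""

    def __init__(self, sentence_encoder, hidden_dim, num_aspects, num_classes):
        super().__init__()
        # Sentence Encoding Block: e.g. a BERT-style encoder applied per sentence/chunk.
        self.sentence_encoder = sentence_encoder
        # Global Context Interaction Block: lets sentence representations interact.
        self.global_context = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Aspect Aggregation Block: attends over sentences with the aspect as query.
        self.aspect_embed = nn.Embedding(num_aspects, hidden_dim)
        self.aspect_attention = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # Sentiment Classification Block.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, sentences, aspect_id):
        # sentences: list of per-sentence/per-chunk inputs; each is encoded to [batch, hidden_dim].
        sent_reprs = torch.stack([self.sentence_encoder(s) for s in sentences], dim=1)
        ctx = self.global_context(sent_reprs)                 # [batch, num_sent, hidden_dim]
        query = self.aspect_embed(aspect_id).unsqueeze(1)     # [batch, 1, hidden_dim]
        doc_repr, _ = self.aspect_attention(query, ctx, ctx)  # aspect-specific document repr.
        return self.classifier(doc_repr.squeeze(1))           # sentiment logits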

Datasets

We run experiments on multiple datasets, including a curated dataset of long documents on social issues (SocialNews). Additionally, you can fine-tune the downloaded model on your own dataset of interest.

Dataset #aspects #docs #long docs (%) #sentences/doc #tokens/doc #tokens/sentence
TripAdvisor 7 28543 4027 (14.1%) 12.9 298.9 23.1
BeerAdvocate 4 27583 217 (0.8%) 11.1 173.5 15.7
SocialNews 6 4512 1031 (22.9%) 17.5 389.8 22.2

FAQ

What license is this library released under?

All code and models are released under the Apache 2.0 license. See the LICENSE file for more information.

I am getting out-of-memory errors, what is wrong?

All experiments in the paper were fine-tuned on GPUs with 40GB of device RAM. Therefore, when using a GPU with 12GB-16GB of RAM, you are likely to encounter out-of-memory issues if you use the same hyperparameters described in the paper. Additionally, different models require different amounts of memory. Available memory also depends on the accelerator configuration (both type and count).

The factors that affect memory usage are:

  • data.max_num_seq: You can fine-tune with a shorter max sequence length to save substantial memory.

  • train.batch_size: Memory usage is also directly proportional to the batch size. You could decrease it (e.g. train.batch_size=8) and decrease train.lr accordingly if you encounter an out-of-memory error; see the example command after this list.

  • model.backbone, base vs. large: The large model requires significantly more memory than base.
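For example, a lower-memory run might combine these overrides (the values are illustrative, not the settings from the paper):

python run.py gpu=1 train.batch_size=8 train.lr=1e-5 data.max_num_seq=16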

Citation

If you find this work useful, please cite it as follows:

@inproceedings{yan-etal-2024-modeling,
    title = "Modeling Complex Interactions in Long Documents for Aspect-Based Sentiment Analysis",
    author = "Yan, Zehong  and
      Hsu, Wynne  and
      Lee, Mong-Li  and
      Bartram-Shaw, David",
    booktitle = "Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, {\&} Social Media Analysis",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wassa-1.3",
    pages = "23--34",
}

If we submit the paper to a conference or journal, we will update the BibTeX.

Contact information

For help or issues using DART, please submit a GitHub issue.

For personal communication related to DART, please contact Yan Zehong.
