*This repository has been archived by the owner on Dec 20, 2023. It is now read-only.*


# bert-race-pytorch-lightning

This repo contains the code for ASC20-21 LE.

BERT for RACE with pytorch-lightning and transformers.

Implements DCMN (reference code) and DUMA.

This repo is for experimental purposes. To achieve the best performance on distributed systems, we ejected the code from pytorch-lightning to native PyTorch and switched the model from the huggingface implementation to Nvidia's.

For the ejected version, check out bert-race-nvidia.

## File Structure

```
.
├── data
│   ├── RACE
│   │   ├── dev
│   │   ├── test
│   │   └── train
│   ├── RACEDataModule.py
│   ├── RACEDataModuleForALBERT.py
│   └── RACELocalLoader.py
├── model
│   ├── bert-large-uncased
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   └── vocab.txt
│   ├── ALBERTForRace.py
│   ├── BertForRace.py
│   ├── BertLongAttention.py
│   ├── BertPooler.py
│   ├── CheckptEnsemble.py
│   ├── DCMNForRace.py
│   ├── DUMAForRace.py
│   ├── FuseNet.py
│   └── SSingleMatchNet.py
├── plugins
│   ├── ApexDDP.py
│   └── ApexDDPAccelerator.py
├── result
│   └── asc01
├── hp_optimize.py
├── train.py
├── predict.py
├── README.md
├── requirements.txt
└── LICENSE
```

Please put the data and the pre-trained model into `data` and `model` as shown above.
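For orientation, each RACE example pairs one article and question with four answer options, and a BERT-style multiple-choice model scores each (article, question + option) pair separately. Below is a minimal sketch of that input construction; the field names follow the RACE JSON format, but the helper name is ours and the repo's `RACEDataModule` may build its inputs differently (tokenization is omitted here):

```python
# Sketch of how one RACE example becomes four BERT input pairs,
# one per answer option. Illustrative only; RACEDataModule.py
# in this repo may differ.

def build_choice_pairs(example):
    """Return one (context, question + option) pair per answer option."""
    return [
        (example["article"], example["question"] + " " + option)
        for option in example["options"]
    ]

example = {
    "article": "Tom went to the market to buy apples.",
    "question": "Why did Tom go to the market?",
    "options": ["To buy apples.", "To sell apples.",
                "To meet a friend.", "To buy oranges."],
    "answer": "A",  # RACE labels answers with letters A-D
}

pairs = build_choice_pairs(example)
label = ord(example["answer"]) - ord("A")  # "A" -> 0
```

The model then produces one score per pair, and training treats the four scores as logits of a 4-way classification against `label`.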

## Environment

```shell
pip install -r requirements.txt
```

You need to install apex separately.
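A typical from-source apex build looks like the following (this follows Nvidia's apex README of that era; a working CUDA toolchain on your machine is assumed):

```shell
git clone https://github.com/NVIDIA/apex
cd apex
# build the C++/CUDA extensions used for fused optimizers & mixed precision
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```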

### On Cluster

```shell
scl enable devtoolset-9 bash
conda activate [env]
# then compile and install apex and other modules
```

Install horovod:

```shell
HOROVOD_NCCL_LIB=/usr/lib64/ HOROVOD_NCCL_INCLUDE=/usr/include/ \
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL \
HOROVOD_NCCL_LINK=SHARED pip install --no-cache-dir horovod
```

## Features

- PyTorch Lightning
- Transformers
- Refactor RACE Dataset Loader
  - Use the huggingface `datasets` library
  - Better interface and format
  - Faster data loading (Rust-based fast tokenizers & multi-processing)
  - Cache tokenized results
  - Custom datasets (local loading)
- Mixed Precision Training (Apex)
- TensorBoard Logging
  - Change log dir
- Text Logging (should match the baseline code; overrides the original pl progress bar; will be done after ejection)
- Argparse (not that important)
- Inference & Answer Saving
- Hyper-Parameter Tuning (Optuna)
  - More parameters (will be done in ejection)
- Parallelism
  - FairScale
  - DeepSpeed (unstable)
- Distributed (will be done after ejection)
  - DDP
  - Apex DDP (given up)
  - Apex + Horovod (given up)
- Cross Validation (useless)
- Data Augmentation (useless)
- Model Tweaks
  - DCMN (bad test result: accuracy only around 60, far lower than the paper's; currently buggy and no longer maintained. If you want to use it, please check out a working commit, e.g. #1df19a5)
  - DUMA
  - Sentence Selection (bad result)
  - Sliding Window (bad result)
  - Rouge Score (small improvement on short sequences)
  - Use features from previous layers (useless)
- Model Ensemble (buggy, will be done after ejection)
- Find Best Seed (useless; there will be new datasets and a new pre-trained model on-site)
- Further Speedup of the Training Process
  - LongFormer (seems useless)
  - Nvidia BERT (will be done in ejection)
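The checkpoint-ensemble item above (cf. `CheckptEnsemble.py`) is commonly implemented as plain logit averaging across checkpoints, followed by an argmax over the four options. A minimal sketch of that idea; the function name is ours and the repo's actual implementation may differ:

```python
# Sketch of checkpoint ensembling by logit averaging
# (illustrative only; CheckptEnsemble.py may differ).

def ensemble_predict(per_model_logits):
    """Average per-option logits from several checkpoints, then argmax.

    per_model_logits: one list of 4 option scores per checkpoint,
    all for the same RACE question.
    """
    n_models = len(per_model_logits)
    n_options = len(per_model_logits[0])
    avg = [
        sum(logits[i] for logits in per_model_logits) / n_models
        for i in range(n_options)
    ]
    return max(range(n_options), key=lambda i: avg[i])

# Two checkpoints disagree (option 0 vs. option 1);
# the averaged logits favour option 1.
print(ensemble_predict([[2.0, 1.9, 0.1, 0.0],
                        [1.0, 2.5, 0.2, 0.1]]))  # -> 1
```

Averaging logits (rather than hard votes) keeps each checkpoint's confidence information and breaks ties smoothly, which is why it is the usual default for small ensembles.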