
Code for "On Losses for Modern Language Models"
ACL Anthology, arXiv

This repository is primarily for reproducibility and posterity. It is not maintained.

Thanks to NVIDIA's Megatron-LM and NYU's jiant repositories, whose code forms the base of this repo.

Setup

Only tested on Python 3.6.

python -m pip install virtualenv
virtualenv bert_env
source bert_env/bin/activate
pip install -r requirements.txt

Usage

The code enables pre-training a transformer (size specified in bert_config.json) using any combination of the following tasks (aka modes/losses): "mlm", "nsp", "psp", "sd", "so", "rg", "fs", "tc", "sc", "sbo", "wlen", "cap", "tf", "tf_idf", or "tgs". See the paper for details on each mode. NOTE: PSP (previous sentence prediction) is equivalent to ASP (adjacent sentence prediction) in the paper, and RG (referential game) is equivalent to QT (the quick thoughts variant) in the paper.

They can be combined using any of the following methods (an example invocation follows the lists below):

  • Summing all losses (the default; a small subset of task combinations is incompatible, see the paper for details)
  • Continuous Multi-Task Learning, based on ERNIE 2.0 (--continual-learning True)
  • Alternating between losses (--alternating True)

With the following modifiers:

  • Always using the MLM loss (--always-mlm True, the default and highly recommended; see the paper for details)
  • Incrementally adding tasks each epoch (--incremental)
  • Using the data formatting for all tasks but zeroing out the losses of auxiliary tasks (--no-aux True; not recommended, used for testing)
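
For example, a hypothetical run combining MLM with sentence ordering and tf_idf prediction while alternating between losses might look like this (the model name and the comma-separated form of the --modes value are assumptions; the script itself is introduced under Pre-training below):

bash olfmlm/scripts/pretrain_bert.sh --model-type mlm_so_tfidf --modes mlm,so,tf_idf --alternating True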

Set the paths to read data from and save/load checkpoints to in paths.py.

To create the datasets, see data_utils/make_dataset.py.

For tf_idf prediction, you first need to calculate the idf scores for your dataset; see idf.py for a script that does this.
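
As a rough sketch, the data-preparation sequence might look like the following, assuming both scripts run as modules and take their input/output locations from paths.py (their exact CLIs are not documented here, so check the scripts themselves):

python3 -m olfmlm.data_utils.make_dataset   # build the pre-training dataset
python3 -m olfmlm.idf                       # compute idf scores (only needed for the tf_idf mode)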

Pre-training

To run pre-training:

bash olfmlm/scripts/pretrain_bert.sh --model-type [model type]

where [model type] is the name of the model you want to train. If the model type is one of the modes, training uses MLM plus that mode (if it is mlm, training uses just MLM). The --modes argument overrides this default behaviour. If the model type is not one of the modes, the --modes argument is required.
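
For instance (my_combo is a hypothetical model name, and the comma-separated --modes value is an assumption):

bash olfmlm/scripts/pretrain_bert.sh --model-type so                              # trains with mlm + so
bash olfmlm/scripts/pretrain_bert.sh --model-type my_combo --modes mlm,so,tf_idf  # custom name, so --modes is required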

Distributed Pretraining

Use pretrain_bert_distributed.sh instead:

bash olfmlm/scripts/pretrain_bert_distributed.sh --model-type [model type]

Evaluation

To run evaluation: first convert the saved state dict of the required model using convert_state_dict.py. Then run:

python3 -m olfmlm.evaluate.main --exp_name [experiment name]

where [experiment name] is the same as the model type above. To evaluate a saved checkpoint instead of the best model, use the --checkpoint argument.
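
Putting it together, a hedged sketch of the evaluation flow (convert_state_dict.py's arguments are not documented here, so the argument-free invocation below is an assumption; check the script for its actual CLI):

python3 -m olfmlm.convert_state_dict            # convert the saved state dict first
python3 -m olfmlm.evaluate.main --exp_name so   # --exp_name matches the model type used in pre-training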

Citation

If you find this code useful, please cite the paper:

@inproceedings{aroca-ouellette-rudzicz-2020-losses,
    title = "{O}n {L}osses for {M}odern {L}anguage {M}odels",
    author = "Aroca-Ouellette, St{\'e}phane  and
      Rudzicz, Frank",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.403",
    doi = "10.18653/v1/2020.emnlp-main.403",
    pages = "4970--4981",
}
