ContraCLM: Contrastive Learning for Causal Language Model

This repository contains code for the ACL 2023 paper, ContraCLM: Contrastive Learning for Causal Language Model.

Work done by: Nihal Jain*, Dejiao Zhang*, Wasi Uddin Ahmad*, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang. (* indicates equal contribution).

Updates

  • [07-08-2023] Initial release of the code.

Overview

We present ContraCLM, a novel contrastive learning framework which operates at both the token-level and sequence-level. ContraCLM enhances the discrimination of representations from a decoder-only language model and bridges the gap with encoder-only models, making causal language models better suited for tasks beyond language generation. We encourage you to check out our paper for more details.
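
For intuition, the sketch below shows a minimal sequence-level contrastive (InfoNCE-style) loss of the kind ContraCLM builds on: two dropout-induced views of each sequence act as positives, and the other sequences in the batch act as negatives. The pooling, temperature, and symmetrization details here are illustrative assumptions and may differ from the objectives implemented in this repository; see the paper for the exact formulation.

import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, temperature=0.05):
    """InfoNCE-style loss over pooled sequence representations.

    z1, z2: (batch, hidden) representations of the same sequences obtained
    from two forward passes (e.g., with different dropout masks).
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature                   # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)               # positives are on the diagonal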

Setup

The setup involves installing the necessary dependencies in an environment and placing the datasets in the requisite directory.

Environment

Run these commands to create a new conda environment and install the required packages for this repository.

# create a new conda environment with python >= 3.8
conda create -n contraclm python=3.8.12

# install dependencies within the environment
conda activate contraclm
pip install -r requirements.txt
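
As an optional sanity check after installation (assuming the pinned requirements include PyTorch and Hugging Face Transformers, which the training code relies on), the following snippet verifies that the core libraries import correctly:

# sanity_check.py -- hypothetical helper script, not part of the repository
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)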

Datasets & Preprocessing

See here.

Pretraining

In this section, we show how to use this repository to pretrain (i) GPT2 on Natural Language (NL) data, and (ii) CodeGen-350M-Mono on Programming Language (PL) data.

Common Instructions

  1. This section assumes that you have the train and validation data stored at TRAIN_DIR and VALID_DIR respectively, and are within an environment with all the above dependencies installed (see Setup).

  2. You can get an overview of all the flags associated with pretraining by running:

python pl_trainer.py --help

Pretrain GPT2 on NL Data

Usage

bash runscripts/run_wikitext.sh

  1. For quick testing and debugging, we suggest running the code with the MLE loss only by setting CL_Config=$(eval echo ${options[1]}) within the script (a minimal sketch of this objective follows this list).
  2. All other options involve the CL loss at either the token level or the sequence level.
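
For reference, the MLE-only configuration corresponds to the standard causal language modeling objective. Below is a minimal, self-contained sketch using Hugging Face Transformers; the model name and example text are placeholders, and the actual training loop lives in pl_trainer.py:

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

batch = tokenizer(["Contrastive learning for causal language models."],
                  return_tensors="pt")
# Passing labels=input_ids makes the model return the shifted
# next-token cross-entropy (MLE) loss.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
print("MLE loss:", outputs.loss.item())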

Pretrain CodeGen-350M-Mono on PL Data

Usage

  1. Configure the variables at the top of runscripts/run_code.sh. Most options are self-explanatory; only the dropout-related options are described here:

    • dropout_p: The dropout probability used in torch.nn.Dropout

    • dropout_layers: If > 0, dropout is enabled in the last dropout_layers layers of the model with probability dropout_p

    • functional_dropout: If specified, a functional dropout layer is applied on top of the token representations output by the CodeGen model (see the sketch after this list)

  2. Set the variable CL according to the desired model configuration. Make sure the paths TRAIN_DIR and VALID_DIR are set correctly.

  3. Run the command: bash runscripts/run_code.sh
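
To make the dropout options above concrete, here is a rough sketch of what functional dropout over token representations can look like. This is an illustrative assumption rather than the repository's exact code: two stochastic views of the same hidden states are produced with torch.nn.functional.dropout, and matching positions across the views can serve as positive pairs for the contrastive objective.

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModel.from_pretrained("Salesforce/codegen-350M-mono")

batch = tokenizer(["def add(a, b):\n    return a + b"], return_tensors="pt")
hidden = model(**batch).last_hidden_state            # (1, seq_len, hidden_dim)

dropout_p = 0.1                                      # mirrors dropout_p in run_code.sh
view1 = F.dropout(hidden, p=dropout_p, training=True)
view2 = F.dropout(hidden, p=dropout_p, training=True)
# view1 and view2 are two noisy views of the same token representations;
# matching positions across the views act as positives for token-level CL.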

Evaluation

See the relevant task-specific directories here.

Citation

If you use our code in your research, please cite our work as:

@inproceedings{jain-etal-2023-contraclm,
    title = "{C}ontra{CLM}: Contrastive Learning For Causal Language Model",
    author = "Jain, Nihal  and
      Zhang, Dejiao  and
      Ahmad, Wasi Uddin  and
      Wang, Zijian  and
      Nan, Feng  and
      Li, Xiaopeng  and
      Tan, Ming  and
      Nallapati, Ramesh  and
      Ray, Baishakhi  and
      Bhatia, Parminder  and
      Ma, Xiaofei  and
      Xiang, Bing",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.355",
    pages = "6436--6459"
}

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.