This repository contains our submission (and the resulting short paper) to the MEDIQA-Chat Shared Task @ ACL-ClinicalNLP 2023.
Requires Python >= 3.8. First, create and activate a virtual environment, then install the requirements:

```bash
pip install -r requirements.txt
```
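If you are unsure how to create and activate the virtual environment, the following is a minimal sketch using Python's built-in `venv` module (the `.venv` directory name is an arbitrary choice, not something the repository prescribes):

```bash
# Create a virtual environment in ./.venv (directory name is arbitrary)
python3 -m venv .venv
# Activate it (Linux/macOS)
source .venv/bin/activate
```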
Note: For setup on a cluster managed by the Alliance, please see `./scripts/slurm/setup_on_arc.sh`.
Models can be fine-tuned on the shared task data using the `run_summarization.py` script, which is adapted from the HuggingFace `run_summarization.py` script. To see all available options, run:

```bash
python ./scripts/run_summarization.py --help
```
Arguments can be modified in the config files or passed as command-line arguments. Valid arguments include anything from the HuggingFace `TrainingArguments` or `Seq2SeqTrainingArguments` classes, as well as the arguments defined in the script itself. At a minimum, you must provide paths to the dataset partitions with `train_file`, `validation_file` and, optionally, `test_file`, as in the sketch below.
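For example, the dataset paths could be supplied on the command line like this (the file paths below are placeholders for wherever you have stored the shared task data, not paths shipped with the repository):

```bash
# Placeholder paths; point these at your local copies of the shared task data
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    train_file="./data/TaskA-TrainingSet.csv" \
    validation_file="./data/TaskA-ValidationSet.csv" \
    test_file="./data/TaskA-TestSet.csv"
```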
To train the model, run one of the following:

```bash
# Task A (train)
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA"

# Task B (train)
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskB.yml" \
    output_dir="./output/taskB"
```
Note:

- `base.yml` contains good default arguments that should be used for all experiments.
- `taskA.yml`/`taskB.yml` contain arguments specific to Task A/B.
- Arguments passed on the command line override those in the config files, as shown in the sketch after this list.
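To illustrate the override behaviour, any value set in the config files can be replaced from the command line. The hyperparameter values below are arbitrary examples rather than the settings used in our experiments; `num_train_epochs` and `learning_rate` are standard HuggingFace `Seq2SeqTrainingArguments`:

```bash
# Override config-file values from the command line (values here are illustrative only)
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA" \
    num_train_epochs=5 \
    learning_rate=3e-5
```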
To evaluate a trained model on the validation set, run one of the following:

```bash
# Task A
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA/fine_tune" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=True

# Task B
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskB.yml" \
    output_dir="./output/taskB/fine_tune" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=True
```
To make predictions with a trained model on the test set, see the Submission section below.
By default, the model is evaluated with ROUGE, BERTScore and BLEURT. You can change the underlying models for BERTScore and BLEURT with the `bertscore_model_type` and `bleurt_checkpoint` arguments. We chose reasonable defaults, which balance model size and evaluation time against automatic metric performance. For more information on the available models and their metric performance, see the BERTScore and BLEURT documentation.
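For instance, the metric models could be swapped out on the command line. The specific values below (`microsoft/deberta-large-mnli` for BERTScore and `BLEURT-20` for BLEURT) are commonly used checkpoints given purely as an illustration; check the script's `--help` output and the metrics' documentation for the exact strings accepted:

```bash
# Illustrative metric-model overrides; verify accepted values with --help
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA/fine_tune" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=True \
    bertscore_model_type="microsoft/deberta-large-mnli" \
    bleurt_checkpoint="BLEURT-20"
```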
Results will be automatically logged to any integrations that are installed and supported by the HuggingFace Trainer. If `do_predict=True`, a file containing the model's predictions, formatted for submission to the challenge task, will be saved to `output_dir / "taskX_wanglab_runY.csv"`, where `X` corresponds to the script argument `task` and `Y` to the script argument `run`.
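Putting that together, a prediction run might look like the following sketch, where `task="A"` and `run="1"` would produce `taskA_wanglab_run1.csv` inside `output_dir` (whether you also need to pass `test_file` explicitly depends on your config files):

```bash
# Sketch of a prediction run; writes output_dir/taskA_wanglab_run1.csv
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA/predict" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=False \
    do_predict=True \
    task="A" \
    run="1"
```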
We also provide a SLURM submission script for ARC clusters, which can be found at `./scripts/slurm/run_summarization.sh`.
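On a SLURM-managed cluster, such a script is typically submitted with `sbatch`; whether it expects additional arguments is determined by the script itself, so treat the line below as a sketch rather than the exact invocation:

```bash
# Submit the fine-tuning job to the SLURM scheduler
sbatch ./scripts/slurm/run_summarization.sh
```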
To generate notes with a large language model (LLM, via LangChain), use the `run_langchain.py` script. To see all available options, run:

```bash
python ./scripts/run_langchain.py --help
```
To reproduce our best results for Task B, run the following:

```bash
# Task B
OPENAI_API_KEY="..." python scripts/run_langchain.py \
    "./MEDIQA-Chat-TestSets-March-15-2023/TaskB/taskB_testset4participants_inputConversations.csv" \
    "./output/taskB/in_context_learning" \
    --train-fp "./MEDIQA-Chat-Training-ValidationSets-Feb-10-2023/TaskB/TaskB-TrainingSet.csv" \
    --task "B" \
    --run "1"
```
You will need to provide your own `OPENAI_API_KEY`.

Note: Due to the non-deterministic nature of OpenAI's models and API, results may vary slightly from our reported results.
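If you prefer not to pass the key inline with every command, it can instead be set as an environment variable for the current shell session; the value below is a placeholder for your own key:

```bash
# Make the key available to subsequent commands in this shell; replace with your actual key
export OPENAI_API_KEY="..."
```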
All model outputs and results (as well as data from the human evaluation) reported in our paper are available in the `data/paper` directory.
To submit a run to the shared task, we used the following commands:

```bash
./scripts/submission/install.sh
./scripts/submission/activate.sh

# Then, choose one of the decode scripts, e.g.
./scripts/submission/decode_taskA_run1.sh
```

The submission scripts also demonstrate how to make predictions on the test set using a trained model.
If you use our model in your work, please consider citing our paper:
```bibtex
@inproceedings{giorgi-etal-2023-wanglab,
    title = {{W}ang{L}ab at {MEDIQA}-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models},
    author = {Giorgi, John and Toma, Augustin and Xie, Ronald and Chen, Sondra and An, Kevin and Zheng, Grace and Wang, Bo},
    year = 2023,
    month = jul,
    booktitle = {Proceedings of the 5th Clinical Natural Language Processing Workshop},
    publisher = {Association for Computational Linguistics},
    address = {Toronto, Canada},
    pages = {323--334},
    url = {https://aclanthology.org/2023.clinicalnlp-1.36},
    abstract = {This paper describes our submission to the MEDIQA-Chat 2023 shared task for automatic clinical note generation from doctor-patient conversations. We report results for two approaches: the first fine-tunes a pre-trained language model (PLM) on the shared task data, and the second uses few-shot in-context learning (ICL) with a large language model (LLM). Both achieve high performance as measured by automatic metrics (e.g. ROUGE, BERTScore) and ranked second and first, respectively, of all submissions to the shared task. Expert human scrutiny indicates that notes generated via the ICL-based approach with GPT-4 are preferred about as often as human-written notes, making it a promising path toward automated note generation from doctor-patient conversations.}
}
```