This repository contains our submission (and the resulting short paper) to the MEDIQA-Chat Shared Task @ ACL-ClinicalNLP 2023.
Requires Python >= 3.8. First, create and activate a virtual environment, then install the requirements:

```bash
pip install -r requirements.txt
```
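If you are unsure how to create and activate the virtual environment, the following is a minimal sketch using Python's built-in `venv` module (the `.venv` directory name is an arbitrary choice, not something the repository prescribes):

```bash
# Create a virtual environment in ./.venv (directory name is arbitrary)
python3 -m venv .venv
# Activate it (Linux/macOS)
source .venv/bin/activate
```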
Note: For setup on a cluster managed by the Alliance, please see `./scripts/slurm/setup_on_arc.sh`.
Models can be fine-tuned on the shared task data using the `run_summarization.py` script, which is adapted from the HuggingFace `run_summarization.py` script. To see all available options, run:

```bash
python ./scripts/run_summarization.py --help
```
Arguments can be modified in the config files or passed as command-line arguments. Valid arguments include anything from the HuggingFace `TrainingArguments` or `Seq2SeqTrainingArguments` classes, as well as the arguments defined in the script itself. At a minimum, you must provide paths to the dataset partitions with `train_file`, `validation_file` and, optionally, `test_file`, as in the sketch below.
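For example, the dataset paths could be supplied on the command line like this (the file paths below are placeholders for wherever you have stored the shared task data, not paths shipped with the repository):

```bash
# Placeholder paths; point these at your local copies of the shared task data
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    train_file="./data/TaskA-TrainingSet.csv" \
    validation_file="./data/TaskA-ValidationSet.csv" \
    test_file="./data/TaskA-TestSet.csv"
```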
To train the model, run one of the following:

```bash
# Task A (train)
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA"

# Task B (train)
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskB.yml" \
    output_dir="./output/taskB"
```
Note:

- `base.yml` contains good default arguments that should be used for all experiments.
- `taskA.yml`/`taskB.yml` contain arguments specific to Task A/B.
- Arguments passed on the command line override those in the config files, as shown in the sketch after this list.
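To illustrate the override behaviour, any value set in the config files can be replaced from the command line. The hyperparameter values below are arbitrary examples rather than the settings used in our experiments; `num_train_epochs` and `learning_rate` are standard HuggingFace `Seq2SeqTrainingArguments`:

```bash
# Override config-file values from the command line (values here are illustrative only)
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA" \
    num_train_epochs=5 \
    learning_rate=3e-5
```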
To evaluate a trained model on the validation set, run one of the following:

```bash
# Task A
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA/fine_tune" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=True

# Task B
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskB.yml" \
    output_dir="./output/taskB/fine_tune" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=True
```
To make predictions with a trained model on the test set, see the Submission section below.
By default, the model is evaluated with ROUGE, BERTScore and BLEURT. You can change the underlying models for BERTScore and BLEURT with the `bertscore_model_type` and `bleurt_checkpoint` arguments. We chose reasonable defaults, which balance model size and evaluation time against automatic metric performance. For more information on the available models and their metric performance, see the BERTScore and BLEURT documentation.
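For instance, the metric models could be swapped out on the command line. The specific values below (`microsoft/deberta-large-mnli` for BERTScore and `BLEURT-20` for BLEURT) are commonly used checkpoints given purely as an illustration; check the script's `--help` output and the metrics' documentation for the exact strings accepted:

```bash
# Illustrative metric-model overrides; verify accepted values with --help
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA/fine_tune" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=True \
    bertscore_model_type="microsoft/deberta-large-mnli" \
    bleurt_checkpoint="BLEURT-20"
```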
Results will be automatically logged to any integrations that are installed and supported by the HuggingFace Trainer. If `do_predict=True`, a file containing the model's predictions, formatted for submission to the challenge task, will be saved to `output_dir / "taskX_wanglab_runY.csv"`, where `X` corresponds to the script argument `task` and `Y` to the script argument `run`.
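Putting that together, a prediction run might look like the following sketch, where `task="A"` and `run="1"` would produce `taskA_wanglab_run1.csv` inside `output_dir` (whether you also need to pass `test_file` explicitly depends on your config files):

```bash
# Sketch of a prediction run; writes output_dir/taskA_wanglab_run1.csv
python ./scripts/run_summarization.py "./conf/base.yml" "./conf/taskA.yml" \
    output_dir="./output/taskA/predict" \
    model_name_or_path="./path/to/model/checkpoint" \
    do_train=False \
    do_eval=False \
    do_predict=True \
    task="A" \
    run="1"
```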
We also provide a SLURM submission script for ARC clusters, which can be found at `./scripts/slurm/run_summarization.sh`.
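On a SLURM-managed cluster, such a script is typically submitted with `sbatch`; whether it expects additional arguments is determined by the script itself, so treat the line below as a sketch rather than the exact invocation:

```bash
# Submit the fine-tuning job to the SLURM scheduler
sbatch ./scripts/slurm/run_summarization.sh
```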
To generate notes with a large language model (LLM, via LangChain), use the `run_langchain.py` script. To see all available options, run:

```bash
python ./scripts/run_langchain.py --help
```
To reproduce our best results for Task B, run the following:

```bash
# Task B
OPENAI_API_KEY="..." python scripts/run_langchain.py \
    "./MEDIQA-Chat-TestSets-March-15-2023/TaskB/taskB_testset4participants_inputConversations.csv" \
    "./output/taskB/in_context_learning" \
    --train-fp "./MEDIQA-Chat-Training-ValidationSets-Feb-10-2023/TaskB/TaskB-TrainingSet.csv" \
    --task "B" \
    --run "1"
```
You will need to provide your own `OPENAI_API_KEY`.

Note: Due to the non-deterministic nature of OpenAI's models and API, results may vary slightly from our reported results.
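If you prefer not to pass the key inline with every command, it can instead be set as an environment variable for the current shell session; the value below is a placeholder for your own key:

```bash
# Make the key available to subsequent commands in this shell; replace with your actual key
export OPENAI_API_KEY="..."
```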
All model outputs and results (as well as data from the human evaluation) reported in our paper are available in the `data/paper` directory.
To submit a run to the shared task, we used the following commands:

```bash
./scripts/submission/install.sh
./scripts/submission/activate.sh

# Then, choose one of the decode scripts, e.g.
./scripts/submission/decode_taskA_run1.sh
```

The submission scripts also demonstrate how to make predictions on the test set using a trained model.
If you use our model in your work, please consider citing our paper:
```bibtex
@inproceedings{giorgi-etal-2023-wanglab,
    title = {{W}ang{L}ab at {MEDIQA}-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models},
    author = {Giorgi, John and Toma, Augustin and Xie, Ronald and Chen, Sondra and An, Kevin and Zheng, Grace and Wang, Bo},
    year = 2023,
    month = jul,
    booktitle = {Proceedings of the 5th Clinical Natural Language Processing Workshop},
    publisher = {Association for Computational Linguistics},
    address = {Toronto, Canada},
    pages = {323--334},
    url = {https://aclanthology.org/2023.clinicalnlp-1.36},
    abstract = {This paper describes our submission to the MEDIQA-Chat 2023 shared task for automatic clinical note generation from doctor-patient conversations. We report results for two approaches: the first fine-tunes a pre-trained language model (PLM) on the shared task data, and the second uses few-shot in-context learning (ICL) with a large language model (LLM). Both achieve high performance as measured by automatic metrics (e.g. ROUGE, BERTScore) and ranked second and first, respectively, of all submissions to the shared task. Expert human scrutiny indicates that notes generated via the ICL-based approach with GPT-4 are preferred about as often as human-written notes, making it a promising path toward automated note generation from doctor-patient conversations.}
}
```