This repo implements the Conformal Alignment procedure for the tasks of question answering and radiology report generation. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. It is guaranteed that, on average, a prescribed fraction of the selected units indeed meets the alignment criterion, regardless of the foundation model or the data distribution.
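For intuition, below is a minimal sketch of the selection step. It assumes an alignment predictor has already been trained on part of the reference data; `cal_scores` and `cal_aligned` (hypothetical names) hold its predicted scores and ground-truth alignment labels on a held-out calibration split, and the p-values follow the standard conformal-selection construction. This is an illustration, not the repo's exact code.

```python
import numpy as np

def conformal_select(cal_scores, cal_aligned, test_scores, alpha=0.1):
    """Select test units whose outputs can be certified as aligned.

    cal_scores:  predicted alignment scores on calibration units
    cal_aligned: ground-truth alignment labels (0/1) on calibration units
    test_scores: predicted alignment scores on new, unlabeled units
    alpha:       target FDR level
    """
    # Conformal p-value for each test unit: how many *misaligned*
    # calibration units score at least as high as it does.
    null_scores = cal_scores[cal_aligned == 0]
    pvals = np.array([
        (1 + np.sum(null_scores >= s)) / (len(cal_scores) + 1)
        for s in test_scores
    ])

    # Benjamini-Hochberg over the p-values gives the data-dependent
    # selection threshold with FDR control.
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.where(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    selected = np.zeros(m, dtype=bool)
    if passed.size > 0:
        selected[order[:passed[-1] + 1]] = True
    return selected
```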
Answer generation for question answering and the calculation of confidence/uncertainty scores follow the implementation in https://github.com/zlin7/UQ-NLG.
This repo supports the TriviaQA and CoQA datasets, which are prepared in `qa/pipeline/generate.py`. The two large language models (LLMs) used in the paper are OPT-13B and LLaMA-2-13B-chat. You need to specify the LLM and dataset in use, e.g. `model='llama-2-13b-chat-hf'` and `dataset='triviaqa'`.
Use the following command to generate answers in batches (`idx` is the index of each batch; `$SGE_TASK_ID` is the array-task index supplied by the SGE scheduler and can be replaced by any integer batch index when running locally).
```bash
python3 -m pipeline.generate --model $model --dataset $dataset --batch_size 20 --idx $SGE_TASK_ID
```
After the generation step, use the following command to obtain self-evaluation scores and uncertainty/confidence scores.
```bash
python3 -m dataeval.load_run --batch_size $bsize --data $data --model $model --idx $SGE_TASK_ID
```
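The exact scores follow the UQ-NLG implementation. As a rough, hedged illustration of one common entropy-based confidence measure, the sketch below estimates predictive entropy from the log-likelihoods of several sampled answers to the same question; `seq_logprobs` and `lengths` are hypothetical inputs, not names from this repo.

```python
import numpy as np

def predictive_entropy(seq_logprobs, lengths, length_normalize=True):
    """Monte-Carlo estimate of predictive entropy from sampled answers.

    seq_logprobs: total log-probability of each sampled answer
    lengths:      token count of each sampled answer
    """
    ll = np.asarray(seq_logprobs, dtype=float)
    if length_normalize:
        # Length-normalize so that long answers are not penalized.
        ll = ll / np.asarray(lengths, dtype=float)
    # Entropy estimate: negative average log-likelihood over samples;
    # higher values indicate a less confident (more uncertain) model.
    return -ll.mean()
```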
The script `_fdr.py` implements the Conformal Alignment procedure and calculates FDR and power.
```bash
python3 -m _fdr --data "triviaqa" --model "llama-2-13b-chat-hf" --N 2000 --split_pr 0.5 --split_pr_tune 0.2
```
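For reference, the two metrics have their usual meanings here. A minimal sketch, assuming boolean arrays `selected` (units chosen by the procedure) and `aligned` (their ground-truth alignment status), both hypothetical names:

```python
import numpy as np

def fdr_and_power(selected, aligned):
    """FDR and power of one selection run.

    selected: boolean mask of units chosen by Conformal Alignment
    aligned:  boolean mask of units that truly meet the alignment criterion
    """
    selected = np.asarray(selected, dtype=bool)
    aligned = np.asarray(aligned, dtype=bool)
    # FDR: fraction of selected units that are actually misaligned.
    fdr = (selected & ~aligned).sum() / max(selected.sum(), 1)
    # Power: fraction of truly aligned units that get selected.
    power = (selected & aligned).sum() / max(aligned.sum(), 1)
    return fdr, power
```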
The notebook `notebook/run_qa.ipynb` reproduces the figures and examples in the paper.
CXR image preprocessing and vision-language model fine-tuning in the notebook `cxr/vlm_finetune.ipynb` follow the implementation in conformal language modeling, in which a Vision Transformer (ViT) pretrained on ImageNet-21k serves as the image encoder and GPT2 as the text decoder.
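This encoder-decoder pairing can be assembled with the Hugging Face `transformers` API. The sketch below is illustrative only: the checkpoint names are common public ones consistent with the description above (a ViT pretrained on ImageNet-21k and GPT2), not necessarily the exact checkpoints used for fine-tuning.

```python
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    GPT2TokenizerFast,
)

# Pair a ViT image encoder with a GPT2 text decoder; transformers adds
# randomly initialized cross-attention layers to the decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT pretrained on ImageNet-21k
    "gpt2",                               # GPT2 text decoder
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT2 has no pad token by default; reuse EOS so batched training works.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```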
The MIMIC-CXR dataset requires credentialed access; see the PhysioNet project page.
After specifying the fine-tuned model (`model='trained'`) and dataset (`data='cxr'`), use the following commands to generate and concatenate outputs (`bnum` is the number of batches and `bsize` is the batch size).
```bash
python3 -m pipeline.generate --idx $SGE_TASK_ID --batch_size $bsize
python3 -m pipeline.generate_encode --num_batch $bnum
```
After the generation step, use the following command to obtain self-evaluation scores and uncertainty/confidence scores.
```bash
python3 -m dataeval.load_run --idx $SGE_TASK_ID --batch_size $bsize --data $data --model $model
```
The script `_fdr.py` implements the Conformal Alignment procedure and calculates FDR and power.
```bash
python3 -m _fdr --data "cxr" --model "trained" --N 2000 --split_pr 0.5 --split_pr_tune 0.2
```
The notebook `notebook/run_cxr.ipynb` presents examples of report generation using the fine-tuned model and also reproduces the figures and examples in the paper.