left-align captions + shrink corr plot
davevanveen committed Sep 15, 2023
1 parent f3e0ac7 commit 46384a8
Showing 1 changed file with 8 additions and 18 deletions.
26 changes: 8 additions & 18 deletions index.html
@@ -330,7 +330,7 @@
 <table align=center width=800px>
 <tr>
 <td align=left width=800px>
-x2 Model win rate: a head-to-head winning percentage of each model combination, where red/blue intensities highlight the degree to which models on the vertical axis outperform models on the horizontal axis. GPT-4 generally achieves the best performance. While FLAN-T5 is more competitive for syntactic metrics such as BLEU, we note this model is constrained to shorter context lengths (see Table 1). When aggregated across datasets, seq2seq models (FLAN-T5, FLAN-UL2) outperform open-source autoregressive models (Llama-2, Vicuna) on all metrics.
+Model win rate: a head-to-head winning percentage of each model combination, where red/blue intensities highlight the degree to which models on the vertical axis outperform models on the horizontal axis. GPT-4 generally achieves the best performance. While FLAN-T5 is more competitive for syntactic metrics such as BLEU, we note this model is constrained to shorter context lengths (see Table 1). When aggregated across datasets, seq2seq models (FLAN-T5, FLAN-UL2) outperform open-source autoregressive models (Llama-2, Vicuna) on all metrics.
 </td>
 </tr>
 </table>
@@ -352,10 +352,8 @@

 <table align=center width=800px>
 <tr>
-<td align=center width=800px>
-<center>
+<td align=left width=800px>
 Quantitative metric (MEDCON) scores vs. number of in-context examples across models and datasets. We also include the best model fine-tuned with QLoRA (FLAN-T5) as a horizontal dashed line for valid datasets. Note the allowable number of in-context examples varies significantly by model context length and dataset size. See the paper for more details and results across other metrics (BLEU, ROUGE-L, BERTScore).
-</center>
 </td>
 </tr>
 </table>
@@ -377,11 +375,9 @@

 <table align=center width=800px>
 <tr>
-<td align=center width=800px>
-<center>
+<td align=left width=800px>
 Clinical reader study. Top: Study design comparing the summarization of GPT-4 vs. that of human experts on three attributes: completeness, correctness, and conciseness. Bottom: Results. GPT-4 summaries are rated higher than human summaries on completeness for all three summarization tasks and on correctness overall. Radiology reports highlight a trade-off between correctness (better) and conciseness (worse) with GPT-4. Highlight colors correspond to a value’s location on the color spectrum. Asterisks denote statistical significance by Wilcoxon signed-rank test.

-</center>
 </td>
 </tr>
 </table>
@@ -404,10 +400,8 @@

 <table align=center width=800px>
 <tr>
-<td align=center width=800px>
-<center>
+<td align=left width=800px>
 Distribution of reader scores for each summarization task across evaluated attributes (completeness, correctness, conciseness). Horizontal axes denote reader preference between GPT-4 and human summaries as measured by a five-point Likert scale. Vertical axes denote frequency count, with 900 total reports for each plot. GPT-4 summaries are more often preferred in terms of correctness and completeness. While the largest gain in correctness occurs on radiology reports, this introduces a trade-off with conciseness. See Figure 6 for overall scores.
-</center>
 </td>
 </tr>
 </table>
@@ -430,10 +424,8 @@

 <table align=center width=800px>
 <tr>
-<td align=center width=800px>
-<center>
+<td align=left width=800px>
 Annotation of two radiologist report examples from the reader study. In the top example, GPT-4 performs better due to a laterality mistake by the human expert. In the bottom example, GPT-4 exhibits a lack of conciseness. The table (lower left) contains reader scores for these two examples and the task average across all samples.
-</center>
 </td>
 </tr>
 </table>
@@ -444,9 +436,9 @@


 <br>
-<table align=center width=400px>
+<table align=center width=200px>
 <tr>
-<td align=center width=400px >
+<td align=center width=200px >
 <center >
 <td class="img-magnifier-container"><img id="myimage" style="width:800px" src="resources/corr_plot.png"/></td>
 </center>
@@ -456,10 +448,8 @@

 <table align=center width=800px>
 <tr>
-<td align=center width=800px>
-<center>
+<td align=left width=800px>
 Spearman correlation coefficients between NLP metrics and reader preference assessing completeness, correctness, and conciseness. The semantic metric (BERTScore) and conceptual metric (MEDCON) correlate most highly with correctness. Meanwhile, syntactic metrics BLEU and ROUGE-L correlate most with completeness. Section 5.3 contains further description and discussion.
-</center>
 </td>
 </tr>
 </table>
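As context for the correlation analysis described in the final caption above, the following is a minimal, hypothetical sketch of how Spearman coefficients between automated metric scores and reader scores could be computed with SciPy. The variable names and data below are illustrative placeholders, not the paper's actual evaluation pipeline.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-summary scores: an automated metric (e.g., BERTScore)
# and a reader-assigned attribute score (e.g., correctness on a 5-point scale).
rng = np.random.default_rng(0)
metric_scores = rng.random(100)
reader_scores = rng.integers(-2, 3, size=100)

# Spearman rank correlation between the metric and reader preference.
rho, p_value = spearmanr(metric_scores, reader_scores)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")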
