diff --git a/content/04.results.md b/content/04.results.md
index 62ae507..220349b 100644
--- a/content/04.results.md
+++ b/content/04.results.md
@@ -221,11 +221,12 @@ Although these are important future directions, neither statement accurately des
 
 ![
 **Automated assessment of preference over revised paragraphs.**
-A revision score ($y$-axis) close to 1 indicates that the LLM acting as a judge preferred the revised paragraph over the original one, while a score of -1 indicates the opposite; a zero value indicates either a tie or position bias.
+A revision score ($y$-axis) close to 1 indicates that the LLM acting as a judge preferred the revised paragraph over the original one, while a score of -1 indicates the opposite; a score close to zero indicates either a tie or position bias.
 Each point represents the average score of paragraphs from a section in one of the four manuscripts: CCC, PhenoPLIER, BioChatter and Epistasis.
 ](images/llm_judge.svg "Automated assessments using LLM-as-a-Judge"){#fig:llm_judge width="90%"}
 
 The automatic assessment of paragraphs from different sections across four manuscripts is depicted in Figure @fig:llm_judge.
+A revision score above zero indicates that the LLM acting as a judge preferred the revised paragraph over the original one on average, while a score below zero indicates the opposite.
 It can be seen that the two models used as judges, GPT-3.5 Turbo and GPT-4 Turbo, generally agreed and favored the revised paragraphs over the original ones (revision score above zero) in most cases.
 The only section where the original paragraphs were clearly preferred was the Abstract of the PhenoPLIER and Epistasis manuscripts.
 GPT-3.5 Turbo showed a preference for the original abstract of PhenoPLIER in most cases, and the model rationale (Supplementary File 5) aligns with our human assessment: the original abstract provides a more "detailed explanation" of the approaches and a "comprehensive overview of the research."