
Auto (annotation-free) evaluation of RAG #140

Closed
wants to merge 39 commits

Conversation

@adkakne (Contributor) commented Sep 25, 2024

Description

This PR contains code for auto evaluation of RAG and is meant to serve the use case where ground-truth answers are not available for evaluating RAG. We also allow users to bring their own metrics and their own data. Evaluation is performed with the LLM-as-a-judge technique, and the user can choose among OpenAI, an HF endpoint, and local hardware. Finally, the user can see the reasoning and score the LLM generates for each metric.
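
For illustration, a minimal usage sketch; the class name, arguments, and metric names below are hypothetical placeholders and not necessarily the actual API added in this PR:

```python
# Hypothetical usage sketch; class, argument, and metric names are illustrative only.
from auto_eval import AutoEvaluate  # assumed import path

evaluator = AutoEvaluate(
    dataset="explodinggradients/ragas-wikiqa",   # data with question, context, and answer fields
    template_dir="auto_eval/prompt_templates",   # relative path to the jinja2 prompt templates
    model="gpt-4o",                              # or an HF endpoint / a local model
    evaluation_metrics=["factualness", "relevance", "correctness", "readability"],
)

results = evaluator.measure()  # one judge-LLM call per example
for record in results:
    print(record)              # per-metric reasoning and score generated by the judge
```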

Issues

n/a.

Type of change

  • New feature (non-breaking change which adds new functionality)

Dependencies

The Python 3 packages openai, datasets, python-dotenv, and jsonlines are needed.

Tests

I added a unit test, tests/test_auto_eval.py, and ensured that it passes locally in an environment built from tests/requirements.txt with Python 3.10.

Furthermore, OpenAI models gpt-4o, gpt-4, and gpt-3.5-turbo-16k were tested. Endpoint testing was performed with meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1 on Gaudi2 machines. Local testing was performed with mistralai/Mixtral-8x7B-Instruct-v0.1.

For all tests, I used the explodinggradients/ragas-wikiqa dataset.
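
For reference, a minimal sketch of loading that dataset (the available splits and column names are not spelled out here and should be checked on the Hugging Face Hub):

```python
from datasets import load_dataset

# Load the evaluation dataset used for all tests in this PR.
data = load_dataset("explodinggradients/ragas-wikiqa")
print(data)  # inspect the available splits and columns before running evaluation
```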

adkakne and others added 17 commits September 20, 2024 03:27
@adkakne (Contributor, Author) commented Sep 25, 2024

Hi @lkk12014402,

The new UT is running correctly locally:

(screenshot: auto_eval_local_run_successful)

When I pass an absolute path for template_dir, it does not work because we load the environment using dotenv, so a relative path has to be provided to jinja2.
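
For reference, a minimal sketch of loading a template via a relative path (the directory and template names below are illustrative, not necessarily the ones in this PR):

```python
from jinja2 import Environment, FileSystemLoader

# Relative path to the prompt templates; in this setup an absolute path did not
# resolve correctly, so the path is kept relative to the working directory.
env = Environment(loader=FileSystemLoader("auto_eval/prompt_templates"))
template = env.get_template("opening_prompt.jinja")  # illustrative template name
prompt = template.render(question="...", context="...", answer="...")
```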

@minmin-intel (Collaborator) commented:

@adkakne I feel that auto-eval is more like metrics, so it should be in the metrics folder like ragas. Evals in the evals/evaluation folder can use auto-eval metrics instead of ragas if they choose to do so. Each folder in evals/evaluation has a benchmark dataset to evaluate against and uses metrics from the metrics folder to measure performance.

@adkakne (Contributor, Author) commented Sep 25, 2024

> @adkakne I feel that auto-eval is more like metrics, so it should be in the metrics folder like ragas. Evals in the evals/evaluation folder can use auto-eval metrics instead of ragas if they choose to do so. Each folder in evals/evaluation has a benchmark dataset to evaluate against and uses metrics from the metrics folder to measure performance.

Hi Minmin, thank you for your comments. I moved auto_eval to the metrics folder.

@lkk12014402 (Collaborator) commented Sep 26, 2024

Hi Kakne @adkakne, what are the differences from the original ragas? I think we can also use ragas when there are no ground-truth answers?

@adkakne (Contributor, Author) commented Sep 26, 2024

> Hi Kakne @adkakne, what are the differences from the original ragas? I think we can also use ragas when there are no ground-truth answers?

Hi Kaokao,

Thank you for your review and comment. I shared a detailed response over MS Teams.

Short answer: we want to supplement ragas with an annotation-free feature, as ragas allows only 4 out of 10 metrics to run without ground truth (we are not counting the summarization score, as we don't offer it yet).

(screenshot: ragas_required_columns)

Secondly, the ragas prompting technique does not work with every open-source LLM (I shared more details offline). AutoEval uses a different prompting technique, similar to the one used in the C-RAG paper by Meta.

Lastly, auto_eval makes one LLM call per example, while ragas makes num_metrics calls per example. So the user can save on cost with proprietary LLMs, or improve efficiency when using an endpoint on Gaudi or local compute.

Please let me know if you have any more feedback. Thank you.
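
To illustrate the single-call design, here is a minimal sketch (illustrative only; the prompt wording, metric names, and JSON parsing are assumptions, not the PR's actual code):

```python
# Illustrative sketch of the one-judge-call-per-example idea; prompt text,
# metric names, and JSON parsing are assumptions, not the PR's actual code.
import json
from openai import OpenAI

METRICS = ["factualness", "relevance", "correctness", "readability"]  # example metric names

def judge_example(question: str, context: str, answer: str) -> dict:
    """Score all metrics for one example with a single judge-LLM call."""
    prompt = (
        "You are a RAG evaluator. For the question, retrieved context, and answer below, "
        f"return JSON with a short reasoning and a 1-5 score for each of: {', '.join(METRICS)}.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # One call returns reasoning and scores for every metric at once.
    return json.loads(resp.choices[0].message.content)
```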

@lkk12014402 (Collaborator) commented:


Thanks for your explanations. It looks good to me. Thanks~

@lkk12014402 (Collaborator) commented:

The metric name auto_eval doesn't make sense as a metric name the way ragas does, so let us think about a proper name.

adkakne and others added 14 commits October 10, 2024 19:48
* Optimize path and link validity check.
* added output_folders
* updated subprocess run for output folders
* get report done
* fixed the output_folder issues
* add the func of run_benchmark
* add return
* Support sharegpt dataset in chatqna e2e test
* Change the log level for selected questions
@adkakne (Contributor, Author) commented Oct 11, 2024

Closing this PR in order to limit the number of commits; I have opened a fresh PR.

@adkakne closed this Oct 11, 2024