
Auto (annotation-free) evaluation of RAG #140

Closed
wants to merge 39 commits

Conversation

@adkakne (Contributor) commented Sep 25, 2024

Description

This PR contains code for auto evaluation of RAG and is meant to serve the use case where ground-truth answers are not available for evaluating RAG. We also allow users to bring their own metrics and their own data. Evaluation is performed with the LLM-as-a-judge technique, and the user can choose among OpenAI, an HF endpoint, and local hardware. Finally, the user can see the reasoning and score the LLM generates for each metric.
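
For illustration, a minimal usage sketch; the class name, arguments, and metric names below are hypothetical placeholders and not necessarily the actual API added in this PR:

```python
# Hypothetical usage sketch; class, argument, and metric names are illustrative only.
from auto_eval import AutoEvaluate  # assumed import path

evaluator = AutoEvaluate(
    dataset="explodinggradients/ragas-wikiqa",   # data with question, context, and answer fields
    template_dir="auto_eval/prompt_templates",   # relative path to the jinja2 prompt templates
    model="gpt-4o",                              # or an HF endpoint / a local model
    evaluation_metrics=["factualness", "relevance", "correctness", "readability"],
)

results = evaluator.measure()  # one judge-LLM call per example
for record in results:
    print(record)              # per-metric reasoning and score generated by the judge
```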

Issues

n/a.

Type of change

  • New feature (non-breaking change which adds new functionality)

Dependencies

The Python 3 packages openai, datasets, python-dotenv, and jsonlines are needed.

Tests

I added a unit test, tests/test_auto_eval.py, and ensured that it passes locally in an environment built from tests/requirements.txt with Python 3.10.

Furthermore, OpenAI models gpt-4o, gpt-4, and gpt-3.5-turbo-16k were tested. Endpoint testing was performed with meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1 on Gaudi2 machines. Local testing was performed with mistralai/Mixtral-8x7B-Instruct-v0.1.

For all tests, I used the explodinggradients/ragas-wikiqa dataset.
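
For reference, a minimal sketch of loading that dataset (the available splits and column names are not spelled out here and should be checked on the Hugging Face Hub):

```python
from datasets import load_dataset

# Load the evaluation dataset used for all tests in this PR.
data = load_dataset("explodinggradients/ragas-wikiqa")
print(data)  # inspect the available splits and columns before running evaluation
```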

adkakne and others added 17 commits September 20, 2024 03:27
@adkakne (Contributor, Author) commented Sep 25, 2024

Hi @lkk12014402,

The new UT is running correctly locally:

(screenshot: auto_eval_local_run_successful)

When I pass an absolute path for template_dir, it does not work because we load the environment using dotenv, so a relative path has to be provided to jinja2.
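
For reference, a minimal sketch of loading a template via a relative path (the directory and template names below are illustrative, not necessarily the ones in this PR):

```python
from jinja2 import Environment, FileSystemLoader

# Relative path to the prompt templates; in this setup an absolute path did not
# resolve correctly, so the path is kept relative to the working directory.
env = Environment(loader=FileSystemLoader("auto_eval/prompt_templates"))
template = env.get_template("opening_prompt.jinja")  # illustrative template name
prompt = template.render(question="...", context="...", answer="...")
```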

@minmin-intel (Collaborator) commented:

@adkakne I feel that auto-eval is more like metrics, so it should be in the metrics folder like ragas. Evals in the evals/evaluation folder can use auto-eval metrics instead of ragas if they choose to do so. Each folder in evals/evaluation has a benchmark dataset to evaluate against and uses metrics from the metrics folder to measure performance.

@adkakne (Contributor, Author) commented Sep 25, 2024

> @adkakne I feel that auto-eval is more like metrics, so it should be in the metrics folder like ragas. Evals in the evals/evaluation folder can use auto-eval metrics instead of ragas if they choose to do so. Each folder in evals/evaluation has a benchmark dataset to evaluate against and uses metrics from the metrics folder to measure performance.

Hi Minmin, thank you for your comments. I moved auto_eval to the metrics folder.

@lkk12014402 (Collaborator) commented Sep 26, 2024

Hi Kakne @adkakne, what are the differences from the original ragas? I think we can also use ragas when there are no ground-truth answers?

@adkakne (Contributor, Author) commented Sep 26, 2024

> Hi Kakne @adkakne, what are the differences from the original ragas? I think we can also use ragas when there are no ground-truth answers?

Hi Kaokao,

Thank you for your review and comment. I shared a detailed response over MS Teams.

Short answer: we want to supplement ragas with an annotation-free feature, as ragas allows only 4 out of 10 metrics to run without ground truth (we are not counting the summarization score, as we don't offer it yet).

(screenshot: ragas_required_columns)

Secondly, the ragas prompting technique does not work with every open-source LLM (I shared more details offline). AutoEval uses a different prompting technique, similar to the one used in the C-RAG paper by Meta.

Lastly, auto_eval makes one LLM call per example, while ragas makes num_metrics calls per example. So the user can save on cost with proprietary LLMs, or improve efficiency when using an endpoint on Gaudi or local compute.

Please let me know if you have any more feedback. Thank you.
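
To illustrate the single-call design, here is a minimal sketch (illustrative only; the prompt wording, metric names, and JSON parsing are assumptions, not the PR's actual code):

```python
# Illustrative sketch of the one-judge-call-per-example idea; prompt text,
# metric names, and JSON parsing are assumptions, not the PR's actual code.
import json
from openai import OpenAI

METRICS = ["factualness", "relevance", "correctness", "readability"]  # example metric names

def judge_example(question: str, context: str, answer: str) -> dict:
    """Score all metrics for one example with a single judge-LLM call."""
    prompt = (
        "You are a RAG evaluator. For the question, retrieved context, and answer below, "
        f"return JSON with a short reasoning and a 1-5 score for each of: {', '.join(METRICS)}.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # One call returns reasoning and scores for every metric at once.
    return json.loads(resp.choices[0].message.content)
```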

@lkk12014402 (Collaborator) commented:


Thanks for your explanations. It looks good to me. Thanks~

@lkk12014402 (Collaborator) commented:

The metric name auto_eval doesn't make sense as a metric name the way ragas does, so let us think about a proper name.

adkakne and others added 14 commits October 10, 2024 19:48
* Optimize path and link validity check.
* added output_folders
* updated subprocess run for output folders
* get report done
* fixed the output_folder issues
* add the func of run_benchmark
* add return
* Support sharegpt dataset in chatqna e2e test
* Change the log level for selected questions
@adkakne (Contributor, Author) commented Oct 11, 2024

Closing this PR in order to limit the number of commits; I have opened a fresh PR.

@adkakne closed this Oct 11, 2024