Auto (annotation-free) evaluation of RAG #140

Closed
wants to merge 39 commits into from
Changes from 21 commits
Commits (39)
669997a
minimized required fields/columns in user data
adkakne Sep 19, 2024
80e2160
add bench-target as the prefix of output folder (#133)
daisy-ycguo Sep 19, 2024
eb98d2e
remove examples. (#135)
lkk12014402 Sep 19, 2024
ad58bd8
minor naming correction to maintain consistency
adkakne Sep 20, 2024
c49ea84
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 20, 2024
50d4167
Add hyperlinks and paths validation. (#132)
ZePan110 Sep 20, 2024
81151fd
Merge remote-tracking branch 'upstream/main'
adkakne Sep 20, 2024
bafc701
adding README for OPEA ragas
adkakne Sep 20, 2024
1cc5ffe
adding python3 syntax to README
adkakne Sep 20, 2024
08d6581
Merge remote-tracking branch 'upstream/main' into auto_eval
adkakne Sep 24, 2024
3390b5d
adding auto (annotation-free) evaluation - functionality
adkakne Sep 24, 2024
f998a00
auto-eval endpoint on gaudi - tested and working
adkakne Sep 24, 2024
7ffac2b
merging latest changes from upstream main
adkakne Sep 24, 2024
6bfadb2
updating testing environment
adkakne Sep 24, 2024
bd0d2af
adding unit test for auto eval - passing successfully
adkakne Sep 25, 2024
85b1ff9
editing parameters for online test environment
adkakne Sep 25, 2024
225f1a7
added working example with endpoint to README
adkakne Sep 25, 2024
3a67ff3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 25, 2024
9e2714e
added init file and changed import paths accordingly
adkakne Sep 25, 2024
2079a4d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 25, 2024
b4fa896
automatically setting template_dir param
adkakne Sep 25, 2024
f266c3c
moved auto_eval to metrics and generalized opening prompt
adkakne Sep 25, 2024
f134058
testing of auto-eval with Llama 3.2 successful
adkakne Sep 26, 2024
ccf74a7
Merge branch 'main' into auto_eval
lvliang-intel Sep 27, 2024
85ac47a
Merge branch 'main' into auto_eval
lvliang-intel Oct 7, 2024
3ca8ecb
removed .env loading and modularized metric templates
adkakne Oct 11, 2024
e9c087e
Merge branch 'main' into auto_eval
adkakne Oct 11, 2024
9157b83
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 11, 2024
e9c915a
merging requirements
adkakne Oct 11, 2024
b0424f0
Optimize path and link validity check. (#143)
ZePan110 Oct 8, 2024
d4c3391
Signed-off-by: Rowena Almeida <rowena.almeida.com> (#150)
rowenaal Oct 8, 2024
e321003
[Benchmark] Get benchmark reports. (#155)
Zhenzhong1 Oct 10, 2024
f1d2099
Support sharegpt dataset in chatqna e2e test (#152)
joshuayao Oct 10, 2024
ddd3607
Fix the issue of exiting due to inability to find hyperlinks. (#156)
ZePan110 Oct 11, 2024
e8a9850
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 11, 2024
1913b03
Merge branch 'auto_eval' of https://github.com/adkakne/GenAIEval into…
adkakne Oct 11, 2024
c332999
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 11, 2024
68a14a0
Removing python-dotenv from requirements
adkakne Oct 11, 2024
56ad626
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 11, 2024
1 change: 1 addition & 0 deletions evals/evaluation/auto_eval/.env
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
OPENAI_KEY=xxx
57 changes: 57 additions & 0 deletions evals/evaluation/auto_eval/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Auto (annotation-free) Evaluation of Retrieval Augmented Generation

We provide an easy-to-use, flexible, and annotation-free RAG evaluation tool that uses LLM-as-a-judge while benefiting from Intel's Gaudi2 AI accelerator chips.

## Overview
### Data
AutoEval is best suited for Long-Form Question Answering (LFQA) datasets where you want to gauge the quality and factualness of the answer using an LLM as the judge. You can use benchmarking datasets or bring your own custom datasets. Please make sure to set `field_map` to map AutoEval fields such as "question" to your dataset's corresponding fields such as "query", as shown in the sketch below.
> Note: To use benchmarking datasets, set `data_mode=benchmarking`. Similarly, to use custom datasets, set `data_mode=local`.
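For example, a hypothetical custom JSONL dataset and its `field_map` (the right-hand field names below are illustrative, not part of this tool) could look like:
```python3
# Each line of your JSONL file might look like:
# {"query": "...", "rag_answer": "...", "retrieved_docs": ["...", "..."]}
# Keys are AutoEval's field names; values are your dataset's field names.
field_map = {
    "question": "query",
    "answer": "rag_answer",
    "context": "retrieved_docs",  # list-valued fields are joined with newlines
}
dataset = "path/to/your_dataset.jsonl"  # local file path when data_mode="local"
data_mode = "local"
```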
### Model
AutoEval can run in 3 evaluation modes:
1. `evaluation_mode="endpoint"` uses a Hugging Face endpoint.
- We recommend launching a Hugging Face endpoint on Gaudi AI accelerator machines to ensure maximum utilization and performance.
- To launch an HF endpoint on Gaudi2, please follow the 2-step instructions here - [tgi-gaudi](https://github.com/huggingface/tgi-gaudi).
- Pass your endpoint URL as the `model_name` argument.
2. `evaluation_mode="openai"` uses the OpenAI backend.
- Please set your `OPENAI_KEY` and pass your choice of model as the `model_name` argument.
3. `evaluation_mode="local"` uses your local hardware.
- Set the `hf_token` argument and your favourite open-source model as the `model_name` argument.
- GPU usage is prioritized when a GPU is available; otherwise, the model runs on CPU. A minimal local-mode sketch is shown after this list.
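For example, a minimal local-mode sketch (the model id and token variable below are placeholders, not recommendations from this PR):
```python3
import os

evaluation_mode = "local"
model_name = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder; any open-source chat model you like
hf_token = os.getenv("HF_TOKEN")  # required for gated models on the Hugging Face Hub
```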
## Metrics
AutoEval provides 4 metrics: factualness, correctness, relevance, and readability. You can also bring your own metrics and grading scales; don't forget to add your metric name to the `evaluation_metrics` argument (a hypothetical example is sketched below).
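For example, to add a hypothetical `conciseness` metric, place a matching `conciseness_prompt.md` rubric under `auto_eval_metrics/metric_prompt_templates/` (the layout used by the built-in metrics) and list it:
```python3
# "conciseness" is a hypothetical custom metric; its rubric file is assumed to live at
# auto_eval_metrics/metric_prompt_templates/conciseness_prompt.md
evaluation_metrics = ["factualness", "relevance", "correctness", "readability", "conciseness"]
```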
## Generation configuration
Please set generation parameters as per your requirements in `GENERATION_CONFIG` in `run_eval.py`.
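For reference, the defaults in `run_eval.py` are keyed by `evaluation_mode`:
```python3
# Default generation parameters in run_eval.py, keyed by evaluation_mode.
GENERATION_CONFIG = {
    "openai": {"temperature": 0.1},
    "endpoint": {"max_tokens": 500},
    "local": {"max_new_tokens": 500},
}
```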

## Run using HF endpoint
```python3
import os

# AutoEvaluate is exported from this package's __init__.py
from evals.evaluation.auto_eval import AutoEvaluate

dataset = "explodinggradients/ragas-wikiqa"
data_mode = "benchmarking"
field_map = {"question": "question", "answer": "generated_with_rag", "context": "context"}

evaluation_mode = "endpoint"

host_ip = os.getenv("host_ip", "localhost")
port = os.getenv("port", "<add your port where your endpoint is running>")
model_name = f"http://{host_ip}:{port}"

evaluation_metrics = ["factualness", "relevance", "correctness", "readability"]

evaluator = AutoEvaluate(
dataset=dataset,
data_mode=data_mode,
field_map=field_map,
evaluation_mode=evaluation_mode,
model_name=model_name,
evaluation_metrics=evaluation_metrics,
debug_mode=True,
)

responses = evaluator.measure()

for response in responses:
print(response)
```
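If you prefer the OpenAI backend instead, a minimal variant might look like the sketch below; the model name here is a placeholder, and the key can be supplied via the `openai_key` argument or the provided `.env` file.
```python3
evaluator = AutoEvaluate(
    dataset="explodinggradients/ragas-wikiqa",
    data_mode="benchmarking",
    field_map={"question": "question", "answer": "generated_with_rag", "context": "context"},
    evaluation_mode="openai",
    model_name="gpt-4o-mini",  # placeholder; use any OpenAI chat model you have access to
    openai_key=os.getenv("OPENAI_KEY"),
    evaluation_metrics=["factualness", "relevance", "correctness", "readability"],
    debug_mode=True,
)

responses = evaluator.measure()
```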
That's it! For troubleshooting, please submit an issue and we will get right on it.
10 changes: 10 additions & 0 deletions evals/evaluation/auto_eval/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#

from .run_eval import AutoEvaluate

__all__ = ["AutoEvaluate"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
- Correctness: Correctness measures how accurately and comprehensively the answer resolves the problem posed in the question.
- Score 1: If the answer is an empty string or something like "I do not know the answer", the correctness score is 1.
- Score 2: If the answer addresses only a small part of the question correctly, is missing many critical steps/aspects, or is too short to fully address the problem described in the question, then the correctness score is 2.
- Score 3: The answer mostly addresses the question but one critical aspect/step is missing or incorrect.
- Score 4: The answer mostly answers the question and covers all critical/main aspects, but it is missing important/necessary details about one or more aspects.
- Score 5: The answer correctly and completely addresses the question and also covers important details about each step.
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
- Factualness: Factualness assesses how much of the provided answer is contained within the provided context. A higher score indicates that a higher proportion of claims present in the answer are present or can be derived from the provided context.
- Score 1: The answer is completely hallucinated, i.e., not contained in the context at all, or there is no answer.
- Score 2: Only a small part of the answer is contained in the context; most of it is imaginary/hallucinated, or its meaning is completely changed from what is represented in the context.
- Score 3: Only about half of the answer is contained in the context. The rest of the answer is hallucinated or imaginary.
- Score 4: Most of the claims in the answer can be inferred from the provided context with very little information that is not directly supported by the provided context.
- Score 5: All of the claims in the answer are directly supported by the provided context, demonstrating high faithfulness to the provided context.
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
- Readability: Readability measures clarity and lucidity of the answer. Readability is measured solely based on the answer and it does not consider the question or the context.
- Score 1: The answer is empty, is "I do not know the answer", is completely unreadable, or no meaningful information can be extracted from it.
- Score 2: The answer is only slightly readable; there are irrelevant symbols, HTML tags, or repeated words, but it roughly forms a meaningful sentence that covers some aspects of the answer.
- Score 3: The answer can be read but contains grammatical mistakes.
- Score 4: The answer is readable, but the readability and style could be improved to better appeal to the reader.
- Score 5: The answer is reader-friendly and well written.
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
- Relevance: Relevance measures how well the answer relates to the question.
- Score 1: The answer doesn't mention anything about the question or is completely irrelevant to the question.
- Score 2: The answer only identifies the domain (e.g., cnvrg) mentioned in the question and provides information from the correct domain, but it does not address the question itself and completely misses the point of the question.
- Score 3: The answer correctly identifies the domain and essence of the question, but the details in the answer are not relevant to the focus of the question.
- Score 4: The answer correctly identifies the domain mentioned in the question and the essence of the question, and stays consistent with both, but some part of the answer is not relevant to the question, its topic, or its essence; this irrelevant part damages the overall relevance of the answer.
- Score 5: The answer is completely relevant to the question and the details do not deviate from the essence of the question. There are no parts of the answer that are irrelevant or unnecessary for the given question.
13 changes: 13 additions & 0 deletions evals/evaluation/auto_eval/auto_eval_metrics/opening_prompt.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
Consider yourself an engineer working at cnvrg.io, which is a Full Stack Machine Learning Operating System owned by Intel.

Your task:
You will be given an input consisting of a question, an answer and a context. Your task is to act as an impartial judge and provide a numerical score between 1 and 5 for each of the following metrics for the given answer.

Important rules for you while completing this task:
1. You MUST ALWAYS provide a score for every metric mentioned below.
2. Make sure to understand the definition of every metric fully before completing your task. Every metric is provided with a grading scale and rubric. You MUST use this grading scale and rubric to determine your score.
3. Ensure that your scores and reasoning for every metric are independent of each other, e.g., the score for factualness should not impact the score for correctness and vice versa.
4. Base your grading decision only on the given inputs and do not speculate or hallucinate.
5. You must also provide reasoning for your score in a single sentence.

Your metric definitions along with grading scale and rubric:
92 changes: 92 additions & 0 deletions evals/evaluation/auto_eval/prompt_engineering.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

from dotenv import load_dotenv
from jinja2 import Environment, FileSystemLoader, Template


class Prompt:
"""Class to customize prompt template using user-defined list of metrics."""

def __init__(self, metrics, input_fields, prompt_dir):
self.metrics = metrics
self.input_fields = input_fields
self.define_template_paths(prompt_dir)
self.template = self.load_prompt_template()

def define_template_paths(self, prompt_dir):
self.opening_prompt_path = os.path.join(prompt_dir, "opening_prompt.md")
metric_prompt_names = ["{}_prompt.md".format(metric) for metric in self.metrics]
local_metric_prompt_paths = [os.path.join("metric_prompt_templates", m) for m in metric_prompt_names]
self.metric_prompt_paths = [os.path.join(prompt_dir, p) for p in local_metric_prompt_paths]

def create_grading_format(self):
grading_format = (
"You must ALWAYS provide every single one of the scores and reasonings in the following JSON format:"
)
grading_format += "\n" + "{" + "\n"
content = []
reasoning_prompt = "Reasoning for {}: [your one line step by step reasoning about the {} of the answer]"
scoring_prompt = "Score for {}: [your score number for the {} of the answer]"
for metric in self.metrics:
reasoning = reasoning_prompt.format(metric, metric)
score = scoring_prompt.format(metric, metric)
content += (reasoning + "\n" + score,)
grading_format += "\n\n".join(content)
grading_format += "\n" + "}"
return grading_format

def create_closing_prompt(self):
closing_prompt = ["Let's begin!"]
for f in self.input_fields:
closing_prompt += ("Provided {}:".format(f) + "\n" + "{{" + f + "}}",)
return "\n\n".join(closing_prompt)

@staticmethod
def load_template(template_path):
        base_dir = os.path.dirname(os.path.abspath(__file__))
        env = Environment(loader=FileSystemLoader(base_dir))
return env.get_template(template_path)

def load_prompt_template(self):
content = [self.load_template(self.opening_prompt_path).render()]
for path in self.metric_prompt_paths:
content += (self.load_template(path).render(),)
content += (self.create_grading_format(),)
content += (self.create_closing_prompt(),)
return Template("\n\n".join(content))

def render_prompt(self, **kwargs) -> str:
text = self.template.render(**kwargs)
return text


if __name__ == "__main__":

"""Here, we test implementation of Prompt class."""

# step 0 - user input
metrics = ["factualness", "relevance", "correctness", "readability"]
input_fields = ["question", "answer", "context"]
prompt_dir = "./auto_eval_metrics/"

    # step 1 - load environment variables (e.g., OPENAI_KEY) from the local .env file
load_dotenv(os.path.join(os.path.dirname(__file__), ".env"), override=True)

# step 2 - load prompt using Prompt class
prompt = Prompt(metrics=metrics, input_fields=input_fields, prompt_dir=prompt_dir)

example = {
"question": "Who is wife of Barak Obama",
"context": "Michelle Obama, wife of Barak Obama (former President of the United States of America) is an attorney. Barak and Michelle Obama have 2 daughters - Malia and Sasha",
"answer": "Michelle Obama",
"ground_truth": "Wife of Barak Obama is Michelle Obama",
}

rendered_prompt = prompt.render_prompt(
question=example["question"], answer=example["answer"], context=example["context"]
)

print(rendered_prompt)
86 changes: 86 additions & 0 deletions evals/evaluation/auto_eval/rag_dataset.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

import jsonlines
from datasets import Dataset, load_dataset


class RAGDataset:
"""Dataset class to store data in HF datasets API format."""

def __init__(self, dataset, field_map, mode):
self.dataset = dataset
self.field_map = field_map
assert mode in ["local", "benchmarking"], "mode can be either local or benchmarking"
self.mode = mode
self.data = self.load_data()
self.validate_dataset()

def load_data(self):
if self.mode == "local":
assert os.path.exists(self.dataset), "There is no such file - {}".format(self.dataset)
with jsonlines.open(self.dataset) as reader:
data = []
for obj in reader:
ex = {}
for out_field, in_field in self.field_map.items():
                        if isinstance(obj[in_field], list):
ex[out_field] = "\n".join(obj[in_field])
else:
ex[out_field] = obj[in_field]
data.append(ex)
return Dataset.from_list(data)
else:
data = []
for obj in load_dataset(self.dataset)["train"]:
ex = {}
for out_field, in_field in self.field_map.items():
                    if isinstance(obj[in_field], list):
ex[out_field] = "\n".join(obj[in_field])
else:
ex[out_field] = obj[in_field]
data.append(ex)
return Dataset.from_list(data)

def validate_dataset(self):
for i, example in enumerate(self.data):
for out_field in self.field_map:
assert out_field in example, "Example {} does not have {} field".format(i + 1, out_field)

def __getitem__(self, index):
return self.data[index]

def __len__(self):
return len(self.data)

def __iter__(self):
return iter(self.data)


if __name__ == "__main__":

dataset_path = "../../benchmark/ragas/ground_truth.jsonl"
field_map = {
"question": "question",
"ground_truth": "ground_truth",
"context": "context",
}

ds = RAGDataset(dataset=dataset_path, field_map=field_map, mode="local")

for i, ex in enumerate(ds):
assert ex["question"] == ds[i]["question"], "index {} does not have correct query".format(i)

dataset = "explodinggradients/ragas-wikiqa"
field_map = {
"question": "question",
"answer": "generated_with_rag",
"context": "context",
"ground_truth": "correct_answer",
}
ds = RAGDataset(dataset=dataset, field_map=field_map, mode="benchmarking")

for i, ex in enumerate(ds):
assert ex["question"] == ds[i]["question"], "index {} does not have correct query".format(i)
88 changes: 88 additions & 0 deletions evals/evaluation/auto_eval/run_eval.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import ast
import json
import os
import time

import pandas as pd
from dotenv import load_dotenv
from huggingface_hub import login
from jinja2 import Environment, FileSystemLoader

from .prompt_engineering import Prompt
from .rag_dataset import RAGDataset
from .utils.helper import *
from .utils.model import *


class AutoEvaluate:

def __init__(
self,
dataset,
data_mode,
field_map,
evaluation_mode,
model_name,
evaluation_metrics,
hf_token=None,
openai_key=None,
debug_mode=None,
):
self.GENERATION_CONFIG = {
"openai": {"temperature": 0.1},
"endpoint": {"max_tokens": 500},
"local": {"max_new_tokens": 500},
}
self.load_env()
self.data = RAGDataset(dataset=dataset, field_map=field_map, mode=data_mode)
self.evaluator = self.get_evaluator(evaluation_mode, model_name, openai_key, hf_token)
self.prompt_template = self.get_template(evaluation_metrics, field_map)
self.debug_mode = debug_mode
self.generation_config = self.GENERATION_CONFIG[evaluation_mode]

def load_env(
self,
):
dot_env_path = os.path.join(os.path.dirname(__file__), ".env")
print("Loading dot environment from {}".format(dot_env_path))
load_dotenv(dot_env_path, override=True)

def get_evaluator(self, evaluation_mode, model_name, openai_key=None, hf_token=None):
if evaluation_mode == "openai":
# assert args.model_name in ALLOWED_OPENAI_MODELS, "please provide a openai model from the given list of allowed models"
print("Using {} openai key".format(openai_key))
evaluator = OAIEvaluator(openai_key, model_name)
elif evaluation_mode == "endpoint":
print("Loading HF endpoint at {}".format(model_name))
evaluator = EndpointEvaluator(model_name)
else:
            assert evaluation_mode == "local", "evaluation mode must be openai / endpoint / local"
            print("Loading {} model locally".format(model_name))
            login(token=hf_token)
            evaluator = HFEvaluator(model_name)
return evaluator

def get_template(self, evaluation_metrics, field_map):
return Prompt(metrics=evaluation_metrics, input_fields=field_map, prompt_dir="./auto_eval_metrics").template

def measure(self):
n_samples = 1 if self.debug_mode else len(self.data)
responses = [""] * n_samples
start = time.time()
for i in range(n_samples):
prompt = render_prompt(
self.prompt_template,
query=self.data[i]["question"],
answer=self.data[i]["answer"],
context=self.data[i]["context"],
)
messages = [{"role": "user", "content": prompt}]
response = self.evaluator.generate(messages, **self.generation_config)
responses[i] = response
end = time.time()
print("Generation of scores and reasoning took {:.2f} seconds for {:,} examples".format(end - start, n_samples))
return responses
7 changes: 7 additions & 0 deletions evals/evaluation/auto_eval/utils/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))