Added Icelandic inflection eval; JsonMatch eval function #1387

Merged: 2 commits, Oct 27, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -12,3 +12,6 @@ venv/
.idea/

build

openai-key.txt
*.code-workspace
4 changes: 4 additions & 0 deletions docs/eval-templates.md
@@ -11,6 +11,10 @@ For a model completion `a` and a reference list of correct answers `B`, the foll
- [`basic/includes.py:Includes`](../evals/elsuite/basic/includes.py): `any([(b in a) for b in B])`
- [`basic/fuzzy_match.py:FuzzyMatch`](../evals/elsuite/basic/fuzzy_match.py): `any([(a in b or b in a) for b in B])`

To compare a model completion `a` in *JSON format* to a reference list of correct answers `B`, also formatted in JSON, use the following eval:
- [`basic/json_match.py:JsonMatch`](../evals/elsuite/basic/json_match.py) yields a match if `a` is identical to at least one answer from `B`. Two JSON objects are
identical if they have the same set of keys and the values for each key are identical. Key order is not significant, and whitespace outside values is ignored. Invalid JSON never matches.
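For intuition, a quick sketch of why whitespace and key order do not matter: both strings decode to the same Python structure, so only the decoded content is compared (plain dict equality shown here; the eval itself uses its own `json_match` comparison):

```python
import json

# Whitespace and key order vanish at parse time: both strings decode
# to the same Python dict, so only the structure is compared.
a = json.loads('{\n  "key2": "foo",  "key1": "bar"\n}')
b = json.loads('{"key1": "bar", "key2": "foo"}')
print(a == b)                               # True: same keys, same values
print(a == json.loads('{"key1": "bar"}'))   # False: key sets differ
```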

Which eval template you use will depend on your use case. It is always recommended that you inspect the completions from your model, as this will help you determine how and whether to tweak your prompt (or your reference answers) and pick your eval template. Academic benchmarks oftentimes fit the mold of these basic evals, and we have implemented several end-to-end examples of academic evals as Jupyter notebooks in the `examples` folder.

Sometimes, [custom eval logic](custom-eval.md) will better suit your needs. One example of this is the [machine translation](../evals/elsuite/translate.py) eval [example](../examples/lafand-mt.ipynb), in which there is a unique and clearly defined metric that we wish to use in our eval. You should use your best judgment when deciding between custom eval logic, using a basic eval template, or using model-graded evals as described next.
106 changes: 106 additions & 0 deletions evals/elsuite/basic/json_match.py
@@ -0,0 +1,106 @@
import json
import random
from typing import Any, Dict, List, Mapping, Union, cast

import numpy as np

import evals
from evals.api import CompletionFn
from evals.record import RecorderBase


def json_match(sampled_json: Any, correct_json: Any) -> bool:
"""Return True if the sampled completion in JSON format
matches a correct answer, component by component"""
if sampled_json is None or correct_json is None:
# Missing values are never correct
return False
if isinstance(sampled_json, dict):
if isinstance(correct_json, dict):
sample = cast(Mapping[str, Any], sampled_json)
correct = cast(Mapping[str, Any], correct_json)
all_keys = set(sample.keys()) | set(correct.keys())
return all(json_match(sample.get(key), correct.get(key)) for key in all_keys)
else:
return False
elif isinstance(sampled_json, list):
if isinstance(correct_json, list):
slist = cast(List[Any], sampled_json)
clist = cast(List[Any], correct_json)
if len(slist) != len(clist):
# Lists must have the same length
return False
return all(json_match(s, c) for s, c in zip(slist, clist))
else:
return False
# Not a structured item: do a direct comparison
return sampled_json == correct_json


class JsonMatch(evals.Eval):

"""Compares a JSON completion with one or more ideal answers,
also coded in JSON. The decoded JSON objects are compared
elementwise and must match exactly."""

def __init__(
self,
completion_fns: list[CompletionFn],
samples_jsonl: str,
*args: Any,
max_tokens: int = 512, # Increase this for longer JSON completions
**kwargs: Any,
):
super().__init__(completion_fns, *args, **kwargs)
assert len(completion_fns) == 1, "JsonMatch only supports one completion fn"
self.max_tokens = max_tokens
self.samples_jsonl = samples_jsonl

def eval_sample(self, sample: Any, rng: random.Random):
del rng

assert isinstance(sample, dict), "sample must be a dict"
assert "input" in sample, "sample must have an 'input' key"
assert "ideal" in sample, "sample must have an 'ideal' key"

prompt = cast(str, sample["input"])
correct_answers = cast(Union[str, List[str]], sample["ideal"])
if not isinstance(correct_answers, list):
correct_answers = [correct_answers]

result = self.completion_fn(
prompt=prompt,
temperature=0.0, # Q: why are these hardcoded?
Review comment (Contributor): We set temp=0 for determinism in running the eval, otherwise results won't necessarily be replicable. If there's a clear need for nonzero temp in the eval, we can definitely add it as an arg!

Reply (Contributor Author): Yes, understood - this comment and question were in the original FuzzyMatch code that I used as a template for JsonMatch, so I just left it in ;-)
max_tokens=self.max_tokens,
)
sampled = result.get_completions()[0]

sampled_json: Any
try:
sampled_json = json.loads(sampled)
except ValueError:
# If the sampled string is not valid JSON, it will never match
sampled_json = None

# Allow the following to raise ValueError; the correct answers
# should always be valid JSON
correct_json = [json.loads(correct_answer) for correct_answer in correct_answers]

matches = [json_match(sampled_json, cj) for cj in correct_json]

evals.record.record_match(
True in matches,
expected=correct_answers,
picked=[sampled for i in range(len(correct_answers)) if matches[i]],
)
evals.record.record_metrics(
accuracy=float(True in matches),
)

def run(self, recorder: RecorderBase) -> Dict[str, float]:
samples = self.get_samples()
self.eval_all_samples(recorder, samples)

return {
"accuracy": np.mean(recorder.get_scores("accuracy")),
}
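A condensed, standalone re-implementation sketch of the matcher is handy for poking at edge cases outside the eval harness. This is for illustration only, not the eval's own code, though it mirrors `json_match` above, including the quirk that a JSON `null` never matches, even against another `null`:

```python
import json
from typing import Any


def json_match(sampled: Any, correct: Any) -> bool:
    """Condensed re-implementation of the eval's matcher, for illustration."""
    if sampled is None or correct is None:
        # Mirrors the original: missing values never match, which also
        # means a JSON null never matches another JSON null.
        return False
    if isinstance(sampled, dict) and isinstance(correct, dict):
        keys = set(sampled) | set(correct)
        return all(json_match(sampled.get(k), correct.get(k)) for k in keys)
    if isinstance(sampled, list) and isinstance(correct, list):
        return len(sampled) == len(correct) and all(
            json_match(s, c) for s, c in zip(sampled, correct)
        )
    if isinstance(sampled, (dict, list)) or isinstance(correct, (dict, list)):
        return False  # structural mismatch (dict vs list, dict vs scalar, ...)
    return sampled == correct  # scalars: note 1 == 1.0 in Python


print(json_match(json.loads('{"a": [1, 2]}'), json.loads('{ "a": [1.0, 2.0] }')))  # True
print(json_match(json.loads('{"a": 1}'), json.loads('{"a": 1, "b": 2}')))          # False
```

Note that scalar comparison inherits Python's `==`, which is why the test below accepts `[1.0, 2.0, 3.0]` against `[1, 2, 3]` while rejecting `100` against `100.1`.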
98 changes: 98 additions & 0 deletions evals/elsuite/basic/json_match_test.py
@@ -0,0 +1,98 @@
from pathlib import Path
from typing import Any, Type

from mock import patch
from pytest import mark, raises

from evals.api import DummyCompletionFn
from evals.elsuite.basic.json_match import JsonMatch
from evals.record import DummyRecorder
from evals.utils.test import TestCompletionFn


@mark.parametrize(
"completion, ideal, expected_metrics",
[
# Basic match
('{ "key": "value" }', '{ "key": "value" }', dict(accuracy=1.0)),
# Whitespace is not significant
('{\n "key":"value"\n }\n', '{ "key": "value" }', dict(accuracy=1.0)),
# Key order is not significant
(
'{ "key2": "foo", "key1": "bar" }',
'{ "key1": "bar", "key2": "foo" }',
dict(accuracy=1.0),
),
# No match if values are different
('{ "key": "value" }', '{ "key": "notvalue" }', dict(accuracy=0)),
# Values can be numbers as well as strings
('{ "key": 100 }', '{ "key": 100 }', dict(accuracy=1.0)),
# Numerical values are not accepted if they differ
('{ "key": 100 }', '{ "key": 100.1 }', dict(accuracy=0)),
# Completion is accepted if it is found in an array of valid answers
('{ "key": 100 }', ['{ "key": 100.1 }', '{ "key": 100 }'], dict(accuracy=1.0)),
# Completion is not accepted if it is not found in an array of valid answers
('{ "key": 100 }', ['{ "key": 100.1 }', '{ "key": 99.9 }'], dict(accuracy=0)),
# Different keys do not match
('{ "key": "value" }', '{ "anotherkey": "value" }', dict(accuracy=0)),
# Missing keys do not match
(
'{ "key": "value" }',
'{ "key": "value", "anotherkey": "value" }',
dict(accuracy=0),
),
# Extra keys do not match
(
'{ "key": "value", "anotherkey": "value" }',
'{ "key": "value" }',
dict(accuracy=0),
),
# Lists are supported, and matched by element equality
('{ "key": [1.0,2.0,3.0] }', '{ "key": [1, 2, 3] }', dict(accuracy=1.0)),
# Lists of different lengths do not match
('{ "key": [1, 2, 3] }', '{ "key": [1, 2, 3, 3] }', dict(accuracy=0)),
# Lists that are not equal index-by-index do not match
('{ "key": [1, 2, 3] }', '{ "key": [1, 3, 2] }', dict(accuracy=0)),
# An empty list does not match a nonempty list
('{ "key": [] }', '{ "key": [1] }', dict(accuracy=0)),
# Completion with invalid JSON is not accepted
('{ "key": "value }', '{ "key": "value" }', dict(accuracy=0)),
],
)
def test_eval_sample(
completion: str,
ideal: list[str],
expected_metrics: dict[str, float],
) -> None:
eval = JsonMatch(
completion_fns=[TestCompletionFn(completion)],
samples_jsonl="",
eval_registry_path=Path("."),
)

recorder = DummyRecorder(None)
with recorder.as_default_recorder("x"), patch.object(
recorder, "record_metrics", wraps=recorder.record_metrics
) as record_metrics:
eval.eval_sample(dict(input=completion, ideal=ideal), None)
record_metrics.assert_called_once_with(**expected_metrics)


@mark.parametrize(
"sample, expected_error",
[
(None, AssertionError),
("", AssertionError),
(dict(ideal="world"), AssertionError), # Missing input
(dict(input="world"), AssertionError), # Missing ideal answer
],
)
def test_eval_sample_raises(sample: Any, expected_error: Type[Exception]) -> None:
eval = JsonMatch(
completion_fns=[DummyCompletionFn()],
samples_jsonl="",
eval_registry_path=Path("."),
)

with raises(expected_error):
eval.eval_sample(sample, None)
3 changes: 3 additions & 0 deletions evals/registry/data/icelandic-inflection-easy/samples.jsonl
Git LFS file not shown
3 changes: 3 additions & 0 deletions evals/registry/data/icelandic-inflection-hard/samples.jsonl
Git LFS file not shown
evals/registry/data/icelandic-inflection-medium/samples.jsonl
Git LFS file not shown
9 changes: 9 additions & 0 deletions evals/registry/evals/icelandic-inflection-easy.yaml
@@ -0,0 +1,9 @@
icelandic-inflection-easy:
id: icelandic-inflection-easy.dev.v0
description: Test the model's ability to correctly inflect Icelandic noun phrases (easiest category)
metrics: [accuracy]

icelandic-inflection-easy.dev.v0:
class: evals.elsuite.basic.json_match:JsonMatch
args:
samples_jsonl: icelandic-inflection-easy/samples.jsonl
9 changes: 9 additions & 0 deletions evals/registry/evals/icelandic-inflection-hard.yaml
@@ -0,0 +1,9 @@
icelandic-inflection-hard:
id: icelandic-inflection-hard.dev.v0
description: Test the model's ability to correctly inflect Icelandic noun phrases (hard category)
metrics: [accuracy]

icelandic-inflection-hard.dev.v0:
class: evals.elsuite.basic.json_match:JsonMatch
args:
samples_jsonl: icelandic-inflection-hard/samples.jsonl
9 changes: 9 additions & 0 deletions evals/registry/evals/icelandic-inflection-medium.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
icelandic-inflection-medium:
id: icelandic-inflection-medium.dev.v0
description: Test the model's ability to correctly inflect Icelandic noun phrases (medium category)
metrics: [accuracy]

icelandic-inflection-medium.dev.v0:
class: evals.elsuite.basic.json_match:JsonMatch
args:
samples_jsonl: icelandic-inflection-medium/samples.jsonl
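The Git LFS sample files referenced above are not shown in the diff, but from `eval_sample`'s assertions each JSONL line must carry an `input` prompt and an `ideal` answer, where `ideal` is a JSON string or a list of acceptable JSON strings. A hypothetical line with placeholder content (not the actual Icelandic data) might look like:

```json
{"input": "Inflect the following noun phrase and reply in JSON: ...", "ideal": "{ \"answer\": \"...\" }"}
```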
6 changes: 6 additions & 0 deletions evals/utils/test.py
@@ -5,6 +5,9 @@


class TestCompletionResult(CompletionResult):

__test__ = False # Prevent pytest from trying to run this class as a test

def __init__(self, completion: str):
self.completion = completion

@@ -13,6 +16,9 @@ def get_completions(self) -> list[str]:


class TestCompletionFn(CompletionFn):

__test__ = False # Prevent pytest from trying to run this class as a test

def __init__(self, completion: str):
self.completion = completion
