Added Icelandic inflection eval; JsonMatch eval function #1387
@@ -12,3 +12,6 @@ venv/
.idea/

build

openai-key.txt
*.code-workspace
@@ -0,0 +1,106 @@
import json
import random
from typing import Any, Dict, List, Mapping, Union, cast

import numpy as np

import evals
from evals.api import CompletionFn
from evals.record import RecorderBase


def json_match(sampled_json: Any, correct_json: Any) -> bool:
    """Return True if the sampled completion in JSON format
    matches a correct answer, component by component"""
    if sampled_json is None or correct_json is None:
        # Missing values are never correct
        return False
    if isinstance(sampled_json, dict):
        if isinstance(correct_json, dict):
            sample = cast(Mapping[str, Any], sampled_json)
            correct = cast(Mapping[str, Any], correct_json)
            all_keys = set(sample.keys()) | set(correct.keys())
            return all(json_match(sample.get(key), correct.get(key)) for key in all_keys)
        else:
            return False
    elif isinstance(sampled_json, list):
        if isinstance(correct_json, list):
            slist = cast(List[Any], sampled_json)
            clist = cast(List[Any], correct_json)
            if len(slist) != len(clist):
                # Lists must have the same length
                return False
            return all(json_match(s, c) for s, c in zip(slist, clist))
        else:
            return False
    # Not a structured item: do a direct comparison
    return sampled_json == correct_json


class JsonMatch(evals.Eval):

    """Compares a JSON completion with one or more ideal answers,
    also coded in JSON. The decoded JSON objects are compared
    elementwise and must match exactly."""

    def __init__(
        self,
        completion_fns: list[CompletionFn],
        samples_jsonl: str,
        *args: Any,
        max_tokens: int = 512,  # Increase this for longer JSON completions
        **kwargs: Any,
    ):
        super().__init__(completion_fns, *args, **kwargs)
        assert len(completion_fns) == 1, "JsonMatch only supports one completion fn"
        self.max_tokens = max_tokens
        self.samples_jsonl = samples_jsonl

    def eval_sample(self, sample: Any, rng: random.Random):
        del rng

        assert isinstance(sample, dict), "sample must be a dict"
        assert "input" in sample, "sample must have an 'input' key"
        assert "ideal" in sample, "sample must have an 'ideal' key"

        prompt = cast(str, sample["input"])
        correct_answers = cast(Union[str, List[str]], sample["ideal"])
        if not isinstance(correct_answers, list):
            correct_answers = [correct_answers]

        result = self.completion_fn(
            prompt=prompt,
            temperature=0.0,  # Q: why are these hardcoded?
            max_tokens=self.max_tokens,
        )
        sampled = result.get_completions()[0]

        sampled_json: Any
        try:
            sampled_json = json.loads(sampled)
        except ValueError:
            # If the sampled string is not valid JSON, it will never match
            sampled_json = None

        # Allow the following to raise ValueError; the correct answers
        # should always be valid JSON
        correct_json = [json.loads(correct_answer) for correct_answer in correct_answers]

        matches = [json_match(sampled_json, cj) for cj in correct_json]

        evals.record.record_match(
            True in matches,
            expected=correct_answers,
            picked=[sampled for i in range(len(correct_answers)) if matches[i]],
        )
        evals.record.record_metrics(
            accuracy=float(True in matches),
        )

    def run(self, recorder: RecorderBase) -> Dict[str, float]:
        samples = self.get_samples()
        self.eval_all_samples(recorder, samples)

        return {
            "accuracy": np.mean(recorder.get_scores("accuracy")),
        }
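The comparison rules in this file have a few consequences that are easy to miss, so a standalone sketch may help. The block below restates `json_match` in condensed form and wraps it in a hypothetical `score` helper (not part of the PR) that mimics the parse-and-compare steps of `eval_sample` without depending on the `evals` package. It is an illustration under those assumptions, not the merged implementation.

```python
import json
from typing import Any, List, Union


def json_match(sampled: Any, correct: Any) -> bool:
    """Condensed restatement of the component-by-component comparison."""
    if sampled is None or correct is None:
        return False  # missing values (and JSON null) never match
    if isinstance(sampled, dict):
        if not isinstance(correct, dict):
            return False
        keys = set(sampled) | set(correct)
        return all(json_match(sampled.get(k), correct.get(k)) for k in keys)
    if isinstance(sampled, list):
        if not isinstance(correct, list) or len(sampled) != len(correct):
            return False
        return all(json_match(s, c) for s, c in zip(sampled, correct))
    return sampled == correct  # scalars: plain Python equality


def score(completion: str, ideal: Union[str, List[str]]) -> float:
    """Hypothetical helper: 1.0 if the completion matches any ideal answer."""
    ideals = ideal if isinstance(ideal, list) else [ideal]
    try:
        sampled_json = json.loads(completion)
    except ValueError:
        sampled_json = None  # invalid JSON can never match
    correct_json = [json.loads(i) for i in ideals]  # ideals must be valid JSON
    return float(any(json_match(sampled_json, cj) for cj in correct_json))


print(score('{"a": 1, "b": [1.0, 2]}', '{"b": [1, 2.0], "a": 1}'))  # 1.0
print(score('{"a": null}', '{"a": null}'))  # 0.0
print(score('{"a": true}', '{"a": 1}'))  # 1.0
print(score("not json", '{"a": 1}'))  # 0.0
```

Two points follow from the rules as written: a `null` value never matches, even against another `null`, because `None` is treated as a missing value; and scalars use Python equality, so `1.0` equals `1` and `true` equals `1`.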
@@ -0,0 +1,98 @@
from pathlib import Path
from typing import Any, Type

from mock import patch
from pytest import mark, raises

from evals.api import DummyCompletionFn
from evals.elsuite.basic.json_match import JsonMatch
from evals.record import DummyRecorder
from evals.utils.test import TestCompletionFn


@mark.parametrize(
    "completion, ideal, expected_metrics",
    [
        # Basic match
        ('{ "key": "value" }', '{ "key": "value" }', dict(accuracy=1.0)),
        # Whitespace is not significant
        ('{\n "key":"value"\n }\n', '{ "key": "value" }', dict(accuracy=1.0)),
        # Key order is not significant
        (
            '{ "key2": "foo", "key1": "bar" }',
            '{ "key1": "bar", "key2": "foo" }',
            dict(accuracy=1.0),
        ),
        # No match if values are different
        ('{ "key": "value" }', '{ "key": "notvalue" }', dict(accuracy=0)),
        # Values can be numbers as well as strings
        ('{ "key": 100 }', '{ "key": 100 }', dict(accuracy=1.0)),
        # Numerical values are not accepted if they differ
        ('{ "key": 100 }', '{ "key": 100.1 }', dict(accuracy=0)),
        # Completion is accepted if it is found in an array of valid answers
        ('{ "key": 100 }', ['{ "key": 100.1 }', '{ "key": 100 }'], dict(accuracy=1.0)),
        # Completion is not accepted if it is not found in an array of valid answers
        ('{ "key": 100 }', ['{ "key": 100.1 }', '{ "key": 99.9 }'], dict(accuracy=0)),
        # Different keys do not match
        ('{ "key": "value" }', '{ "anotherkey": "value" }', dict(accuracy=0)),
        # Missing keys do not match
        (
            '{ "key": "value" }',
            '{ "key": "value", "anotherkey": "value" }',
            dict(accuracy=0),
        ),
        # Extra keys do not match
        (
            '{ "key": "value", "anotherkey": "value" }',
            '{ "key": "value" }',
            dict(accuracy=0),
        ),
        # Lists are supported, and matched by element equality
        ('{ "key": [1.0,2.0,3.0] }', '{ "key": [1, 2, 3] }', dict(accuracy=1.0)),
        # Lists of different lengths do not match
        ('{ "key": [1, 2, 3] }', '{ "key": [1, 2, 3, 3] }', dict(accuracy=0)),
        # Lists that are not equal index-by-index do not match
        ('{ "key": [1, 2, 3] }', '{ "key": [1, 3, 2] }', dict(accuracy=0)),
        # An empty list does not match a nonempty list
        ('{ "key": [] }', '{ "key": [1] }', dict(accuracy=0)),
        # Completion with invalid JSON is not accepted
        ('{ "key": "value }', '{ "key": "value" }', dict(accuracy=0)),
    ],
)
def test_eval_sample(
    completion: str,
    ideal: list[str],
    expected_metrics: dict[str, float],
) -> None:
    eval = JsonMatch(
        completion_fns=[TestCompletionFn(completion)],
        samples_jsonl="",
        eval_registry_path=Path("."),
    )

    recorder = DummyRecorder(None)
    with recorder.as_default_recorder("x"), patch.object(
        recorder, "record_metrics", wraps=recorder.record_metrics
    ) as record_metrics:
        eval.eval_sample(dict(input=completion, ideal=ideal), None)
        record_metrics.assert_called_once_with(**expected_metrics)


@mark.parametrize(
    "sample, expected_error",
    [
        (None, AssertionError),
        ("", AssertionError),
        (dict(ideal="world"), AssertionError),  # Missing input
        (dict(input="world"), AssertionError),  # Missing ideal answer
    ],
)
def test_eval_sample_raises(sample: Any, expected_error: Type[Exception]) -> None:
    eval = JsonMatch(
        completion_fns=[DummyCompletionFn()],
        samples_jsonl="",
        eval_registry_path=Path("."),
    )

    with raises(expected_error):
        eval.eval_sample(sample, None)
Git LFS file not shown
Git LFS file not shown
evals/registry/data/icelandic-inflection-medium/samples.jsonl (3 changes: 3 additions & 0 deletions)
Git LFS file not shown
@@ -0,0 +1,9 @@
icelandic-inflection-easy:
  id: icelandic-inflection-easy.dev.v0
  description: Test the model's ability to correctly inflect Icelandic noun phrases (easiest category)
  metrics: [accuracy]

icelandic-inflection-easy.dev.v0:
  class: evals.elsuite.basic.json_match:JsonMatch
  args:
    samples_jsonl: icelandic-inflection-easy/samples.jsonl
@@ -0,0 +1,9 @@
icelandic-inflection-hard:
  id: icelandic-inflection-hard.dev.v0
  description: Test the model's ability to correctly inflect Icelandic noun phrases (hard category)
  metrics: [accuracy]

icelandic-inflection-hard.dev.v0:
  class: evals.elsuite.basic.json_match:JsonMatch
  args:
    samples_jsonl: icelandic-inflection-hard/samples.jsonl
@@ -0,0 +1,9 @@
icelandic-inflection-medium:
  id: icelandic-inflection-medium.dev.v0
  description: Test the model's ability to correctly inflect Icelandic noun phrases (medium category)
  metrics: [accuracy]

icelandic-inflection-medium.dev.v0:
  class: evals.elsuite.basic.json_match:JsonMatch
  args:
    samples_jsonl: icelandic-inflection-medium/samples.jsonl
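The sample files themselves are stored in Git LFS and are not shown here, so the following is only a guess at their shape. Since `eval_sample` calls `json.loads` on every `ideal` entry, each ideal answer has to be a JSON-encoded string (or a list of such strings) rather than a nested object. The sketch below builds one such line; the prompt wording, the answer keys, and the `make_sample_line` helper are all hypothetical and not taken from the PR.

```python
import json
from typing import Any, Dict, List

# Illustrative answer only: the four cases of the Icelandic noun "hestur".
answer: Dict[str, str] = {
    "nominative": "hestur",
    "accusative": "hest",
    "dative": "hesti",
    "genitive": "hests",
}


def make_sample_line(prompt: str, answers: List[Dict[str, Any]]) -> str:
    """Serialize one JSONL line; every ideal answer is itself a JSON string."""
    ideal = [json.dumps(a, ensure_ascii=False) for a in answers]
    return json.dumps({"input": prompt, "ideal": ideal}, ensure_ascii=False)


# Hypothetical prompt text; the real prompts in samples.jsonl are not shown.
line = make_sample_line(
    "Decline the Icelandic noun 'hestur' in the singular. Answer in JSON.",
    [answer],
)

# Reading the line back: the outer document decodes to a dict, and each
# ideal entry must be decoded a second time to recover the expected object.
decoded = json.loads(line)
print(json.loads(decoded["ideal"][0]) == answer)  # True
```

Passing `ensure_ascii=False` keeps Icelandic characters such as `þ` and `ð` readable in the file; it does not change how the answers compare after decoding.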
We set temp=0 for determinism in running the eval, otherwise results won't necessarily be replicable.
If there's a clear need for nonzero temp in the eval, we can definitely add it as an arg!
Yes, understood. This comment and question were in the original FuzzyMatch code that I used as a template for JsonMatch, so I just left them in ;-)
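If a nonzero temperature were ever needed, exposing it as an argument could look roughly like the sketch below. It uses a hypothetical `ConfigurableSampler` class and a fake completion function instead of the real `evals` API, and shows only how the parameter would be passed through; the default of `0.0` preserves the current replicable behavior.

```python
from typing import Any, Callable, Dict, List


class ConfigurableSampler:
    """Hypothetical stand-in for the sampling step of an eval class.

    This is not the real evals API; it only shows the parameter plumbing.
    """

    def __init__(
        self,
        completion_fn: Callable[..., str],
        max_tokens: int = 512,
        temperature: float = 0.0,  # default preserves the current temp=0 behavior
    ) -> None:
        self.completion_fn = completion_fn
        self.max_tokens = max_tokens
        self.temperature = temperature

    def sample(self, prompt: str) -> str:
        # Forward the configured values instead of hardcoding them.
        return self.completion_fn(
            prompt=prompt,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )


recorded_calls: List[Dict[str, Any]] = []


def fake_completion_fn(**kwargs: Any) -> str:
    """Test double that records the arguments it receives."""
    recorded_calls.append(kwargs)
    return "{}"


ConfigurableSampler(fake_completion_fn).sample("prompt")
ConfigurableSampler(fake_completion_fn, temperature=0.7).sample("prompt")
print([call["temperature"] for call in recorded_calls])  # [0.0, 0.7]
```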