Skip to content

Commit

Permalink
Add customizable evaluation dimensions (#256)
Browse files Browse the repository at this point in the history
* add customizable evaluation dimensions

* add docs

* fix mypy error & refactor examples

* add docs for evaluation dimensions

* update docs and examples

* add test cases and fix mypy issue

* fix mypy issue

* Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk) (#262)

Co-authored-by: openhands <openhands@all-hands.dev>

* Fix/custom eval dimension test (#263)

* Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk)

* Update documentation for SotopiaDimension and EvaluationDimensionBuilder

* [autofix.ci] apply automated fixes

* Add API documentation for evaluation dimensions

* Refine API documentation for evaluation_dimensions.py to match style

* [autofix.ci] apply automated fixes

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* add doc

---------

Co-authored-by: XuhuiZhou <zhouxuhui2018@gmail.com>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
  • Loading branch information
4 people authored Dec 8, 2024
1 parent 693f792 commit 5a9f4b7
Show file tree
Hide file tree
Showing 7 changed files with 595 additions and 1 deletion.
116 changes: 116 additions & 0 deletions docs/pages/concepts/evaluation_dimension.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
## Overview

Evaluation dimensions are used to evaluate the quality of social interactions.
In original Sotopia paper, there are 7 dimensions to evaluate the quality of social interactions, where we named them as `sotopia` evaluation dimensions:
- believability
- relationship
- knowledge
- secret
- social rules
- financial and material benefits
- goal

The `SotopiaDimensions` can be used directly without initializing the database. It provides a set of predefined evaluation dimensions that are ready to use for evaluating social interactions. For example,

```python
from sotopia.envs.parallel import ParallelSotopiaEnv
from sotopia.envs.evaluators import EvaluationForTwoAgents, ReachGoalLLMEvaluator, RuleBasedTerminatedEvaluator, SotopiaDimensions

env = ParallelSotopiaEnv(
env_profile=env_profile,
model_name=model_names["env"],
action_order="round-robin",
evaluators=[
RuleBasedTerminatedEvaluator(max_turn_number=20, max_stale_turn=2),
],
terminal_evaluators=[
ReachGoalLLMEvaluator(
model_names["env"],
EvaluationForTwoAgents[SotopiaDimensions], # type: ignore
# TODO check how to do type annotation
),
],
)
```


However we observe under many use cases people may want to evaluate with customized evaluation metrics, so we provide a way to build custom evaluation dimensions.
For a quick reference, you can directly check out the `examples/use_custom_dimensions.py`.

### CustomEvaluationDimension
The [`CustomEvaluationDimension`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension.
There are four parameters:
- name: the name of the dimension
- description: the description of the dimension
- range_low: the minimum score of the dimension (should be an integer)
- range_high: the maximum score of the dimension (should be an integer)

### CustomEvaluationDimensionList
The [`CustomEvaluationDimensionList`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension list based on the existing dimensions. It helps one to group multiple dimensions together for a specific use case.
There are two parameters:
- name: the name of the dimension list
- dimension_pks: the primary keys of the dimensions in the dimension list

### EvaluationDimensionBuilder
The [`EvaluationDimensionBuilder`](/python_API/database/evaluation_dimensions) is a class that can be used to generate a custom evaluation dimension model based on the existing dimensions.


## Usage
### Initialize the database
The default evaluation metric is still `SotopiaDimensions` in `sotopia.env.evaluators`.There is no `CustomEvaluationDimension` in the database by default. To initialize the database, please refer to `examples/use_custom_dimensions.py`.


### Use the custom evaluation dimensions
After you initialize your customized evaluation dimensions, you can choose to use any one of these methods provided below:

#### Method 1: Choose dimensions by names
```python
evaluation_dimensions = (
EvaluationDimensionBuilder.select_existing_dimension_model_by_name(
["transactivity", "verbal_equity"]
)
)
```

#### Method 2: Directly choose the grouped evaluation dimension list
```python
evaluation_dimensions = (
EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
"sotopia"
)
)
```

#### Method 3: Build a custom evaluation dimension model temporarily
We provide multiple ways to build a custom evaluation dimension model with `EvaluationDimensionBuilder`, specifically:
- `generate_dimension_model`: build an evaluation dimension from existing dimension primary keys.
- `generate_dimension_model_from_dict`: build an evaluation dimension from a dictionary that specifies the parameters of the `CustomEvaluationDimension`. For example
```json
[
{
"name": "believability",
"description": "The believability of the interaction",
"range_low": 0,
"range_high": 10
},
...
]
```
- `select_existing_dimension_model_by_name`: build an evaluation dimension from existing dimension names. For example `['believability', 'goal']`
- `select_existing_dimension_model_by_list_name`: build an evaluation dimension from existing `CustomEvaluationDimensionList` list names. For example, directly use `sotopia`.


After you get the evaluation dimension model, you can pass it as a parameter for the `Evaluator`, for example,
```python
evaluation_dimensions = (
EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
"sotopia"
)
)
terminal_evaluators=[
ReachGoalLLMEvaluator(
model_names["env"],
EvaluationForTwoAgents[evaluation_dimensions], # type: ignore
),
],
```
54 changes: 54 additions & 0 deletions docs/pages/python_API/database/evaluation_dimensions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# `evaluation_dimensions.py`

This module provides classes and utilities for defining and managing custom evaluation dimensions within the Sotopia environment. It includes classes for individual dimensions, lists of dimensions, and a builder for creating dimension models.

## Classes

### `CustomEvaluationDimension`

Represents a custom evaluation dimension with specific attributes such as name, description, and score range.

#### Attributes
- `name`: `str`. The name of the dimension.
- `description`: `str`. A brief description of the dimension.
- `range_low`: `int`. The minimum score for the dimension.
- `range_high`: `int`. The maximum score for the dimension.

### `CustomEvaluationDimensionList`

Groups multiple custom evaluation dimensions together.

#### Attributes
- `name`: `str`. The name of the dimension list.
- `dimension_pks`: `list[str]`. A list of primary keys for the dimensions included in the list.

### `EvaluationDimensionBuilder`

Provides utility methods to create and manage evaluation dimension models.

#### Methods
- `create_range_validator(low: int, high: int)`: Creates a validator for score ranges.

**Arguments:**
- `low`: `int`. The minimum score allowed.
- `high`: `int`. The maximum score allowed.

- `build_dimension_model(dimension_ids: list[str])`: Builds a dimension model from primary keys.

**Arguments:**
- `dimension_ids`: `list[str]`. A list of dimension primary keys.

- `build_dimension_model_from_dict(dimensions: list[dict[str, Union[str, int]]])`: Builds a dimension model from a dictionary.

**Arguments:**
- `dimensions`: `list[dict[str, Union[str, int]]]`. A list of dictionaries specifying dimension attributes.

- `select_existing_dimension_model_by_name(dimension_names: list[str])`: Selects a dimension model by dimension names.

**Arguments:**
- `dimension_names`: `list[str]`. A list of dimension names.

- `select_existing_dimension_model_by_list_name(list_name: str)`: Selects a dimension model by list name.

**Arguments:**
- `list_name`: `str`. The name of the dimension list.
17 changes: 16 additions & 1 deletion examples/experiment_eval.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
EnvAgentComboStorage,
EnvironmentProfile,
EpisodeLog,
EvaluationDimensionBuilder,
)
from sotopia.envs.evaluators import (
EvaluationForTwoAgents,
Expand All @@ -34,6 +35,7 @@
)
from sotopia.server import run_async_server
from sotopia_conf.gin_utils import parse_gin_flags, run
# from sotopia.database import EvaluationDimensionBuilder

_DEFAULT_GIN_SEARCH_PATHS = [
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
Expand Down Expand Up @@ -109,6 +111,18 @@ def _iterate_env_agent_combo_not_in_db(
tag: str | None = None,
) -> Generator[EnvAgentCombo[Observation, AgentAction], None, None]:
"""We iterate over each environment and return the **first** env-agent combo that is not in the database."""
# loading evaluation metric
try:
evaluation_dimensions = EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
"sotopia"
) # Initialize your customized dimension, please refer to `examples/use_custom_dimensions.py`
except Exception as e:
print(
"No customized evaluation dimensions found, using default SotopiaDimensions",
e,
)
evaluation_dimensions = SotopiaDimensions

if not env_ids:
env_ids = list(EnvironmentProfile.all_pks())
for env_id in env_ids:
Expand Down Expand Up @@ -152,7 +166,8 @@ def _iterate_env_agent_combo_not_in_db(
terminal_evaluators=[
ReachGoalLLMEvaluator(
model_names["env"],
EvaluationForTwoAgents[SotopiaDimensions],
EvaluationForTwoAgents[evaluation_dimensions], # type: ignore
# TODO check how to do type annotation
),
],
)
Expand Down
Loading

0 comments on commit 5a9f4b7

Please sign in to comment.