Add customizable evaluation dimensions (#256)

* add customizable evaluation dimensions
* add docs
* fix mypy error & refactor examples
* add docs for evaluation dimensions
* update docs and examples
* add test cases and fix mypy issue
* fix mypy issue
* Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk) (#262)

  Co-authored-by: openhands <openhands@all-hands.dev>
* Fix/custom eval dimension test (#263)
  * Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk)
  * Update documentation for SotopiaDimension and EvaluationDimensionBuilder
  * [autofix.ci] apply automated fixes
  * Add API documentation for evaluation dimensions
  * Refine API documentation for evaluation_dimensions.py to match style
  * [autofix.ci] apply automated fixes

  ---------

  Co-authored-by: openhands <openhands@all-hands.dev>
  Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
* add doc

---------

Co-authored-by: XuhuiZhou <zhouxuhui2018@gmail.com>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
sotopia-lab · Dec 8, 2024 · 5a9f4b7
1 parent 693f792
commit 5a9f4b7
Showing 7 changed files with 595 additions and 1 deletion.
diff --git a/docs/pages/concepts/evaluation_dimension.md b/docs/pages/concepts/evaluation_dimension.md
@@ -0,0 +1,116 @@
+## Overview
+
+Evaluation dimensions are used to evaluate the quality of social interactions.
+The original Sotopia paper defines seven dimensions, which we refer to as the `sotopia` evaluation dimensions:
+- believability
+- relationship
+- knowledge
+- secret
+- social rules
+- financial and material benefits
+- goal
+
+`SotopiaDimensions` can be used directly without initializing the database. It provides a set of predefined evaluation dimensions that are ready to use for evaluating social interactions. For example:
+
+```python
+from sotopia.envs.parallel import ParallelSotopiaEnv
+from sotopia.envs.evaluators import EvaluationForTwoAgents, ReachGoalLLMEvaluator, RuleBasedTerminatedEvaluator, SotopiaDimensions
+
+env = ParallelSotopiaEnv(
+    env_profile=env_profile,
+    model_name=model_names["env"],
+    action_order="round-robin",
+    evaluators=[
+        RuleBasedTerminatedEvaluator(max_turn_number=20, max_stale_turn=2),
+    ],
+    terminal_evaluators=[
+        ReachGoalLLMEvaluator(
+            model_names["env"],
+            EvaluationForTwoAgents[SotopiaDimensions],  # type: ignore
+            # TODO check how to do type annotation
+        ),
+    ],
+)
+```
+
+
+However, many use cases call for customized evaluation metrics, so we provide a way to build custom evaluation dimensions.
+For a quick reference, see `examples/use_custom_dimensions.py`.
+
+### CustomEvaluationDimension
+The [`CustomEvaluationDimension`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension.
+There are four parameters:
+- name: the name of the dimension
+- description: the description of the dimension
+- range_low: the minimum score of the dimension (should be an integer)
+- range_high: the maximum score of the dimension (should be an integer)
+
+### CustomEvaluationDimensionList
+The [`CustomEvaluationDimensionList`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension list based on the existing dimensions. It groups multiple dimensions together for a specific use case.
+There are two parameters:
+- name: the name of the dimension list
+- dimension_pks: the primary keys of the dimensions in the dimension list
+
+### EvaluationDimensionBuilder
+The [`EvaluationDimensionBuilder`](/python_API/database/evaluation_dimensions) is a class that can be used to generate a custom evaluation dimension model based on the existing dimensions.
+
+
+## Usage
+### Initialize the database
+The default evaluation metric is still `SotopiaDimensions` in `sotopia.envs.evaluators`. There is no `CustomEvaluationDimension` in the database by default. To initialize the database, please refer to `examples/use_custom_dimensions.py`.
+
+
+### Use the custom evaluation dimensions
+After you initialize your customized evaluation dimensions, you can use any of the methods below:
+
+#### Method 1: Choose dimensions by names
+```python
+evaluation_dimensions = (
+    EvaluationDimensionBuilder.select_existing_dimension_model_by_name(
+        ["transactivity", "verbal_equity"]
+    )
+)
+```
+
+#### Method 2: Directly choose the grouped evaluation dimension list
+```python
+evaluation_dimensions = (
+    EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
+        "sotopia"
+    )
+)
+```
+
+#### Method 3: Build a custom evaluation dimension model temporarily
+We provide multiple ways to build a custom evaluation dimension model with `EvaluationDimensionBuilder`, specifically:
+- `build_dimension_model`: build an evaluation dimension model from existing dimension primary keys.
+- `build_dimension_model_from_dict`: build an evaluation dimension model from a list of dictionaries, each specifying the parameters of a `CustomEvaluationDimension`. For example:
+```json
+[
+    {
+        "name": "believability",
+        "description": "The believability of the interaction",
+        "range_low": 0,
+        "range_high": 10
+    },
+    ...
+]
+```
+- `select_existing_dimension_model_by_name`: build an evaluation dimension model from existing dimension names, for example `['believability', 'goal']`.
+- `select_existing_dimension_model_by_list_name`: build an evaluation dimension model from the name of an existing `CustomEvaluationDimensionList`, for example `sotopia`.
+
+
+Once you have the evaluation dimension model, you can pass it as a parameter to the `Evaluator`. For example:
+```python
+evaluation_dimensions = (
+    EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
+        "sotopia"
+    )
+)
+# Pass this list as the `terminal_evaluators` argument of `ParallelSotopiaEnv`.
+terminal_evaluators = [
+    ReachGoalLLMEvaluator(
+        model_names["env"],
+        EvaluationForTwoAgents[evaluation_dimensions],  # type: ignore
+    ),
+]
+```
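The relationship between `CustomEvaluationDimension`, `CustomEvaluationDimensionList`, and selection by list name described in this file can be sketched without a database. The following is a minimal in-memory stand-in, not the Sotopia implementation: `Dimension`, `DimensionList`, `DIMENSIONS`, `LISTS`, `select_by_list_name`, the primary keys, the list name `collaboration`, and the descriptions are all hypothetical, and plain dictionaries replace the database-backed storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """Hypothetical stand-in for a stored CustomEvaluationDimension."""

    pk: str
    name: str
    description: str
    range_low: int
    range_high: int


@dataclass(frozen=True)
class DimensionList:
    """Hypothetical stand-in for a stored CustomEvaluationDimensionList."""

    name: str
    dimension_pks: tuple[str, ...]


# In-memory "tables" (assumption: the real storage is a database).
# Primary keys and descriptions are illustrative only.
DIMENSIONS = {
    "d1": Dimension("d1", "transactivity", "Illustrative description", 0, 10),
    "d2": Dimension("d2", "verbal_equity", "Illustrative description", 0, 10),
}
LISTS = {"collaboration": DimensionList("collaboration", ("d1", "d2"))}


def select_by_list_name(list_name: str) -> list[Dimension]:
    """Resolve a list name to its dimensions through the stored primary keys."""
    if list_name not in LISTS:
        raise KeyError(f"No dimension list named {list_name!r}")
    return [DIMENSIONS[pk] for pk in LISTS[list_name].dimension_pks]


selected = select_by_list_name("collaboration")
print([d.name for d in selected])
```

The point of the indirection is that a list stores only primary keys, so one dimension can belong to several lists without being duplicated.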
diff --git a/docs/pages/python_API/database/evaluation_dimensions.md b/docs/pages/python_API/database/evaluation_dimensions.md
@@ -0,0 +1,54 @@
+# `evaluation_dimensions.py`
+
+This module provides classes and utilities for defining and managing custom evaluation dimensions within the Sotopia environment. It includes classes for individual dimensions, lists of dimensions, and a builder for creating dimension models.
+
+## Classes
+
+### `CustomEvaluationDimension`
+
+Represents a custom evaluation dimension with specific attributes such as name, description, and score range.
+
+#### Attributes
+- `name`: `str`. The name of the dimension.
+- `description`: `str`. A brief description of the dimension.
+- `range_low`: `int`. The minimum score for the dimension.
+- `range_high`: `int`. The maximum score for the dimension.
+
+### `CustomEvaluationDimensionList`
+
+Groups multiple custom evaluation dimensions together.
+
+#### Attributes
+- `name`: `str`. The name of the dimension list.
+- `dimension_pks`: `list[str]`. A list of primary keys for the dimensions included in the list.
+
+### `EvaluationDimensionBuilder`
+
+Provides utility methods to create and manage evaluation dimension models.
+
+#### Methods
+- `create_range_validator(low: int, high: int)`: Creates a validator for score ranges.
+
+  **Arguments:**
+  - `low`: `int`. The minimum score allowed.
+  - `high`: `int`. The maximum score allowed.
+
+- `build_dimension_model(dimension_ids: list[str])`: Builds a dimension model from primary keys.
+
+  **Arguments:**
+  - `dimension_ids`: `list[str]`. A list of dimension primary keys.
+
+- `build_dimension_model_from_dict(dimensions: list[dict[str, Union[str, int]]])`: Builds a dimension model from a list of dictionaries.
+
+  **Arguments:**
+  - `dimensions`: `list[dict[str, Union[str, int]]]`. A list of dictionaries specifying dimension attributes.
+
+- `select_existing_dimension_model_by_name(dimension_names: list[str])`: Selects a dimension model by dimension names.
+
+  **Arguments:**
+  - `dimension_names`: `list[str]`. A list of dimension names.
+
+- `select_existing_dimension_model_by_list_name(list_name: str)`: Selects a dimension model by list name.
+
+  **Arguments:**
+  - `list_name`: `str`. The name of the dimension list.
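How a score range can be enforced, and how a model can be assembled from a list of dimension dictionaries, can be sketched with the standard library alone. This is a hedged illustration rather than the module's code: it assumes each score is a `(reasoning, score)` tuple, and `make_range_validator` and `build_model_from_dicts` are hypothetical helpers built on `dataclasses.make_dataclass` instead of the library's own model machinery.

```python
from dataclasses import make_dataclass
from typing import Callable

Score = tuple[str, int]  # assumed shape: (reasoning, integer score)


def make_range_validator(low: int, high: int) -> Callable[[Score], Score]:
    """Return a function that accepts only 2-tuples whose score is an int in [low, high]."""

    def validate(value: Score) -> Score:
        if not isinstance(value, tuple) or len(value) != 2:
            raise ValueError("expected a (reasoning, score) tuple")
        score = value[1]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError("score must be an integer")
        if not low <= score <= high:
            raise ValueError(f"score must be between {low} and {high}")
        return value

    return validate


def build_model_from_dicts(dimensions: list[dict]) -> type:
    """Build a dataclass with one range-checked field per dimension dictionary."""
    validators = {
        d["name"]: make_range_validator(d["range_low"], d["range_high"])
        for d in dimensions
    }

    def __post_init__(self) -> None:
        # Run each field's range check once the instance is populated.
        for name, validate in validators.items():
            validate(getattr(self, name))

    return make_dataclass(
        "CustomDimensions",  # hypothetical model name
        [(d["name"], Score) for d in dimensions],
        namespace={"__post_init__": __post_init__},
    )


Model = build_model_from_dicts(
    [
        {
            "name": "believability",
            "description": "The believability of the interaction",
            "range_low": 0,
            "range_high": 10,
        },
    ]
)
ok = Model(believability=("consistent persona", 8))
```

In this sketch, constructing `Model(believability=("too high", 11))` raises `ValueError`, because the validator closes over the `range_low` and `range_high` values of that dimension.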
diff --git a/examples/experiment_eval.py b/examples/experiment_eval.py
@@ -17,6 +17,7 @@
     EnvAgentComboStorage,
     EnvironmentProfile,
     EpisodeLog,
+    EvaluationDimensionBuilder,
 )
 from sotopia.envs.evaluators import (
     EvaluationForTwoAgents,
@@ -34,6 +35,7 @@
 )
 from sotopia.server import run_async_server
 from sotopia_conf.gin_utils import parse_gin_flags, run
+# from sotopia.database import EvaluationDimensionBuilder
 
 _DEFAULT_GIN_SEARCH_PATHS = [
     os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
@@ -109,6 +111,18 @@ def _iterate_env_agent_combo_not_in_db(
     tag: str | None = None,
 ) -> Generator[EnvAgentCombo[Observation, AgentAction], None, None]:
     """We iterate over each environment and return the **first** env-agent combo that is not in the database."""
+    # loading evaluation metric
+    try:
+        evaluation_dimensions = EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
+            "sotopia"
+        )  # Initialize your customized dimension, please refer to `examples/use_custom_dimensions.py`
+    except Exception as e:
+        print(
+            "No customized evaluation dimensions found, using default SotopiaDimensions",
+            e,
+        )
+        evaluation_dimensions = SotopiaDimensions
+
     if not env_ids:
         env_ids = list(EnvironmentProfile.all_pks())
     for env_id in env_ids:
@@ -152,7 +166,8 @@ def _iterate_env_agent_combo_not_in_db(
                 terminal_evaluators=[
                     ReachGoalLLMEvaluator(
                         model_names["env"],
-                        EvaluationForTwoAgents[SotopiaDimensions],
+                        EvaluationForTwoAgents[evaluation_dimensions],  # type: ignore
+                        # TODO check how to do type annotation
                     ),
                 ],
             )