Commit 9e3651b

Merge branch 'demo' into feature/sotopia-demo-ui

XuhuiZhou committed Dec 15, 2024
2 parents 1d743c0 + ab6903a commit 9e3651b
Showing 30 changed files with 1,794 additions and 4,288 deletions.
116 changes: 116 additions & 0 deletions docs/pages/concepts/evaluation_dimension.md
@@ -0,0 +1,116 @@
## Overview

Evaluation dimensions are used to evaluate the quality of social interactions.
In the original Sotopia paper, there are 7 dimensions used to evaluate the quality of social interactions, which we collectively name the `sotopia` evaluation dimensions:
- believability
- relationship
- knowledge
- secret
- social rules
- financial and material benefits
- goal

`SotopiaDimensions` can be used directly without initializing the database. It provides a set of predefined evaluation dimensions that are ready to use for evaluating social interactions. For example:

```python
from sotopia.envs.parallel import ParallelSotopiaEnv
from sotopia.envs.evaluators import EvaluationForTwoAgents, ReachGoalLLMEvaluator, RuleBasedTerminatedEvaluator, SotopiaDimensions

env = ParallelSotopiaEnv(
    env_profile=env_profile,
    model_name=model_names["env"],
    action_order="round-robin",
    evaluators=[
        RuleBasedTerminatedEvaluator(max_turn_number=20, max_stale_turn=2),
    ],
    terminal_evaluators=[
        ReachGoalLLMEvaluator(
            model_names["env"],
            EvaluationForTwoAgents[SotopiaDimensions],  # type: ignore
            # TODO check how to do type annotation
        ),
    ],
)
```


However, we observe that in many use cases people want to evaluate with customized metrics, so we provide a way to build custom evaluation dimensions.
For a quick reference, you can directly check out the `examples/use_custom_dimensions.py`.

### CustomEvaluationDimension
The [`CustomEvaluationDimension`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension.
There are four parameters:
- `name`: the name of the dimension
- `description`: the description of the dimension
- `range_low`: the minimum score of the dimension (an integer)
- `range_high`: the maximum score of the dimension (an integer)
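
For example, a minimal sketch of defining and saving a dimension. This assumes `CustomEvaluationDimension` is imported from `sotopia.database` and persisted with the redis-om style `.save()` (which requires a running Redis instance); the dimension name, description, and range here are purely illustrative.

```python
from sotopia.database import CustomEvaluationDimension

# Illustrative dimension; any name/description/range of your own works the same way.
transactivity = CustomEvaluationDimension(
    name="transactivity",
    description="The extent to which participants build on each other's contributions",
    range_low=0,
    range_high=10,
)
transactivity.save()  # assumes a running Redis instance behind sotopia.database
```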

### CustomEvaluationDimensionList
The [`CustomEvaluationDimensionList`](/python_API/database/evaluation_dimensions) is a class that can be used to create a custom evaluation dimension list based on the existing dimensions. It helps one to group multiple dimensions together for a specific use case.
There are two parameters:
- `name`: the name of the dimension list
- `dimension_pks`: the primary keys of the dimensions in the dimension list
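
For example, a sketch of grouping saved dimensions into a named list. This again assumes the redis-om style `.pk` and `.save()` interface; the list name and dimension values are illustrative.

```python
from sotopia.database import CustomEvaluationDimension, CustomEvaluationDimensionList

# Save a dimension first (illustrative), then reference it by primary key.
transactivity = CustomEvaluationDimension(
    name="transactivity",
    description="The extent to which participants build on each other's contributions",
    range_low=0,
    range_high=10,
)
transactivity.save()

collaboration_list = CustomEvaluationDimensionList(
    name="collaboration",
    dimension_pks=[transactivity.pk],
)
collaboration_list.save()
```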

### EvaluationDimensionBuilder
The [`EvaluationDimensionBuilder`](/python_API/database/evaluation_dimensions) is a class that can be used to generate a custom evaluation dimension model based on the existing dimensions.


## Usage
### Initialize the database
The default evaluation metric is still `SotopiaDimensions` in `sotopia.envs.evaluators`. There are no `CustomEvaluationDimension` entries in the database by default. To initialize the database, please refer to `examples/use_custom_dimensions.py`.
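
To check what ended up in the database after initialization, here is a small sketch, assuming the same redis-om interface (`all_pks()`, `get()`) that other Sotopia models such as `EnvironmentProfile` expose:

```python
from sotopia.database import CustomEvaluationDimension, CustomEvaluationDimensionList

# List every custom dimension currently stored in Redis.
for pk in CustomEvaluationDimension.all_pks():
    dim = CustomEvaluationDimension.get(pk)
    print(f"{dim.name}: [{dim.range_low}, {dim.range_high}] - {dim.description}")

# And every named dimension list.
for pk in CustomEvaluationDimensionList.all_pks():
    dim_list = CustomEvaluationDimensionList.get(pk)
    print(f"{dim_list.name}: {dim_list.dimension_pks}")
```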


### Use the custom evaluation dimensions
After you initialize your customized evaluation dimensions, you can use any of the methods below:

#### Method 1: Choose dimensions by names
```python
evaluation_dimensions = (
    EvaluationDimensionBuilder.select_existing_dimension_model_by_name(
        ["transactivity", "verbal_equity"]
    )
)
```

#### Method 2: Directly choose the grouped evaluation dimension list
```python
evaluation_dimensions = (
    EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
        "sotopia"
    )
)
```

#### Method 3: Build a custom evaluation dimension model temporarily
We provide multiple ways to build a custom evaluation dimension model with `EvaluationDimensionBuilder`, specifically:
- `generate_dimension_model`: build an evaluation dimension model from existing dimension primary keys.
- `generate_dimension_model_from_dict`: build an evaluation dimension model from a list of dictionaries, each specifying the parameters of a `CustomEvaluationDimension` (a usage sketch follows this list). For example:
```json
[
{
"name": "believability",
"description": "The believability of the interaction",
"range_low": 0,
"range_high": 10
},
...
]
```
- `select_existing_dimension_model_by_name`: build an evaluation dimension model from existing dimension names, for example `['believability', 'goal']`.
- `select_existing_dimension_model_by_list_name`: build an evaluation dimension model from an existing `CustomEvaluationDimensionList` name, for example `sotopia`.
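
For instance, here is a hedged sketch of the dictionary-based builder in use. It follows the method name listed above; the exact signature (e.g. whether the argument is positional or the keyword `dimensions=`) is an assumption, so adapt it to your installed version.

```python
from sotopia.database import EvaluationDimensionBuilder

# Build a temporary pydantic evaluation model from dictionaries, without
# saving anything to the database.
evaluation_dimensions = EvaluationDimensionBuilder.generate_dimension_model_from_dict(
    dimensions=[
        {
            "name": "believability",
            "description": "The believability of the interaction",
            "range_low": 0,
            "range_high": 10,
        },
        {
            "name": "goal",
            "description": "How well the agent achieves its social goal",
            "range_low": 0,
            "range_high": 10,
        },
    ]
)
```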


After you get the evaluation dimension model, you can pass it as a parameter to the evaluator, for example:
```python
evaluation_dimensions = (
    EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
        "sotopia"
    )
)
terminal_evaluators=[
    ReachGoalLLMEvaluator(
        model_names["env"],
        EvaluationForTwoAgents[evaluation_dimensions],  # type: ignore
    ),
],
```
54 changes: 54 additions & 0 deletions docs/pages/python_API/database/evaluation_dimensions.md
@@ -0,0 +1,54 @@
# `evaluation_dimensions.py`

This module provides classes and utilities for defining and managing custom evaluation dimensions within the Sotopia environment. It includes classes for individual dimensions, lists of dimensions, and a builder for creating dimension models.

## Classes

### `CustomEvaluationDimension`

Represents a custom evaluation dimension with specific attributes such as name, description, and score range.

#### Attributes
- `name`: `str`. The name of the dimension.
- `description`: `str`. A brief description of the dimension.
- `range_low`: `int`. The minimum score for the dimension.
- `range_high`: `int`. The maximum score for the dimension.

### `CustomEvaluationDimensionList`

Groups multiple custom evaluation dimensions together.

#### Attributes
- `name`: `str`. The name of the dimension list.
- `dimension_pks`: `list[str]`. A list of primary keys for the dimensions included in the list.

### `EvaluationDimensionBuilder`

Provides utility methods to create and manage evaluation dimension models.

#### Methods
- `create_range_validator(low: int, high: int)`: Creates a validator for score ranges.

**Arguments:**
- `low`: `int`. The minimum score allowed.
- `high`: `int`. The maximum score allowed.

- `build_dimension_model(dimension_ids: list[str])`: Builds a dimension model from primary keys.

**Arguments:**
- `dimension_ids`: `list[str]`. A list of dimension primary keys.

- `build_dimension_model_from_dict(dimensions: list[dict[str, Union[str, int]]])`: Builds a dimension model from a list of dictionaries, one per dimension.

**Arguments:**
- `dimensions`: `list[dict[str, Union[str, int]]]`. A list of dictionaries specifying dimension attributes.

- `select_existing_dimension_model_by_name(dimension_names: list[str])`: Selects a dimension model by dimension names.

**Arguments:**
- `dimension_names`: `list[str]`. A list of dimension names.

- `select_existing_dimension_model_by_list_name(list_name: str)`: Selects a dimension model by list name.

**Arguments:**
- `list_name`: `str`. The name of the dimension list.
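
#### Example

A brief, illustrative usage sketch, assuming the classes are importable from `sotopia.database` and that custom dimensions have already been saved to Redis:

```python
from sotopia.database import CustomEvaluationDimension, EvaluationDimensionBuilder

# Build a pydantic evaluation model from every saved dimension; the resulting
# model can be used as the type argument of EvaluationForTwoAgents.
dimension_pks = list(CustomEvaluationDimension.all_pks())
dimension_model = EvaluationDimensionBuilder.build_dimension_model(dimension_pks)
```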
17 changes: 16 additions & 1 deletion examples/experiment_eval.py
@@ -17,6 +17,7 @@
    EnvAgentComboStorage,
    EnvironmentProfile,
    EpisodeLog,
    EvaluationDimensionBuilder,
)
from sotopia.envs.evaluators import (
    EvaluationForTwoAgents,
@@ -34,6 +35,7 @@
)
from sotopia.server import run_async_server
from sotopia_conf.gin_utils import parse_gin_flags, run
# from sotopia.database import EvaluationDimensionBuilder

_DEFAULT_GIN_SEARCH_PATHS = [
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
@@ -109,6 +111,18 @@ def _iterate_env_agent_combo_not_in_db(
    tag: str | None = None,
) -> Generator[EnvAgentCombo[Observation, AgentAction], None, None]:
    """We iterate over each environment and return the **first** env-agent combo that is not in the database."""
    # loading evaluation metric
    try:
        evaluation_dimensions = EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
            "sotopia"
        )  # Initialize your customized dimension, please refer to `examples/use_custom_dimensions.py`
    except Exception as e:
        print(
            "No customized evaluation dimensions found, using default SotopiaDimensions",
            e,
        )
        evaluation_dimensions = SotopiaDimensions

    if not env_ids:
        env_ids = list(EnvironmentProfile.all_pks())
    for env_id in env_ids:
@@ -152,7 +166,8 @@ def _iterate_env_agent_combo_not_in_db(
terminal_evaluators=[
ReachGoalLLMEvaluator(
model_names["env"],
EvaluationForTwoAgents[SotopiaDimensions],
EvaluationForTwoAgents[evaluation_dimensions], # type: ignore
# TODO check how to do type annotation
),
],
)
2 changes: 2 additions & 0 deletions examples/experimental/nodes/initial_message_node.py
@@ -18,6 +18,7 @@ def __init__(
        input_tick_channel: str,
        output_channels: list[str],
        env_scenario: str,
        node_name: str,
        redis_url: str = "redis://localhost:6379/0",
    ):
        super().__init__(
@@ -26,6 +27,7 @@
                (output_channel, Text) for output_channel in output_channels
            ],
            redis_url=redis_url,
            node_name=node_name,
        )
        self.env_scenario = env_scenario
        self.output_channels = output_channels
113 changes: 113 additions & 0 deletions examples/experimental/sotopia_original_replica/llm_agent_sotopia.py
@@ -0,0 +1,113 @@
import logging
import sys
from rich.logging import RichHandler

from aact import NodeFactory

from sotopia.experimental.agents.base_agent import BaseAgent
from sotopia.experimental.agents.datamodels import Observation, AgentAction

from sotopia.generation_utils import agenerate
from sotopia.generation_utils.generate import StrOutputParser

# Check Python version
if sys.version_info >= (3, 11):
    pass
else:
    pass

# Configure logging
FORMAT = "%(asctime)s - %(levelname)s - %(name)s - %(message)s"
logging.basicConfig(
    level=logging.WARNING,
    format=FORMAT,
    datefmt="[%X]",
    handlers=[RichHandler()],
)


@NodeFactory.register("llm_agent")
class LLMAgent(BaseAgent[Observation, AgentAction]):
    def __init__(
        self,
        input_channels: list[str],
        output_channel: str,
        query_interval: int,
        agent_name: str,
        node_name: str,
        goal: str,
        model_name: str,
        redis_url: str,
    ):
        super().__init__(
            [(input_channel, Observation) for input_channel in input_channels],
            [(output_channel, AgentAction)],
            redis_url,
            node_name,
        )
        self.output_channel = output_channel
        self.query_interval = query_interval
        self.count_ticks = 0
        self.message_history: list[Observation] = []
        self.name = agent_name
        self.model_name = model_name
        self.goal = goal

    def _format_message_history(self, message_history: list[Observation]) -> str:
        ## TODO: akhatua Fix the mapping of action to be grammatically correct
        return "\n".join(message.to_natural_language() for message in message_history)

    async def aact(self, obs: Observation) -> AgentAction:
        # At the initialization turn (-1), reply with the model name instead of acting.
        if obs.turn_number == -1:
            return AgentAction(
                agent_name=self.name,
                output_channel=self.output_channel,
                action_type="none",
                argument=self.model_name,
            )

        self.message_history.append(obs)

        if len(obs.available_actions) == 1 and "none" in obs.available_actions:
            return AgentAction(
                agent_name=self.name,
                output_channel=self.output_channel,
                action_type="none",
                argument="",
            )
        elif len(obs.available_actions) == 1 and "leave" in obs.available_actions:
            self.shutdown_event.set()
            return AgentAction(
                agent_name=self.name,
                output_channel=self.output_channel,
                action_type="leave",
                argument="",
            )
        else:
            # Otherwise, generate a (possibly empty) utterance from the LLM.
            history = self._format_message_history(self.message_history)
            action: str = await agenerate(
                model_name=self.model_name,
                template="Imagine that you are a friend of the other person. Here is the "
                "conversation between you and them.\n"
                "You are {agent_name} in the conversation.\n"
                "{message_history}\n"
                "and you plan to {goal}.\n"
                "You can choose to interrupt the other person "
                "by saying something, or not to interrupt by outputting nothing. What would you say? "
                "Please only output a sentence or output nothing at all."
                "{format_instructions}",
                input_values={
                    "message_history": history,
                    "goal": self.goal,
                    "agent_name": self.name,
                },
                temperature=0.7,
                output_parser=StrOutputParser(),
            )

            return AgentAction(
                agent_name=self.name,
                output_channel=self.output_channel,
                action_type="speak",
                argument=action,
            )