From ff994b5ae74e491ecd6f35ae118b353d19ad9a86 Mon Sep 17 00:00:00 2001 From: James Aung <129281094+james-aung@users.noreply.github.com> Date: Tue, 19 Mar 2024 07:59:17 -0700 Subject: [PATCH] Add In-Context RL eval (#1491) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit # Thank you for contributing an eval! ♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name In-Context RL ### Eval description We evaluate the ability to solve RL environments simply by interacting with them in-context, without dedicated training or fine-tuning. ### What makes this a useful eval? AI R&D ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. 
In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies.
- [ ] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `mypy`, `black`, `isort`, `autoflake` and `ruff` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON ### Eval ```jsonl INSERT_EVAL_HERE ```
--- evals/elsuite/incontext_rl/README.md | 74 ++++ evals/elsuite/incontext_rl/anti-cot_solver.py | 38 ++ evals/elsuite/incontext_rl/baselines.py | 118 +++++ evals/elsuite/incontext_rl/defaults.py | 30 ++ evals/elsuite/incontext_rl/env_setup.py | 12 + evals/elsuite/incontext_rl/eval.py | 299 +++++++++++++ evals/elsuite/incontext_rl/requirements.txt | 3 + .../incontext_rl/scripts/plot_experiments.py | 363 ++++++++++++++++ .../scripts/qlearning_baseline.ipynb | 402 ++++++++++++++++++ .../incontext_rl/scripts/run_experiments.sh | 39 ++ .../registry/data/incontext_rl/samples.jsonl | 3 + .../data/incontext_rl/samples_dev.jsonl | 3 + .../incontext_rl/samples_gymnasium_only.jsonl | 3 + evals/registry/evals/incontext_rl.yaml | 62 +++ evals/registry/solvers/incontext_rl.yaml | 27 ++ pyproject.toml | 1 + 16 files changed, 1477 insertions(+) create mode 100644 evals/elsuite/incontext_rl/README.md create mode 100644 evals/elsuite/incontext_rl/anti-cot_solver.py create mode 100644 evals/elsuite/incontext_rl/baselines.py create mode 100644 evals/elsuite/incontext_rl/defaults.py create mode 100644 evals/elsuite/incontext_rl/env_setup.py create mode 100644 evals/elsuite/incontext_rl/eval.py create mode 100644 evals/elsuite/incontext_rl/requirements.txt create mode 100644 evals/elsuite/incontext_rl/scripts/plot_experiments.py create mode 100644 evals/elsuite/incontext_rl/scripts/qlearning_baseline.ipynb create mode 100755 evals/elsuite/incontext_rl/scripts/run_experiments.sh create mode 100644 evals/registry/data/incontext_rl/samples.jsonl create mode 100644 evals/registry/data/incontext_rl/samples_dev.jsonl create mode 100644 evals/registry/data/incontext_rl/samples_gymnasium_only.jsonl create mode 100644 evals/registry/evals/incontext_rl.yaml create mode 100644 evals/registry/solvers/incontext_rl.yaml diff --git a/evals/elsuite/incontext_rl/README.md b/evals/elsuite/incontext_rl/README.md new file mode 100644 index 0000000000..69dbde303b --- /dev/null +++ 
b/evals/elsuite/incontext_rl/README.md @@ -0,0 +1,74 @@ +# In-Context RL + +This eval tests models' ability to solve RL environments simply by interacting with them in-context, without dedicated training or fine-tuning. + +## Usage + +Run with: + +```bash +oaieval incontext_rl +``` + +For examples of tested solvers, see [`./scripts/run_experiments.sh`](./scripts/run_experiments.sh). + +## Dataset + +The eval is currently set up to test models on the following canonical RL environments: +1. [FrozenLake-v1](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) (non-slippery version, default map), 4x4 gridworld where the agent has to reach the goal without falling into traps. +2. [CliffWalking-v0](https://gymnasium.farama.org/environments/toy_text/cliff_walking/). 4x12 gridworld where the agent has to reach the other side of the map without falling off a cliff. +3. [BanditTwoArmedHighLowFixed-v1](https://github.com/james-aung/gymasium-bandits). Stochastic two-armed bandit setup where Arm 1 pays out 80% of the time with reward 1, and Arm 2 pays out 20% of the time with reward 1. +4. [BanditTenArmedRandomFixed-v1](https://github.com/james-aung/gymasium-bandits). Stochastic ten-armed bandit setup where each arm has some randomly-initialized probability of payout. + +Besides these four environments, our eval is also built to be compatible with any environments that have discrete action and observation spaces using the Gymnasium API. Future work may generalize our eval to work with environments with other types of action/observation spaces. + +## Evaluation Process + +Each run of the eval tests the model on all four environments in the dataset, and has the model take steps in each environment until 200 steps are taken or the model’s context limit is reached. + +At each step, the eval provides the following to the model: +- The next observation and the reward from the last action. 
The model is also told when the environment has reset due to its action leading to a termination. +- How many of the maximum number of steps it has already taken. +- The total reward it has accumulated so far across all episodes. + +If an episode ends, the environment resets and a new episode begins. + +If the eval receives 4 responses in a row where we cannot parse an action selection, we end the evaluation for that environment. (This provides a natural end for runs where the model’s context window is exceeded.) + + +## Prompts + +We refer readers to the [`./defaults.py`](./defaults.py) file for the `TASK_DESCRIPTION` and other prompts used in the eval. + +## Metrics + +We provide the following metrics per evaluated environment: + +| **Metric** | **Notes** | +|---|---| +| `average_episode_reward` | The average reward achieved per episode | +| `total_steps` | The number of steps taken across all episodes before the environment sample ended | +| `invalid_response_rate` | % of responses that were in an invalid format for the eval | + + +## Token Usage Estimates + + +| Model | Token Usage Per Run | +|---|---| +| **gpt-3.5-turbo** | 4200000 ± 400000 | +| **gpt-4-turbo-preview** | 21900000 ± 10100000 | +| **mixtral-8x7b** | 2700000 ± 800000 | + + +## Future modifications + +- Extend the eval to work with other observation and action spaces beyond Discrete spaces + +## Version History + +- v0: Initial version released + +## Contribution Statement + +Eval design, implementation, and results evaluation were primarily conducted by James Aung. Chan Jun Shern was responsible for code reviews throughout the implementation process, along with fine-grained feedback on the project in general. Additional guidance was provided by Steven Adler, who scoped and managed the broader research project, including input on evaluation design, results analysis, and interpretation. 
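As a rough illustration of how the per-episode metrics in the README's Metrics table could be derived from a flat list of step rewards, here is a minimal sketch. It assumes, as the `_calculate_episode_rewards` helper in `eval.py` of this patch does, that `episode_end_steps` holds the cumulative step count at which each episode ended. The function names `episode_rewards` and `invalid_response_rate` are illustrative only and are not part of the eval's API.

```python
from statistics import mean
from typing import List


def episode_rewards(episode_end_steps: List[int], rewards: List[float]) -> List[float]:
    """Sum per-step rewards into per-episode totals.

    Assumption: episode_end_steps[i] is the number of steps taken so far
    when episode i ended, so rewards[start:end] covers exactly that episode.
    """
    if not episode_end_steps:
        # No episode finished: treat the whole run as a single episode.
        return [sum(rewards)]
    totals = []
    start = 0
    for end in episode_end_steps:
        totals.append(sum(rewards[start:end]))
        start = end
    # Steps taken after the last recorded episode end are not counted here.
    return totals


def invalid_response_rate(invalid: int, total: int) -> float:
    """Fraction of responses that could not be parsed; 0.0 if none were made."""
    return invalid / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical run: two finished episodes, then one unfinished step.
    step_rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
    per_episode = episode_rewards([3, 5], step_rewards)
    print(per_episode, mean(per_episode))
    print(invalid_response_rate(1, 4))
```

Note that in this sketch, as in the helper it imitates, rewards collected after the final recorded episode end do not contribute to the per-episode average; whether that is the desired behaviour for a given analysis is a judgment call.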
\ No newline at end of file diff --git a/evals/elsuite/incontext_rl/anti-cot_solver.py b/evals/elsuite/incontext_rl/anti-cot_solver.py new file mode 100644 index 0000000000..40b3997e3c --- /dev/null +++ b/evals/elsuite/incontext_rl/anti-cot_solver.py @@ -0,0 +1,38 @@ +from typing import Any +from evals.solvers.solver import NestedSolver, Solver, SolverResult, SolverSpec +from evals.task_state import Message, TaskState + +ANTI_COT_TEMPLATE = "RESPOND ONLY WITH YOUR FINAL ANSWER IN THE FORMAT REQUESTED. DO NOT OUTPUT ANY ADDITIONAL REASONING OR TEXT." + +class AntiCoTSolver(NestedSolver): + """ + Instructs the model to not do any further reasoning and just respond with the final answer. + """ + + def __init__( + self, + solver: SolverSpec, + registry: Any = None, + ): + super().__init__(solver=solver) + + @property + def solver(self) -> Solver: + return self.get_solver("solver") + + def _solve( + self, + task_state: TaskState, + **kwargs, + ) -> SolverResult: + task_state.messages += ( + [ + Message(role="system", content=ANTI_COT_TEMPLATE), + ] + ) + solver_result = self.solver(task_state=task_state, **kwargs) + return solver_result + + @property + def name(self) -> str: + return f"Anti-CoT_{self.solver.name}" diff --git a/evals/elsuite/incontext_rl/baselines.py b/evals/elsuite/incontext_rl/baselines.py new file mode 100644 index 0000000000..34f65d6caf --- /dev/null +++ b/evals/elsuite/incontext_rl/baselines.py @@ -0,0 +1,118 @@ +import random + +import numpy as np + +from evals.elsuite.incontext_rl.eval import CurrentState +from evals.record import record_sampling +from evals.solvers.solver import Solver, SolverResult +from evals.task_state import TaskState + + +class RandomSolver(Solver): + def __init__(self, *args, **kwargs): + pass + + def _solve( + self, + task_state: TaskState, + **kwargs, + ) -> SolverResult: + + cs: CurrentState = task_state.current_state + + try: + action = cs.action_space.sample() + response = f"[SELECT: {action}]" + except Exception as e: 
+ response = f"Error: {e}" + + record_sampling( + prompt=cs.observations[-1], + sampled=response, + model="incontext_rl_random", + ) + + return SolverResult(response) + + +class QlearningSolver(Solver): + def __init__( + self, + learning_rate=0.7, + gamma=0.95, + epsilon=1.0, + min_epsilon=0.05, + max_epsilon=1.0, + decay_rate=0.0005, + *args, + **kwargs, + ): + super().__init__(*args, **kwargs) + self.learning_rate = learning_rate + self.gamma = gamma + self.epsilon = epsilon + self.min_epsilon = min_epsilon + self.max_epsilon = max_epsilon + self.decay_rate = decay_rate + self.q_table = None + + def initialize_q_table(self, observation_space_size, action_space_size): + self.q_table = np.zeros((observation_space_size, action_space_size)) + + def select_action(self, state, action_space): + if random.uniform(0, 1) < self.epsilon: + return action_space.sample() # Explore action space + else: + return np.argmax(self.q_table[state][:]) # Exploit learned values + + def update_q_table(self, state, action, reward, next_state): + next_max = np.max(self.q_table[next_state]) + self.q_table[state, action] = self.q_table[state, action] + self.learning_rate * ( + reward + self.gamma * next_max - self.q_table[state, action] + ) + + def reduce_epsilon(self, episode_number): + self.epsilon = self.min_epsilon + (self.max_epsilon - self.min_epsilon) * np.exp( + -self.decay_rate * episode_number + ) + + def _solve(self, task_state: TaskState, **kwargs) -> SolverResult: + + cs: CurrentState = task_state.current_state + + # TODO these might not be true if environment is not discrete + assert ( + cs.observation_space_n is not None + ), "Environment must have discrete observation space" + assert cs.action_space_n is not None, "Environment must have discrete action space" + + if self.q_table is None: + print("Initializing Q-table") + self.initialize_q_table( + observation_space_size=cs.observation_space_n, action_space_size=cs.action_space_n + ) + + # This shouldn't run on the first step + 
if len(cs.actions) >= 1 and len(cs.rewards) >= 1 and len(cs.observations) >= 2: + print(cs.actions) + self.update_q_table( + state=cs.observations[-2], + action=cs.actions[-1], + reward=cs.rewards[-1], + next_state=cs.observations[-1], + ) + print( + f"The last action {cs.actions[-1]} resulted in reward {cs.rewards[-1]}. We went from state {cs.observations[-2]} to state {cs.observations[-1]}" + ) + self.reduce_epsilon(episode_number=len(cs.episode_end_steps)) + + action = self.select_action(state=cs.observations[-1], action_space=cs.action_space) + response = f"[SELECT: {action}]" + + record_sampling( + prompt=cs.observations[-1], + sampled=response, + model="incontext_rl_qlearning", + ) + + return SolverResult(response) diff --git a/evals/elsuite/incontext_rl/defaults.py b/evals/elsuite/incontext_rl/defaults.py new file mode 100644 index 0000000000..2f1cb06fc2 --- /dev/null +++ b/evals/elsuite/incontext_rl/defaults.py @@ -0,0 +1,30 @@ +from string import Template + +task_description_template = Template("""$explanations + +You have the following $action_space actions available to you: $actions. + +On each turn, you will receive an observation from the environment, which may be one of the following $observation_space possible observations: +$observations +Each action you take will give you a reward (which might be negative) and may move you to a new state. + +You should take actions such that you accumulate the highest possible reward across all your actions. + +This is how we will play the game: +1. You will consider all the information you have received to decide upon the best action you can take now. +2. You will select your action from the list above by specifying the number key of the action in the command [SELECT: x], where x is the number key of the action. +3. Your selected action will be taken. +4. As a result of your action, you will be given an observation from the environment and you may receive some reward. +5. Repeat from step 1. 
+""") + +step_counter = Template("Total actions taken so far: $step_count") +reward_counter = Template("Total reward so far: $reward_count") +reset_msg = Template("""After the game reset you are now in $observation. +Please pick an action, providing your reasoning. You must format your final action choice as [SELECT: x]""") +step_result = Template("""You took Action $action. You are now in $next_observation. +The last step you did provided reward: $reward. +Please pick an action, providing your reasoning. You must format your final action choice as [SELECT: x]""") +step_result_reset = Template("""You took Action $action. You arrived at $next_observation. +The last step made the game reset. +The last step you did provided reward: $reward.""") \ No newline at end of file diff --git a/evals/elsuite/incontext_rl/env_setup.py b/evals/elsuite/incontext_rl/env_setup.py new file mode 100644 index 0000000000..31ffcba534 --- /dev/null +++ b/evals/elsuite/incontext_rl/env_setup.py @@ -0,0 +1,12 @@ +""" +Optional setup scripts for specific environments. 
+""" + +def setup_GymnasiumBandits(): + import gymnasium_bandits + return + +ENV_SETUP_FUNCS = { + "BanditTwoArmedHighLowFixed-v0": setup_GymnasiumBandits, + "BanditTenArmedRandomFixed-v0": setup_GymnasiumBandits, +} \ No newline at end of file diff --git a/evals/elsuite/incontext_rl/eval.py b/evals/elsuite/incontext_rl/eval.py new file mode 100644 index 0000000000..a1fac2101e --- /dev/null +++ b/evals/elsuite/incontext_rl/eval.py @@ -0,0 +1,299 @@ +import logging +import random +import re +from dataclasses import dataclass, field +from typing import Any, List, Optional + +import gymnasium as gym +import numpy as np + +import evals +from evals.api import CompletionFn +from evals.elsuite.incontext_rl.defaults import ( + reset_msg, + reward_counter, + step_counter, + step_result, + step_result_reset, + task_description_template, +) +from evals.elsuite.incontext_rl.env_setup import ENV_SETUP_FUNCS +from evals.eval import SolverEval +from evals.solvers.solver import Solver +from evals.task_state import Message, TaskState + +logger = logging.getLogger(__name__) + + +@dataclass +class CurrentState: + action_space: gym.Space + observation_space: gym.Space + action_space_n: int + observation_space_n: int + invalid_responses: int = 0 + total_responses: int = 0 + actions: List = field(default_factory=list) + rewards: List[float] = field(default_factory=list) + observations: List = field(default_factory=list) + episode_end_steps: List[int] = field(default_factory=list) + + +class InContextRl(SolverEval): + def __init__( + self, + completion_fns: list[CompletionFn], + max_steps: int = 200, # maximum possible steps per sample, optional + max_invalid_responses: int = 4, # maximum invalid responses from Solver before terminating sample + max_num_messages_allowed: int = 2048, # maximum number of messages allowed by OpenAI API + use_explanations: bool = False, # Whether to include a key for how to understand action and observation spaces + *args, + **kwargs, + ): + 
super().__init__(completion_fns, *args, **kwargs) + self.max_steps = max_steps + self.max_invalid_responses = max_invalid_responses + self.use_explanations = use_explanations + self.max_num_messages_allowed = max_num_messages_allowed + + def eval_sample(self, solver: Solver, sample: Any, rng: random.Random): + + # Validate sample + required_keys = ["env", "env_id", "explanations"] + assert all( + key in sample for key in required_keys + ), f"Sample missing required keys: {required_keys}" + assert isinstance(sample["env"], gym.Env) + assert isinstance(sample["env_id"], str) + assert isinstance(sample["explanations"], str) + + env = sample["env"] + ts = TaskState( + task_description=self._generate_task_description(env, sample), + messages=[], + current_state=CurrentState( + action_space=env.action_space, + observation_space=env.observation_space, + action_space_n=env.action_space.n, # TODO might not be available for all envs, check when adding a continuous env + observation_space_n=env.observation_space.n, # TODO might not be available for all envs, check when adding a continuous env + ), + ) + + # Reset environment and update task state + observation, _ = env.reset(seed=42) + ts.current_state.observations.append(observation) + + # Tell model starting observation and ask it to pick an action + self._add_reset_message_to_task_state(ts, observation, sample) + + for _ in range(self.max_steps): + self._add_recap_message_to_task_state( + ts, ts.current_state.actions, ts.current_state.rewards + ) + + action = self._try_get_valid_action(solver, ts, env.action_space.n) + + if action is None: + logger.info("Ending sample since couldn't parse an action.") + break + else: + next_observation, reward, terminated, truncated, _ = env.step(action) + ts.current_state.actions.append(action) + ts.current_state.rewards.append(float(reward)) + ts.current_state.observations.append(next_observation) + + if terminated or truncated: + # Tell model that episode ended and what reward was 
received + content = self._format_step_message( + action, next_observation, reward, sample, terminated=True + ) + ts.messages += [Message(role="user", content=content)] + + # Log what step the episode ended on + ts.current_state.episode_end_steps.append(len(ts.current_state.actions)) + + # Reset environment + observation, _ = env.reset(seed=42) + ts.current_state.observations.append(observation) + + # Tell model new observation after reset and ask it to pick an action + self._add_reset_message_to_task_state(ts, observation, sample) + else: + content = self._format_step_message(action, next_observation, reward, sample) + ts.messages += [Message(role="user", content=content)] + + env.close() + + episode_rewards = self._calculate_episode_rewards( + ts.current_state.episode_end_steps, ts.current_state.rewards + ) + evals.record.record_metrics( + environment=f"{env.spec.id} {env.spec.kwargs}", + explanations=self.use_explanations, + total_return=sum(ts.current_state.rewards), + total_steps=len(ts.current_state.actions), + actions=ts.current_state.actions, + rewards=ts.current_state.rewards, + episode_rewards=episode_rewards, + average_episode_reward=float(np.mean(episode_rewards)), + average_reward_last_5_episodes=float(np.mean(episode_rewards[-5:])), + average_reward_last_10_episodes=float(np.mean(episode_rewards[-10:])), + average_reward_last_20_episodes=float(np.mean(episode_rewards[-20:])), + average_reward_last_50_episodes=float(np.mean(episode_rewards[-50:])), + invalid_response_rate=ts.current_state.invalid_responses + / ts.current_state.total_responses + if ts.current_state.total_responses > 0 + else 0, + episode_end_steps=ts.current_state.episode_end_steps, + ) + + def run(self, recorder: evals.record.Recorder): + samples = self.get_samples() + for sample in samples: + # Create environments and pass them to each thread via the sample + # (gym envs don't like being created in the thread itself) + sample["env"] = self._make_env(sample) + 
self.eval_all_samples(recorder, samples) + + metrics = recorder.get_metrics() + + results = [] + + for metric in metrics: + env_result = { + "env": metric["environment"], + "metrics": { + "explanations": metric["explanations"], + "average_episode_reward": metric["average_episode_reward"], + "average_reward_last_5_episodes": metric["average_reward_last_5_episodes"], + "average_reward_last_10_episodes": metric["average_reward_last_10_episodes"], + "average_reward_last_20_episodes": metric["average_reward_last_20_episodes"], + "average_reward_last_50_episodes": metric["average_reward_last_50_episodes"], + "episode_rewards": metric["episode_rewards"], + "total_return": metric["total_return"], + "total_steps": metric["total_steps"], + "actions": metric["actions"], + "rewards": metric["rewards"], + "invalid_response_rate": metric["invalid_response_rate"], + "episode_end_steps": metric["episode_end_steps"], + }, + } + results.append(env_result) + + final_result = {"environments": results} + return final_result + + def _make_env(self, sample: dict) -> gym.Env: + env_id = sample["env_id"] + env_args = sample.get("env_args", {}) + if env_id in ENV_SETUP_FUNCS: + # Optional setup scripts for specific environments + ENV_SETUP_FUNCS[env_id]() + return gym.make(env_id, **env_args) + + def _generate_task_description(self, env: gym.Env, sample: dict) -> str: + + actions = [str(action) for action in range(env.action_space.n)] + observations = [ + f"Observation {observation}" for observation in range(env.observation_space.n) + ] + explanations = ( + sample["explanations"] if self.use_explanations else "You are playing a game." 
+ ) + + return task_description_template.substitute( + action_space=env.action_space.n, + actions=actions, + observation_space=env.observation_space.n, + observations=observations, + explanations=explanations, + ) + + def _try_get_valid_action( + self, solver: Solver, task_state: TaskState, action_space: int + ) -> Optional[int]: + number_of_attempts = 0 + while number_of_attempts < self.max_invalid_responses: + if len(task_state.messages) > self.max_num_messages_allowed: + logger.info( + f"Exceeded maximum number of messages allowed ({self.max_num_messages_allowed})." + ) + return None + solver_response = solver(task_state).output + action = self._parse_action(solver_response) + task_state.messages += [Message(role="assistant", content=solver_response)] + task_state.current_state.total_responses += 1 + # Check if action is valid + if action not in range( + action_space + ): # TODO this might not work for non-discrete action spaces, check with more complex env + task_state.messages += [ + Message( + role="user", + content="Invalid action. Please provide ONE valid action by outputting your selection in the format [SELECT: x]. 
Only output this selection ONCE.", + ) + ] + task_state.current_state.invalid_responses += 1 + number_of_attempts += 1 + else: + return action + # If the loop exits due to reaching max invalid attempts, log and return None + logger.info(f"Exceeded maximum invalid action attempts ({self.max_invalid_responses}).") + return None + + def _parse_action(self, raw_response: str) -> Optional[int]: + pattern = r"\[SELECT: (\d+)\]" + matches = re.findall(pattern, raw_response) + + actions = [int(match) for match in matches] + if not actions: + logger.info(f"No action selections found in response: {raw_response}") + return None + if len(actions) > 1: + logger.info(f"Multiple action selections found in response: {raw_response}") + return None + return actions[0] + + def _add_message_to_task_state(self, task_state: TaskState, role: str, content: str) -> None: + """ + Adds a message to the task state, combining it with the previous message if they are from the same role. + """ + if task_state.messages and task_state.messages[-1].role == role: + task_state.messages[-1].content += "\n\n" + content + else: + task_state.messages.append(Message(role=role, content=content)) + + def _add_reset_message_to_task_state( + self, task_state: TaskState, observation: int, sample: dict + ) -> None: + content = reset_msg.substitute(observation=f"Observation {observation}") + self._add_message_to_task_state(task_state, "user", content) + + def _add_recap_message_to_task_state( + self, task_state: TaskState, actions: List, rewards: List[float] + ) -> None: + step_count = step_counter.substitute(step_count=len(actions)) + reward_count = reward_counter.substitute(reward_count=sum(rewards)) + content = "\n".join([step_count, reward_count]) + self._add_message_to_task_state(task_state, "user", content) + + def _format_step_message( + self, action: int, observation: int, reward: float, sample: dict, terminated: bool = False + ) -> str: + observation_desc = f"Observation {observation}" + if terminated: 
+ template = step_result_reset + else: + template = step_result + return template.substitute(action=action, next_observation=observation_desc, reward=reward) + + def _calculate_episode_rewards(self, episode_end_steps, rewards): + episode_rewards = [] + if not episode_end_steps: # Handle case where there was only 1 episode + return [sum(rewards)] + start_index = 0 + for end_index in episode_end_steps: + episode_reward = sum(rewards[start_index:end_index]) + episode_rewards.append(episode_reward) + start_index = end_index + return episode_rewards diff --git a/evals/elsuite/incontext_rl/requirements.txt b/evals/elsuite/incontext_rl/requirements.txt new file mode 100644 index 0000000000..2712d1140b --- /dev/null +++ b/evals/elsuite/incontext_rl/requirements.txt @@ -0,0 +1,3 @@ +# Additional requirements for specific environments +gymnasium +git+https://github.com/james-aung/gymnasium-bandits \ No newline at end of file diff --git a/evals/elsuite/incontext_rl/scripts/plot_experiments.py b/evals/elsuite/incontext_rl/scripts/plot_experiments.py new file mode 100644 index 0000000000..9e8e27f82b --- /dev/null +++ b/evals/elsuite/incontext_rl/scripts/plot_experiments.py @@ -0,0 +1,363 @@ +import json +import numpy as np +import matplotlib.pyplot as plt +from scipy.stats import sem +import pandas as pd +from pathlib import Path +import matplotlib.colors as mcolors +import argparse +import seaborn as sns + +from evals.utils.log_utils import extract_spec, get_final_results_from_dir + +WINDOW_SIZES = { + "FrozenLake-v1 {'map_name': '4x4', 'is_slippery': False}": 20, + "BanditTwoArmedHighLowFixed-v0 {}": 40, + "BanditTenArmedRandomFixed-v0 {}": 40, + "CliffWalking-v0 {}": 20, + "FrozenLake-v1 {'map_name': '4x4', 'is_slippery': False, 'desc': ['SHFF', 'FFFF', 'FFGH', 'HFHF']}": 20, + "default": 20, +} + +PRETTY_MODEL_NAMES = { + 'generation/direct/gpt-4-turbo-preview': 'GPT-4 Turbo Preview', + 'incontext_rl/random': 'Random Strategy', + 'generation/direct/gpt-3.5-turbo': 'GPT-3.5 
Turbo', + 'incontext_rl/qlearning_scratch': 'Q-Learning from scratch', + 'incontext_rl/qlearning_trained': 'Q-Learning trained', + 'generation/direct/gemini-pro': 'Gemini Pro 1.0', + 'generation/direct/mixtral-8x7b-instruct': 'Mixtral 8x7b', +} + +PRETTY_ENV_TITLES = { + "FrozenLake-v1 {'map_name': '4x4', 'is_slippery': False}": 'Frozen Lake (4x4, Non-slippery)', + "BanditTwoArmedHighLowFixed-v0 {}": "Two-Armed Bandit", + "BanditTenArmedRandomFixed-v0 {}": "Ten-Armed Bandit", + "CliffWalking-v0 {}": "Cliff Walking", + "FrozenLake-v1 {'map_name': '4x4', 'is_slippery': False, 'desc': ['SFFF', 'FHFH', 'FFFH', 'GFFH']}": 'Frozen Lake Custom Map (4x4, Non-slippery)', +} + +MODEL_STYLES = { + 'generation/direct/gpt-4-turbo-preview': {'line_style': '-', 'color': 'purple', 'alpha': 0.7}, + 'incontext_rl/random': {'line_style': ':', 'color': 'grey', 'alpha': 0.7}, + 'generation/direct/gpt-3.5-turbo': {'line_style': '-', 'color': 'green', 'alpha': 0.7}, + 'incontext_rl/qlearning_scratch': {'line_style': '--', 'color': 'grey', 'alpha': 0.7}, + 'incontext_rl/qlearning_trained': {'line_style': '-', 'color': 'black', 'alpha': 0.7}, + 'generation/direct/gemini-pro': {'line_style': '-', 'color': 'blue', 'alpha': 0.7}, + 'generation/direct/mixtral-8x7b-instruct': {'line_style': '-', 'color': 'orange', 'alpha': 0.7}, + 'default': {'line_style': '-', 'color': 'black', 'alpha': 0.5}, +} + +def calculate_episode_rewards(row: pd.Series) -> list: + """ + Calculate the rewards for each episode based on the episode end steps and rewards. 
+ """ + episode_end_steps = row['episode_end_steps'] + rewards = row['rewards'] + episode_rewards = [] + if not episode_end_steps: # Handle case where there was only 1 episode + return [sum(rewards)] + start_index = 0 + for end_index in episode_end_steps: + episode_reward = sum(rewards[start_index:end_index]) + episode_rewards.append(episode_reward) + start_index = end_index + return episode_rewards + +def calculate_rolling_average(episode_rewards: list, window_size: int) -> list: + """ + Calculate the rolling average of the episode rewards using a specified window size. + """ + window_size = int(window_size) + rolling_averages = [] + for i in range(len(episode_rewards)): + # Calculate the start index for the window; ensure it's not negative + start_index = max(0, i - window_size + 1) + # Calculate the running average for the current window + window_average = np.mean(episode_rewards[start_index:i+1]) + rolling_averages.append(window_average) + return rolling_averages + +def calculate_custom_episode_end_steps_for_cliffwalking(rewards: list, existing_end_steps: list) -> list: + """ + Calculate episode end steps based on rewards and append to existing end steps. + An episode also ends when the reward is -100 i.e. when the agent falls off the cliff. + + Args: + rewards (list): List of rewards for each step in an episode. + existing_end_steps (list): List of already identified episode end steps. + + Returns: + list: Updated list of indices representing the end of each episode. + """ + new_end_steps = [i + 1 for i, reward in enumerate(rewards) if reward == -100] + # Combine existing and new end steps, remove duplicates, and sort + combined_end_steps = sorted(set(existing_end_steps + new_end_steps)) + return combined_end_steps + +def extract_results(datadir: Path) -> pd.DataFrame: + """ + Extracts results from the specified directory and returns a DataFrame. + + Args: + datadir (Path): Path to the directory containing the experiment results. 
+ + Returns: + pd.DataFrame: DataFrame containing the experiment results. + """ + print(f"Extracting results from directory: {datadir}") + df_rows = [] + final_results = get_final_results_from_dir(datadir) + if not final_results: + print("No results found in directory.") + raise ValueError("No results found in directory.") + + for path, results in final_results.items(): + print(f"Processing file: {path}") + spec = extract_spec(path) + if not spec: + raise ValueError(f"No spec found for {path}") + model = spec.get("completion_fns", [None])[0] + base_eval = spec.get("base_eval") + if not model or base_eval is None: + raise ValueError(f"Missing model or base_eval in spec for {path}") + + environments = results.get('environments', []) + for env in environments: + metrics = env.get('metrics', {}) + flattened_metrics = {f"{k}": v for k, v in metrics.items()} # Flatten metrics into separate columns + print(f"Extracted {env['env']} metrics for model: {model}") + + # Calculate custom episode end steps for CliffWalking environment + if env['env'] == "CliffWalking-v0 {}": + rewards = metrics.get('rewards', []) + existing_end_steps = metrics.get('episode_end_steps', []) + episode_end_steps = calculate_custom_episode_end_steps_for_cliffwalking(rewards, existing_end_steps) + flattened_metrics['episode_end_steps'] = episode_end_steps + + df_rows.append({"model": model, "base_eval": base_eval, "environment": env['env'], **flattened_metrics}) + + df = pd.DataFrame(df_rows) + + if 'episode_rewards' not in df.columns: + df['episode_rewards'] = df.apply(calculate_episode_rewards, axis=1) + + # For plots + df['cumulative_episode_rewards'] = df['episode_rewards'].apply(np.cumsum) + df['average_episode_reward'] = df['episode_rewards'].apply(np.mean) + df['window_size'] = df['environment'].map(WINDOW_SIZES).fillna(WINDOW_SIZES.get('default', 20)) + df['rolling_average_rewards'] = df.apply(lambda row: calculate_rolling_average(row['episode_rewards'], row['window_size']), axis=1) + + # We 
also calculate the rolling average across different window sizes + df['rolling_average_rewards_5_episodes'] = df.apply(lambda row: calculate_rolling_average(row['episode_rewards'], 5), axis=1) + df['rolling_average_rewards_10_episodes'] = df.apply(lambda row: calculate_rolling_average(row['episode_rewards'], 10), axis=1) + df['rolling_average_rewards_20_episodes'] = df.apply(lambda row: calculate_rolling_average(row['episode_rewards'], 20), axis=1) + df['rolling_average_rewards_50_episodes'] = df.apply(lambda row: calculate_rolling_average(row['episode_rewards'], 50), axis=1) + + # We also calculate the average reward for the last 5, 10, 20, and 50 episodes. For older runs, we may not have this information. + if 'average_reward_last_5_episodes' not in df.columns: + df['average_reward_last_5_episodes'] = df['episode_rewards'].apply(lambda rewards: np.mean(rewards[-5:])) + if 'average_reward_last_10_episodes' not in df.columns: + df['average_reward_last_10_episodes'] = df['episode_rewards'].apply(lambda rewards: np.mean(rewards[-10:])) + if 'average_reward_last_20_episodes' not in df.columns: + df['average_reward_last_20_episodes'] = df['episode_rewards'].apply(lambda rewards: np.mean(rewards[-20:])) + if 'average_reward_last_50_episodes' not in df.columns: + df['average_reward_last_50_episodes'] = df['episode_rewards'].apply(lambda rewards: np.mean(rewards[-50:])) + + print(f"Extraction complete. {len(df_rows)} rows in DataFrame.") + return df + +def plot_rewards(df, environment, reward_type, out_dir, window_size=None): + """ + Generalized function to plot episode, cumulative, or running average rewards for different models + on the same graph for a specific environment. It automatically determines the plot type (line or scatter) + based on the number of episodes and includes the 95% confidence intervals for line plots. + + Args: + df (pd.DataFrame): DataFrame containing the experiment results. + environment (str): Name of the environment to plot. 
+ reward_type (str): Type of reward to plot. Must be one of 'episode_rewards', 'cumulative_episode_rewards', or 'rolling_average_rewards'. + out_dir (Path): Path to the directory to save the plots. + window_size (int): Window size for calculating rolling averages. If None, the window size will be determined based on the environment. + """ + valid_reward_types = ['episode_rewards', 'cumulative_episode_rewards', 'rolling_average_rewards'] + if reward_type not in valid_reward_types: + raise ValueError(f"Invalid reward_type. Expected one of {valid_reward_types}, got {reward_type}") + + # Filter the DataFrame for the specific environment + filtered_df = df[df['environment'] == environment] + + # Explode the specified reward list into separate rows and prepare for plotting + rewards_df = filtered_df.explode(reward_type).reset_index() # Each row will be a single episode + rewards_df['episode'] = rewards_df.groupby(['model', 'index']).cumcount() + 1 # Add episode number as a column + rewards_df['reward'] = rewards_df[reward_type] # Rename the column for clarity + + truncate_per_model = True + if environment == "CliffWalking-v0 {}": + truncate_per_model = False # Hacky workaround to make better plots since some models only have 1 episode on CliffWalking + + if truncate_per_model: + filtered_rewards_df = pd.DataFrame() + for model, group in rewards_df.groupby('model'): + # Count the number of runs for each episode number + episode_counts = group.groupby('episode').size() + # Check if there are at least 3 runs for any episode number + if episode_counts.max() >= 3: + # Find the maximum episode number where at least 3 runs are available + max_episode_with_at_least_3_runs = episode_counts[episode_counts >= 3].index.max() + # Filter the group DataFrame to only include data up to this episode number + model_filtered = group[group['episode'] <= max_episode_with_at_least_3_runs] + else: + # If there are fewer than 3 runs for all episodes, include all data for this model + 
model_filtered = group + # Append the filtered data for the current model to the overall filtered DataFrame + filtered_rewards_df = pd.concat([filtered_rewards_df, model_filtered], ignore_index=True) + rewards_df = filtered_rewards_df + + plt.figure(figsize=(10, 5)) + ax = plt.gca() + + # Determine the plot type based on the number of episodes + num_episodes = len(rewards_df['episode'].unique()) + if num_episodes > 1: + # Iterate over each unique model in the DataFrame + for model in rewards_df['model'].unique(): + # Filter the DataFrame for the current model + model_df = rewards_df[rewards_df['model'] == model] + # Get the custom style for the current model using the helper function + custom_style = MODEL_STYLES.get(model, MODEL_STYLES['default']) + pretty_model_name = PRETTY_MODEL_NAMES.get(model, model) + # Plot the data for the current model on the same axes with custom settings + lineplot = sns.lineplot(data=model_df, x='episode', y='reward', estimator='mean', errorbar=('ci', 95), + linestyle=custom_style['line_style'], color=custom_style['color'], + alpha=custom_style['alpha'], label=pretty_model_name, ax=ax, + err_kws={'alpha': 0.035}) + # Add labels to the final value on the x axis + for line in lineplot.get_lines(): + x, y = line.get_data() + if len(x) > 0: # Check if there is data to plot + ax.text(x[-1], y[-1], f"{y[-1]:.2f}", color=line.get_color(), fontsize=9) + else: + # For a single episode, use scatter plot, differentiating models by color + scatterplot = sns.scatterplot(data=rewards_df, x='episode', y='reward', hue='model', ax=ax) + # Add labels to the final value on the x axis + for line in scatterplot.collections: + offsets = line.get_offsets() + if offsets.size > 0: # Check if there are points to plot + last_point = offsets[-1] + ax.text(last_point[0], last_point[1], f"{last_point[1]:.2f}", fontsize=9) + + pretty_env_title = PRETTY_ENV_TITLES.get(environment, environment) + plt.title(f'{reward_type.replace("_", " ").title()} in 
{pretty_env_title} (Window Size: {window_size})' if reward_type == 'rolling_average_rewards' else f'{reward_type.replace("_", " ").title()} in {pretty_env_title}') + plt.xlabel('Episode') + plt.ylabel('Reward') + plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left') + plt.xlim(1, num_episodes) + plt.tight_layout() + plot_dir = out_dir / reward_type + plot_dir.mkdir(parents=True, exist_ok=True) + plt.savefig(plot_dir / f'{environment}.png') + plt.show() + +def calculate_rolling_averages(df: pd.DataFrame, max_items: int = 200): + """ + Calculate the averaged final and max rolling averages for the first N items in each model and environment. + Args: + df (pd.DataFrame): DataFrame containing the experiment results. + max_items (int): Maximum number of items to consider for calculating rolling averages. + Returns: + dict: Dictionary containing the averaged final and max rolling averages for each model and environment. + """ + + model_env_averages_info = {} + for model in df['model'].unique(): + model_df = df[df['model'] == model] + model_env_averages_info[model] = {} + all_final_rolling_averages = [] # To store all final rolling averages across environments for each model + for env in model_df['environment'].unique(): + env_df = model_df[model_df['environment'] == env] + # Determine the last shared episode across all runs for the current model and environment, limited to the first max_items items + max_shared_episode = min(max_items, env_df['rolling_average_rewards'].apply(lambda rewards: len(rewards[:max_items])).min()) + # Truncate each run's rolling_average_rewards to the max shared episode and then calculate averages + truncated_averages = env_df['rolling_average_rewards'].apply(lambda rewards: rewards[:max_shared_episode]) + final_rolling_averages = round(truncated_averages.apply(lambda rewards: rewards[-1] if len(rewards) > 0 else None).mean(), 2) + max_rolling_averages = round(truncated_averages.apply(lambda rewards: max(rewards) if len(rewards) > 
0 else None).mean(), 2) + + all_final_rolling_averages.append(final_rolling_averages) # Append the final rolling average for the current environment + + model_env_averages_info[model][env] = { + 'average_final_rolling_averages': final_rolling_averages, + 'average_max_rolling_averages': max_rolling_averages, + } + + # Calculate the average final rolling average across all environments for the current model + average_final_across_envs = round(sum(all_final_rolling_averages) / len(all_final_rolling_averages), 2) if all_final_rolling_averages else None + model_env_averages_info[model]['average_final_rolling_averages_across_envs'] = average_final_across_envs + return model_env_averages_info + +def json_of_results(df: pd.DataFrame, out_dir: Path): + """ + JSON dump of the results. + + Each model will have the following information, grouping by environment: + - Average episode reward + - Last rolling average reward for each of 5, 10, 20, and 50 episodes + - Max rolling average reward across the 5, 10, 20, and 50 episodes + - Invalid response rate + + Where there are multiple runs for a model and environment, the average of the above values will be calculated. 
+ """ + + model_info = {} + for model in df['model'].unique(): + model_df = df[df['model'] == model] + model_info[model] = {} + for env in model_df['environment'].unique(): + env_df = model_df[model_df['environment'] == env] + # Calculate the average rolling averages across all runs for each window size, then find the max + average_rolling_averages_5 = env_df['rolling_average_rewards_5_episodes'].apply(pd.Series).mean().max() + average_rolling_averages_10 = env_df['rolling_average_rewards_10_episodes'].apply(pd.Series).mean().max() + average_rolling_averages_20 = env_df['rolling_average_rewards_20_episodes'].apply(pd.Series).mean().max() + average_rolling_averages_50 = env_df['rolling_average_rewards_50_episodes'].apply(pd.Series).mean().max() + + model_info[model][env] = { + 'average_episode_reward': round(env_df['average_episode_reward'].mean(), 2), + 'average_reward_last_5_episodes': round(env_df['average_reward_last_5_episodes'].mean(), 2), + 'average_reward_last_10_episodes': round(env_df['average_reward_last_10_episodes'].mean(), 2), + 'average_reward_last_20_episodes': round(env_df['average_reward_last_20_episodes'].mean(), 2), + 'average_reward_last_50_episodes': round(env_df['average_reward_last_50_episodes'].mean(), 2), + 'max_rolling_average_rewards_5_episodes': round(average_rolling_averages_5, 2), + 'max_rolling_average_rewards_10_episodes': round(average_rolling_averages_10, 2), + 'max_rolling_average_rewards_20_episodes': round(average_rolling_averages_20, 2), + 'max_rolling_average_rewards_50_episodes': round(average_rolling_averages_50, 2), + 'invalid_response_rate': round(env_df['invalid_response_rate'].mean(), 2), + } + with open(out_dir / 'model_info.json', 'w') as f: + json.dump(model_info, f, indent=4) + +def main(log_dir: str = None, out_dir: str = None): + + parser = argparse.ArgumentParser() + parser.add_argument("--log_dir", "-d", type=str, required=not log_dir) + parser.add_argument("--out_dir", "-o", type=str, required=not out_dir) + 
args = parser.parse_args() + log_dir = Path(log_dir) if log_dir else Path(args.log_dir) + out_dir = Path(out_dir) if out_dir else Path(args.out_dir) + + # Extract results from directory + df = extract_results(log_dir) + + # Plot episode, cumulative, and rolling-average rewards (line plots include 95% confidence intervals) + for env in df['environment'].unique(): + plot_rewards(df, env, 'episode_rewards', out_dir) + plot_rewards(df, env, 'cumulative_episode_rewards', out_dir) + window_size = df[df['environment'] == env]['window_size'].iloc[0] + plot_rewards(df, env, 'rolling_average_rewards', out_dir, window_size) + + # JSON dump of the results + json_of_results(df, out_dir) + + +if __name__ == "__main__": + main() + diff --git a/evals/elsuite/incontext_rl/scripts/qlearning_baseline.ipynb b/evals/elsuite/incontext_rl/scripts/qlearning_baseline.ipynb new file mode 100644 index 0000000000..bb2dc7ae44 --- /dev/null +++ b/evals/elsuite/incontext_rl/scripts/qlearning_baseline.ipynb @@ -0,0 +1,402 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install gymnasium\n", + "!pip install numpy\n", + "!pip install git+https://github.com/james-aung/gymnasium-bandits\n", + "!pip install tqdm" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import gymnasium as gym\n", + "import random\n", + "import json\n", + "\n", + "import gymnasium_bandits" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "# Training parameters\n", + "n_training_episodes = 10000 # Total training episodes (unused: train() below is limited by steps, not episodes)\n", + "n_training_steps = 200 # Total training steps\n", + "learning_rate = 0.7 # Learning rate\n", + "\n", + "# Evaluation parameters\n", + "reward_window_size = 25 # Number of steps to consider when calculating average reward\n", + "\n", + "# Environment parameters\n", + "gamma = 0.95 # Discounting rate\n", + "\n", + "# 
Exploration parameters\n", + "max_epsilon = 1.0 # Exploration probability at start\n", + "min_epsilon = 0.05 # Minimum exploration probability\n", + "decay_rate = 0.0005 # Exponential decay rate for exploration prob" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "def initialize_q_table(state_space, action_space):\n", + " Qtable = np.zeros((state_space, action_space))\n", + " return Qtable" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "def greedy_policy(Qtable, state):\n", + " # Exploitation: take the action with the highest state-action value\n", + " action = np.argmax(Qtable[state][:])\n", + "\n", + " return action" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "def epsilon_greedy_policy(Qtable, state, epsilon):\n", + " # Randomly generate a number between 0 and 1\n", + " random_num = random.uniform(0,1)\n", + " # If random_num is greater than epsilon --> exploitation\n", + " if random_num > epsilon:\n", + " # Take the action with the highest value given a state\n", + " # (greedy_policy uses np.argmax over the Q-values for this state)\n", + " action = greedy_policy(Qtable, state)\n", + " # Otherwise --> exploration (samples from the notebook-level global env)\n", + " else:\n", + " action = env.action_space.sample()\n", + "\n", + " return action" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "def train(n_training_steps, min_epsilon, max_epsilon, decay_rate, env, Qtable, reward_window_size=25):\n", + "\n", + " actions, rewards = [], []\n", + " total_steps = 0\n", + " episode_end_steps = []\n", + "\n", + " for _ in range(n_training_steps):\n", + " if total_steps >= n_training_steps:\n", + " break\n", + " # Reduce epsilon (because we need less and less exploration)\n", + " epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*len(episode_end_steps))\n", + " # Reset the
environment\n", + " state, info = env.reset()\n", + " terminated = False\n", + " truncated = False\n", + "\n", + " while not terminated and not truncated and total_steps < n_training_steps:\n", + " # Choose the action At using epsilon greedy policy\n", + " action = epsilon_greedy_policy(Qtable, state, epsilon)\n", + "\n", + " # Take action At and observe Rt+1 and St+1\n", + " new_state, reward, terminated, truncated, info = env.step(action)\n", + "\n", + " actions.append(int(action))\n", + " rewards.append(float(reward))\n", + " total_steps += 1\n", + "\n", + " # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)] (learning_rate and gamma are notebook-level globals)\n", + " Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])\n", + "\n", + " # Our next state is the new state\n", + " state = new_state\n", + "\n", + " if terminated or truncated:\n", + " episode_end_steps.append(total_steps)\n", + "\n", + " training_summary = {\n", + " \"reward_window_size\": reward_window_size,\n", + " \"average_reward_at_end\": sum(rewards[-reward_window_size:])/min(reward_window_size, len(rewards)),\n", + " \"total_reward\": sum(rewards),\n", + " \"total_steps\": len(actions),\n", + " \"actions\": list(actions),\n", + " \"rewards\": list(rewards),\n", + " \"episode_end_steps\": episode_end_steps,\n", + " }\n", + " \n", + " print(f\"Training completed for {env.spec.id} at {total_steps} steps.\")\n", + " \n", + " return training_summary" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "def evaluate(env, Qtable, n_evaluation_steps=200):\n", + "\n", + " actions, rewards = [], []\n", + " total_steps = 0\n", + " episode_end_steps = []\n", + "\n", + " while total_steps < n_evaluation_steps:\n", + " # Reset the environment at the start of each new episode\n", + " state, info = env.reset()\n", + " if total_steps > 0: episode_end_steps.append(total_steps) # Step count at which the previous episode ended\n", + " terminated = False\n", + " truncated = False\n", + "\n", + " 
while not terminated and not truncated and total_steps < n_evaluation_steps:\n", + " # Choose the action At using greedy policy\n", + " action = greedy_policy(Qtable, state)\n", + "\n", + " # Take action At and observe Rt+1 and St+1\n", + " new_state, reward, terminated, truncated, info = env.step(action)\n", + "\n", + " actions.append(int(action))\n", + " rewards.append(float(reward))\n", + " total_steps += 1\n", + "\n", + " # Our next state is the new state\n", + " state = new_state\n", + "\n", + "\n", + " evaluation_summary = {\n", + " \"average_reward_at_end\": sum(rewards[-25:])/min(25, len(rewards)),\n", + " \"total_reward\": sum(rewards),\n", + " \"total_steps\": len(actions),\n", + " \"actions\": list(actions),\n", + " \"rewards\": list(rewards),\n", + " \"episode_end_steps\": episode_end_steps,\n", + " }\n", + " \n", + " print(f\"Evaluation completed for {env.spec.id} at {total_steps} steps.\")\n", + " \n", + " return evaluation_summary" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "frozenlake = gym.make(\"FrozenLake-v1\", is_slippery=False)\n", + "frozenlakecustom = gym.make(\"FrozenLake-v1\", is_slippery=False, desc =['SFFF', 'FHFH', 'FFFH', 'GFFH'])\n", + "twobandits = gym.make(\"BanditTwoArmedHighLowFixed-v0\")\n", + "tenbandits = gym.make(\"BanditTenArmedRandomFixed-v0\")\n", + "cliffwalking = gym.make(\"CliffWalking-v0\")\n", + "\n", + "envs = [frozenlake, frozenlakecustom, twobandits, tenbandits, cliffwalking]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training FrozenLake-v1 with args {'map_name': '4x4', 'is_slippery': False}...\n", + "Training completed for FrozenLake-v1 at 200 steps.\n", + "Training FrozenLake-v1 with args {'map_name': '4x4', 'is_slippery': False, 'desc': ['SFFF', 'FHFH', 'FFFH', 'GFFH']}...\n", + "Training completed for FrozenLake-v1 at 200 
steps.\n", + "Training BanditTwoArmedHighLowFixed-v0 with args {}...\n", + "Training completed for BanditTwoArmedHighLowFixed-v0 at 200 steps.\n", + "Training BanditTenArmedRandomFixed-v0 with args {}...\n", + "Training completed for BanditTenArmedRandomFixed-v0 at 200 steps.\n", + "Training CliffWalking-v0 with args {}...\n", + "Training completed for CliffWalking-v0 at 200 steps.\n" + ] + } + ], + "source": [ + "from datetime import datetime\n", + "\n", + "environment_results = []\n", + "\n", + "for env in envs:\n", + " print(f\"Training {env.spec.id} with args {env.spec.kwargs}...\")\n", + " Qtable = initialize_q_table(env.observation_space.n, env.action_space.n)\n", + " results = train(n_training_steps, min_epsilon, max_epsilon, decay_rate, env, Qtable, reward_window_size)\n", + " # train(n_training_steps, min_epsilon, max_epsilon, decay_rate, env, Qtable, reward_window_size)\n", + " # results = evaluate(env, Qtable)\n", + " env_result = {\"env\": f\"{env.spec.id} {env.spec.kwargs}\", \"metrics\": results}\n", + " environment_results.append(env_result)\n", + "\n", + "current_time = datetime.now().strftime(\"%Y-%m-%d %H:%M:%S\")\n", + "spec = {\"spec\": {\"completion_fns\": [\"incontext_rl/qlearning\"], \"eval_name\": \"incontext_rl.v0\", \"base_eval\": \"incontext_rl\", \"split\": \"v0\", \"created_at\": current_time}}\n", + "final_report = {\"final_report\": {\"environments\": environment_results}}\n", + "\n", + "with open('./logs/qlearning_incontext_rl.log', 'w') as f:\n", + " json.dump(spec, f)\n", + " f.write(\"\\n\")\n", + " json.dump(final_report, f)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Plotting running average reward for FrozenLake-v1\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Plotting running average reward for FrozenLake-v1\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Plotting running average reward for BanditTwoArmedHighLowFixed-v0\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Plotting running average reward for BanditTenArmedRandomFixed-v0\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Plotting running average reward for CliffWalking-v0\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "def plot_running_average(rewards, window_size=10000):\n", + " running_avg = np.convolve(rewards, np.ones(window_size)/window_size, mode='valid')\n", + " plt.plot(running_avg)\n", + " plt.title('Running Average Reward Over Time')\n", + " plt.xlabel('Step Number')\n", + " plt.ylabel('Running Average Reward')\n", + " plt.show()\n", + "\n", + "# Assuming `rewards` is a list of rewards from each episode\n", + "for env in environment_results:\n", + " rewards = env[\"metrics\"][\"rewards\"]\n", + " print(f\"Plotting running average reward for {env['env_id']}\")\n", + " plot_running_average(rewards)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/evals/elsuite/incontext_rl/scripts/run_experiments.sh b/evals/elsuite/incontext_rl/scripts/run_experiments.sh new file mode 100755 index 0000000000..9d8765dc22 --- /dev/null +++ b/evals/elsuite/incontext_rl/scripts/run_experiments.sh @@ -0,0 +1,39 @@ +#!/bin/bash +logdir=./logs +outputdir=./outputs + +timestamp=$(date +%Y%m%d_%H%M%S) +logpathbase=$logdir/$timestamp/ + +mkdir -p ${logpathbase} + +echo Running experiments and logging to $logpathbase +read -p "Enter the number of runs: " num_runs + +set -x # Enable printing of each command before it's executed +# Random baselines +oaieval incontext_rl/random incontext_rl.v0 --record_path ${logpathbase}explanations/random.log +oaieval incontext_rl/random 
incontext_rl.raw.v0 --record_path ${logpathbase}raw/random.log + +for (( run=1; run<=num_runs; run++ )) +do + echo "Run #$run" + # Use explanations variant + # Direct + oaieval generation/direct/gpt-4-turbo-preview incontext_rl.v0 --record_path ${logpathbase}explanations/gpt-4-turbo-preview_${run}.log + oaieval generation/direct/gpt-3.5-turbo incontext_rl.v0 --record_path ${logpathbase}explanations/gpt-3.5-turbo_${run}.log + + # Raw variant + # Direct + oaieval generation/direct/gpt-4-turbo-preview incontext_rl.raw.v0 --record_path ${logpathbase}raw/gpt-4-turbo-preview_${run}.log + oaieval generation/direct/gpt-3.5-turbo incontext_rl.raw.v0 --record_path ${logpathbase}raw/gpt-3.5-turbo_${run}.log + +done + +echo Done running experiments, all logs in $logpathbase + +echo Producing plots for use_explanations variant, outputs to $outputdir +python plot_experiments.py --log_dir $logpathbase/explanations --out_dir $outputdir/explanations +echo Producing plots for raw variant, outputs to $outputdir +python plot_experiments.py --log_dir $logpathbase/raw --out_dir $outputdir/raw +set +x # Disable printing of each command after they've been executed diff --git a/evals/registry/data/incontext_rl/samples.jsonl b/evals/registry/data/incontext_rl/samples.jsonl new file mode 100644 index 0000000000..acacbee595 --- /dev/null +++ b/evals/registry/data/incontext_rl/samples.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4a675930c0b31dcee9dca9f653085f9eb2b856c1284c289ed5501d44bd94fec5 +size 4138 diff --git a/evals/registry/data/incontext_rl/samples_dev.jsonl b/evals/registry/data/incontext_rl/samples_dev.jsonl new file mode 100644 index 0000000000..110af6c8cf --- /dev/null +++ b/evals/registry/data/incontext_rl/samples_dev.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:863664b313c3c8e77e3da6ad6f0ef695e8f86ff9d1ecdd7d5fcf0d408bf464da +size 1617 diff --git a/evals/registry/data/incontext_rl/samples_gymnasium_only.jsonl 
b/evals/registry/data/incontext_rl/samples_gymnasium_only.jsonl new file mode 100644 index 0000000000..d0448241d0 --- /dev/null +++ b/evals/registry/data/incontext_rl/samples_gymnasium_only.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7314053ae7203d627611fadb2d5f04f2aa6b001def00047bca206d0db43cb62b +size 3455 diff --git a/evals/registry/evals/incontext_rl.yaml b/evals/registry/evals/incontext_rl.yaml new file mode 100644 index 0000000000..e66f358569 --- /dev/null +++ b/evals/registry/evals/incontext_rl.yaml @@ -0,0 +1,62 @@ +incontext_rl: + id: incontext_rl.gymnasium.v0 + metrics: [] + +incontext_rl.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: True + samples_jsonl: incontext_rl/samples.jsonl + +incontext_rl.raw.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: False + samples_jsonl: incontext_rl/samples.jsonl + +incontext_rl.gymnasium.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: True + samples_jsonl: incontext_rl/samples_gymnasium_only.jsonl + +incontext_rl.gymnasium.raw.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: False + samples_jsonl: incontext_rl/samples_gymnasium_only.jsonl + +incontext_rl.short.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: True + max_steps: 100 + samples_jsonl: incontext_rl/samples.jsonl + +incontext_rl.raw.short.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: False + max_steps: 100 + samples_jsonl: incontext_rl/samples.jsonl + +incontext_rl.gymnasium.short.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: True + max_steps: 100 + samples_jsonl: incontext_rl/samples_gymnasium_only.jsonl + +incontext_rl.gymnasium.raw.short.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: False + max_steps: 100 + samples_jsonl: 
incontext_rl/samples_gymnasium_only.jsonl + +incontext_rl.dev.v0: + class: evals.elsuite.incontext_rl.eval:InContextRl + args: + use_explanations: True + max_steps: 5 + samples_jsonl: incontext_rl/samples.jsonl \ No newline at end of file diff --git a/evals/registry/solvers/incontext_rl.yaml b/evals/registry/solvers/incontext_rl.yaml new file mode 100644 index 0000000000..e374f2e75d --- /dev/null +++ b/evals/registry/solvers/incontext_rl.yaml @@ -0,0 +1,27 @@ +incontext_rl/random: + class: evals.elsuite.incontext_rl.baselines:RandomSolver + +incontext_rl/q-learning: + class: evals.elsuite.incontext_rl.baselines:QlearningSolver + +incontext_rl/anti-cot/gpt-3.5-turbo: + class: evals.elsuite.incontext_rl.anti-cot_solver:AntiCoTSolver + args: + solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-3.5-turbo + extra_options: + temperature: 1 + +incontext_rl/anti-cot/gpt-4-turbo-preview: + class: evals.elsuite.incontext_rl.anti-cot_solver:AntiCoTSolver + args: + solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-turbo-preview + extra_options: + temperature: 1 \ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index 057f38b9eb..bc146903aa 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -34,6 +34,7 @@ dependencies = [ "jiwer", "seaborn", "statsmodels", + "gymnasium", "networkx", "chess", ]
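
Reviewer note (not part of the patch): the reward bookkeeping in `plot_experiments.py` can be checked without running the eval. The following is a minimal standalone sketch that mirrors `calculate_episode_rewards`, `calculate_rolling_average`, and the CliffWalking end-step adjustment using only the standard library. The function names `episode_rewards`, `rolling_average`, and `cliffwalking_end_steps`, and the sample reward list, are illustrative assumptions rather than names from the patch.

```python
from typing import List


def episode_rewards(rewards: List[float], episode_end_steps: List[int]) -> List[float]:
    """Sum per-step rewards into per-episode totals.

    Each entry of episode_end_steps is the step count at which an episode
    ended, so it can be used directly as the exclusive end of a slice.
    """
    if not episode_end_steps:  # no recorded end: treat the run as one episode
        return [sum(rewards)]
    totals, start = [], 0
    for end in episode_end_steps:
        totals.append(sum(rewards[start:end]))
        start = end
    return totals


def rolling_average(values: List[float], window_size: int) -> List[float]:
    """Trailing mean over at most window_size values, one output per input."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)  # clamp the window at the start
        window = values[start:i + 1]
        averages.append(sum(window) / len(window))
    return averages


def cliffwalking_end_steps(rewards: List[float], existing_end_steps: List[int]) -> List[int]:
    """Also treat a -100 reward (falling off the cliff) as an episode end."""
    falls = [i + 1 for i, reward in enumerate(rewards) if reward == -100]
    return sorted(set(existing_end_steps) | set(falls))


if __name__ == "__main__":
    # Hypothetical run: the agent falls on step 3, then the run ends at step 6.
    step_rewards = [-1, -1, -100, -1, -1, -1]
    ends = cliffwalking_end_steps(step_rewards, [6])
    totals = episode_rewards(step_rewards, ends)
    print(ends, totals, rolling_average(totals, 2))
```

Run as a script, this prints `[3, 6] [-102, -3] [-102.0, -52.5]`. Note that any rewards recorded after the last end step are not counted as an episode, which matches the behaviour of the patch's helper.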