
EOFError error during remote_worker_envs flags #46346

Open
XavierGeerinck opened this issue Jun 30, 2024 · 3 comments
Labels
bug: Something that is supposed to be working; but isn't
P2: Important issue, but not time-critical
rllib: RLlib related issues
rllib-env: rllib env related issues

Comments

@XavierGeerinck

What happened + What you expected to happen

I am trying to get training to work with remote_worker_envs set to True, but I am getting an EOFError.

Traceback (most recent call last):
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/env_runner_group.py", line 169, in __init__
    self._setup(
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/env_runner_group.py", line 239, in _setup
    self.add_workers(
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/env_runner_group.py", line 799, in add_workers
    raise result.get()
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/utils/actor_manager.py", line 500, in _fetch_result
    result = ray.get(ready)
             ^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/worker.py", line 2639, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/worker.py", line 866, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::SingleAgentEnvRunner.__init__() (pid=83332, ip=127.0.0.1, actor_id=1ca45720433ca900e1057f5801000000, repr=<ray.rllib.env.single_agent_env_runner.SingleAgentEnvRunner object at 0x149cc1210>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/single_agent_env_runner.py", line 79, in __init__
    self.make_env()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/single_agent_env_runner.py", line 764, in make_env
    gym.vector.make(
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/__init__.py", line 82, in make
    return AsyncVectorEnv(env_fns) if asynchronous else SyncVectorEnv(env_fns)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/async_vector_env.py", line 169, in __init__
    self._check_spaces()
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/async_vector_env.py", line 504, in _check_spaces
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/async_vector_env.py", line 504, in <listcomp>
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
                               ^^^^^^^^^^^
  File "/Users/xaviergeerinck/.pyenv/versions/3.11.8/lib/python3.11/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/Users/xaviergeerinck/.pyenv/versions/3.11.8/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/Users/xaviergeerinck/.pyenv/versions/3.11.8/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
    raise EOFError
EOFError

Versions / Dependencies

platform : macOS 14.2 23C64 (arm64)
memory : 48.0 GB
cpu : 16 cores
mac : 6a:3c:67:54:c1:4d
ip : 192.168.4.76
model_info : Mac15,9 (MUW73LL/A)
kernel_version : 23.2.0
git_commit_sha : Unknown
python_version : 3.11.8 (/.venv/bin/python)
pip_version : 24.0 (/.venv/lib/python3.11/site-packages/pip)
torch_version : 2.3.0
docker_version : 24.0.7,
kubernetes_version : 1.28.2
ray_version : 2.31.0
nvidia_smi : Unknown, nvidia-smi was not found
nvidia_cuda : Unknown, nvcc was not found
is_tty : True

Reproduction script

.env_runners(
    num_env_runners=1,
    num_envs_per_env_runner=8,
    num_cpus_per_env_runner=1,
    num_gpus_per_env_runner=0,
    sample_timeout_s=60,
    remote_worker_envs=True,
    rollout_fragment_length="auto",
)
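The snippet above is only the env_runners fragment of the config. A minimal self-contained sketch that exercises the same settings might look like the following (the algorithm, environment, and build call are assumptions for illustration; the issue does not show the full script):

```python
# Hypothetical repro sketch: PPO on CartPole-v1 is an assumption,
# only the .env_runners(...) values come from the issue report.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .env_runners(
        num_env_runners=1,
        num_envs_per_env_runner=8,
        num_cpus_per_env_runner=1,
        num_gpus_per_env_runner=0,
        sample_timeout_s=60,
        remote_worker_envs=True,   # this flag triggers the async vector-env path
        rollout_fragment_length="auto",
    )
)

# Building the algorithm constructs the EnvRunners; per the traceback,
# the EOFError surfaces here, inside gym.vector's AsyncVectorEnv setup.
algo = config.build()
```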

Issue Severity

High: It blocks me from completing my task.

@XavierGeerinck XavierGeerinck added the bug and triage labels Jun 30, 2024
@XavierGeerinck XavierGeerinck changed the title [<Ray component: Core|RLlib|etc...>] EOFError error during remote_worker_envs flags Jul 1, 2024
@anyscalesam anyscalesam added the core label Jul 8, 2024
@jjyao jjyao added the rllib label and removed the core label Jul 8, 2024
@simonsays1980
Collaborator

@XavierGeerinck Thanks for raising this issue. Can you provide a reproducible example? My guess is that this happens on the new API stack, which does not yet support asynchronous vector environments (we are waiting for a gymnasium update).

@simonsays1980 simonsays1980 added the rllib-env and P2 labels and removed the triage label Jul 9, 2024
@XavierGeerinck
Author

Awesome! We are indeed thinking the same and are awaiting the 1.0.0a2 release to start testing. Is there any ETA currently that you are aware of?

@simonsays1980
Collaborator

This should come soon, but we don't have an ETA for it. Would it help, for the time being, to sample with more env runners but a single env in each of them?
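A sketch of that workaround applied to the reproduction config (hypothetical values; scale num_env_runners to your available CPUs):

```python
# Workaround sketch: instead of vectorizing 8 envs inside one remote
# runner (which goes through the failing AsyncVectorEnv path), run
# one env per runner and scale out the number of runners.
config = config.env_runners(
    num_env_runners=8,          # more runners...
    num_envs_per_env_runner=1,  # ...with a single env each
    remote_worker_envs=False,   # avoid the async vector-env code path
    sample_timeout_s=60,
    rollout_fragment_length="auto",
)
```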
