[Core][Compiled Graph] Execute DAG on Actor's Main Thread #48608
base: master
Conversation
Signed-off-by: xsling <me@xsl.sh>
Thanks, @xslingcn! Can you add a test similar to the one that you have in the PR description? Also, I think there may be some issues since there are sometimes background system tasks that need to run on a different thread from the execution one, and now they may be blocked. I'll unblock the buildkite/premerge tests to run so you can see the failures.
Thanks @stephanie-wang for reviewing this!
Done with 8f6be8f.
It seems that Ray now fails to tear down the DAGs, even after I moved the cancellation tasks as follows:

diff --git a/python/ray/dag/compiled_dag_node.py b/python/ray/dag/compiled_dag_node.py
@@ -1940,7 +1938,9 @@ class CompiledDAG:
logger.info(f"Cancelling compiled worker on actor: {actor}")
# Cancel all actor loops in parallel.
cancel_refs = [
- actor.__ray_call__.remote(do_cancel_executable_tasks, tasks)
+ actor.__ray_call__.options(
+ concurrency_group="_ray_system"
+ ).remote(do_cancel_executable_tasks, tasks)
for actor, tasks in outer.actor_to_executable_tasks.items()
]
for cancel_ref in cancel_refs:

$ pytest -v -s python/ray/dag/tests/experimental/test_multi_node_dag.py
...
2024-11-16 10:34:10,738 INFO compiled_dag_node.py:1933 -- Tearing down compiled DAG
2024-11-16 10:34:10,738 INFO compiled_dag_node.py:1938 -- Cancelling compiled worker on actor: Actor(Actor, 95b3104e737fd143bb49c3a001000000)
2024-11-16 10:34:10,739 INFO compiled_dag_node.py:1938 -- Cancelling compiled worker on actor: Actor(Actor, f8bf38b6e3680bccb52e420a01000000)
2024-11-16 10:34:10,739 INFO compiled_dag_node.py:1938 -- Cancelling compiled worker on actor: Actor(Actor, 5080863784c80043a0018f8201000000)
2024-11-16 10:34:40,745 ERROR compiled_dag_node.py:1954 -- Error cancelling worker task
Traceback (most recent call last):
File "/root/ray/python/ray/dag/compiled_dag_node.py", line 1948, in teardown
ray.get(cancel_ref, timeout=30)
File "/root/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/ray/python/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/ray/python/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/ray/python/ray/_private/worker.py", line 882, in get_objects
] = self.core_worker.get_objects(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python/ray/_raylet.pyx", line 3486, in ray._raylet.CoreWorker.get_objects
check_status(op_status)
File "python/ray/includes/common.pxi", line 81, in ray._raylet.check_status
raise GetTimeoutError(message)
ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.
Why are these changes needed?
As mentioned in #46336, the current implementation executes all aDAGs in a background concurrency group, `_ray_system`, while actors run in their own default concurrency group. This discrepancy prevents the DAG from accessing thread-local state in an actor that was initialized prior to DAG execution, so code of the following shape raises an error at execution time.
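A minimal sketch of the failing pattern, reconstructed from the issue description rather than the PR's original snippet; the actor, method, and attribute names are illustrative, and the compiled-graph calls (`InputNode`, `experimental_compile`) reflect the experimental API as I understand it:

```python
import threading

import ray
from ray.dag import InputNode


@ray.remote
class Actor:
    def __init__(self):
        # Thread-local state initialized on the actor's main thread.
        self.local = threading.local()
        self.local.value = 42

    def read(self, _):
        # Prior to this PR, the compiled DAG invoked this method on the
        # background _ray_system thread, where the value set in __init__
        # is not visible, so this line raised an AttributeError.
        return self.local.value


actor = Actor.remote()
with InputNode() as inp:
    dag = actor.read.bind(inp)
compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute(0)))  # expected after this PR: 42
compiled_dag.teardown()
```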
This PR makes the DAG execution loop run on the actor's default executor (ref), which ensures the DAG runs on the same thread as the actor. With this change, the example above produces the expected output.
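A hedged sketch of the kind of regression test the reviewer asked for, asserting that DAG execution happens on the same thread that ran `__init__`; the test name and the `ray_start_regular` fixture are assumptions, not taken from this PR:

```python
import threading

import ray
from ray.dag import InputNode


@ray.remote
class ThreadReporter:
    def __init__(self):
        # Record the thread the actor was constructed on (its main thread).
        self.init_thread_id = threading.get_ident()

    def same_thread(self, _):
        # True only if the DAG loop runs on the actor's main thread.
        return threading.get_ident() == self.init_thread_id


def test_dag_runs_on_actor_main_thread(ray_start_regular):
    actor = ThreadReporter.remote()
    with InputNode() as inp:
        dag = actor.same_thread.bind(inp)
    compiled_dag = dag.experimental_compile()
    assert ray.get(compiled_dag.execute(0))
    compiled_dag.teardown()
```

Comparing `threading.get_ident()` values keeps the check independent of thread names, which matters since the `thread_name` API is deferred to a later PR.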
The `thread_name` API discussed in the original issue will be implemented later in a separate PR.

Related issue number
#46336
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.