[core][experimental] Calling ray.get() on CompiledDAGRef after dag.teardown() or actor failure hangs #46284
Comments
possible root cause for #46253 ... jack and kai-hsun to look into it.
I reproduced this with the following script:
Yes, we want it to work in both cases, whether the DAG has been explicitly torn down or if an actor in the DAG failed. The only difference is that the latter case should also print out an
The issue seems to be that an `IOError` (channel closed) is being properly returned by `CoreWorker::Get()`, but the Python code is ignoring this error status and doing an invalid memory access to the buffer.
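To illustrate the pattern described in this comment, here is a minimal sketch of checking a returned status before touching the buffer. It is not Ray's actual code: `Status`, `GetResult`, `ChannelClosedError`, and `read_payload` are hypothetical names invented for this example, and the real check would live in the core worker's Python bindings.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Status(Enum):
    """Hypothetical stand-in for the status returned by a get call."""
    OK = auto()
    IO_ERROR = auto()  # e.g. the channel was closed


class ChannelClosedError(Exception):
    """Hypothetical error surfaced to the caller instead of garbage data."""


@dataclass
class GetResult:
    status: Status
    buffer: Optional[bytes]  # only valid when status is OK


def read_payload(result: GetResult) -> bytes:
    # Check the status first. Reading the buffer when the status is not OK
    # is the kind of invalid access the comment above describes.
    if result.status is not Status.OK:
        raise ChannelClosedError("channel closed; buffer is not valid")
    return result.buffer
```

With this shape, a closed channel produces an exception at the call site rather than a read of memory that no longer holds a valid value.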
I used the same example as #46284 (comment). I added a print function, Lines 560 to 561 in 755a49b
I have two questions:
I wrote a response for this in the PR: #46320 (comment)
#46320)

Prior to this PR (described in #46284), calling `ray.get()` on a `CompiledDAGRef` (i.e., a channel) after DAG teardown would return a large series of zeroes. This issue could be reproduced with this script:

```
import ray
from ray.dag import InputNode

@ray.remote
class Actor:
    def foo(self, arg):
        return arg

a = Actor.remote()
with InputNode() as inp:
    dag = a.foo.bind(inp)

dag = dag.experimental_compile()
x = dag.execute(1)
dag.teardown()
# `ray.get(x)` returns a large series of zeroes.
print(ray.get(x))
```

This issue happened because the channel was unregistered with the mutable object manager on DAG teardown. On a subsequent access to the channel, the core worker therefore treated the channel reference as a normal immutable Ray object rather than a channel mutable object. As a result, the core worker returned the raw underlying memory for the mutable object, and the memory buffers were sized to the total size of the underlying memory, not the amount of data in the mutable object.

This PR fixes this issue by checking that a channel is either currently registered or previously registered, rather than checking only that the channel is currently registered.

Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
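The "currently registered or previously registered" check can be sketched with a toy model. This is an illustration under stated assumptions, not the actual mutable object manager (which is implemented in Ray's C++ core worker): `ToyChannelRegistry`, `TornDownChannelError`, and every method name below are hypothetical.

```python
class TornDownChannelError(Exception):
    """Hypothetical error raised when a torn-down channel is read."""


class ToyChannelRegistry:
    """Toy model that remembers channels even after they are unregistered."""

    def __init__(self):
        self._active = {}      # channel id -> current value
        self._closed = set()   # ids of channels unregistered at teardown

    def register(self, channel_id, value):
        self._active[channel_id] = value

    def unregister(self, channel_id):
        # Remember the id so later reads are not mistaken for plain objects.
        self._active.pop(channel_id, None)
        self._closed.add(channel_id)

    def is_channel(self, channel_id):
        # The fix in miniature: currently OR previously registered.
        return channel_id in self._active or channel_id in self._closed

    def get(self, object_id, plain_objects):
        if object_id in self._active:
            return self._active[object_id]
        if object_id in self._closed:
            raise TornDownChannelError(f"channel {object_id!r} was torn down")
        # Only ids that were never channels take the normal object path.
        return plain_objects[object_id]
```

In this model, a read after `unregister` raises instead of falling through to the plain-object path, which is the analogue of returning raw backing memory in the original bug.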
What happened + What you expected to happen
We should throw an error if `ray.get()` is called on a CompiledDAGRef after the DAG has already been torn down. Instead, it seems that `ray.get()` returns an infinite string of 0s and hangs.
Versions / Dependencies
3.0dev
Reproduction script
Issue Severity
None