Fix for incorrect channel read behavior after accelerated DAG teardown #46320

jackhumphries · 2024-06-28T08:11:41Z

Why are these changes needed?

Prior to this PR (described in #46284), calling ray.get() on a CompiledDAGRef (i.e., a channel) after DAG teardown would return a large series of zeroes. This issue could be reproduced with this script:

import ray
from ray.dag import InputNode


@ray.remote
class Actor:
    def foo(self, arg):
        # Echo the argument back to the caller.
        return arg


a = Actor.remote()
with InputNode() as inp:
    dag = a.foo.bind(inp)

# Compile the DAG; `execute()` then returns a CompiledDAGRef (i.e., a channel).
dag = dag.experimental_compile()
x = dag.execute(1)
dag.teardown()
# `ray.get(x)` returns a large series of zeroes.
print(ray.get(x))

This issue happened because the channel was unregistered with the mutable object manager on DAG teardown, and thus on a subsequent access to the channel, the core worker thought the channel reference was for a normal immutable Ray object rather than for a channel mutable object. Thus, the core worker was returning the raw underlying memory for the mutable object, and the returned buffers were sized to the total size of the underlying memory rather than to the amount of data actually written to the mutable object.
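
The size mismatch described above can be illustrated with a minimal Python sketch. This is not Ray's implementation: the 8-byte length header, the 64-byte buffer size, and the helper names `write_payload`, `read_as_mutable_object`, and `read_as_plain_object` are all assumptions made purely for illustration.

```python
import struct

# Hypothetical layout: a fixed-size backing buffer whose first 8 bytes
# record how many payload bytes were actually written.
BUFFER_SIZE = 64
HEADER = struct.Struct("<Q")  # payload length as a little-endian uint64


def write_payload(buf: bytearray, payload: bytes) -> None:
    """Store the payload length in the header, then the payload itself."""
    if HEADER.size + len(payload) > len(buf):
        raise ValueError("payload does not fit in the buffer")
    HEADER.pack_into(buf, 0, len(payload))
    buf[HEADER.size:HEADER.size + len(payload)] = payload


def read_as_mutable_object(buf: bytearray) -> bytes:
    """Channel-aware read: return only the bytes the header says were written."""
    (length,) = HEADER.unpack_from(buf, 0)
    return bytes(buf[HEADER.size:HEADER.size + length])


def read_as_plain_object(buf: bytearray) -> bytes:
    """Channel-unaware read: return all of the backing memory after the header."""
    return bytes(buf[HEADER.size:])


buf = bytearray(BUFFER_SIZE)
write_payload(buf, b"\x01")

sized = read_as_mutable_object(buf)  # just the one byte that was written
raw = read_as_plain_object(buf)      # the written byte plus zero-filled padding
print(len(sized), len(raw))
```

In this sketch the channel-unaware read hands back the whole pre-allocated region, so a one-byte payload comes back followed by unused, zero-filled memory.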

This PR fixes this issue by checking that a channel is either currently registered or was previously registered, rather than checking only that it is currently registered.
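
A rough Python sketch of the "currently or previously registered" check follows. The actual change lives in the C++ core worker; the `ChannelRegistry` class, its method names, and the `read_path` helper below are hypothetical stand-ins, and the strings they return are only labels for the two read paths discussed in this PR.

```python
class ChannelRegistry:
    """Hypothetical stand-in for the mutable object manager's bookkeeping."""

    def __init__(self):
        self._registered = set()       # channels that are registered right now
        self._ever_registered = set()  # channels that were registered at some point

    def register(self, channel_id):
        self._registered.add(channel_id)
        self._ever_registered.add(channel_id)

    def set_error(self, channel_id):
        # Models teardown: the channel stops counting as currently registered,
        # but the fact that it was once a channel is remembered.
        self._registered.discard(channel_id)

    def is_registered(self, channel_id):
        return channel_id in self._registered

    def is_or_was_registered(self, channel_id):
        return channel_id in self._ever_registered


def read_path(registry, object_id, check_previous):
    """Pick a read path; check_previous=False models the check before this PR."""
    if check_previous:
        is_channel = registry.is_or_was_registered(object_id)
    else:
        is_channel = registry.is_registered(object_id)
    return "mutable-object read" if is_channel else "plain-object read"


registry = ChannelRegistry()
registry.register("chan")
registry.set_error("chan")  # DAG teardown

before_fix = read_path(registry, "chan", check_previous=False)
after_fix = read_path(registry, "chan", check_previous=True)
print(before_fix, "->", after_fix)
```

With only the "currently registered" check, a torn-down channel falls through to the plain-object path; remembering past registration keeps it on the channel path.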

Related issue number

Closes #46284

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

kevin85421 · 2024-06-28T08:42:49Z

This issue happened because the channel was unregistered with the mutable object manager on DAG teardown

Where does unregister happen? Are you referring to SetErrorInternal?

the core worker thought the channel reference was for a normal immutable Ray object rather than for a channel mutable object. Thus, the core worker was returning the raw underlying memory for the mutable object

Are you suggesting that we should use GetExperimentalMutableObjects to read the object for the correct result, but due to the channel being unregistered, GetObjects was used instead?

jackhumphries · 2024-06-28T09:16:38Z

This issue happened because the channel was unregistered with the mutable object manager on DAG teardown

Where does unregister happen? Are you referring to SetErrorInternal?

the core worker thought the channel reference was for a normal immutable Ray object rather than for a channel mutable object. Thus, the core worker was returning the raw underlying memory for the mutable object

Are you suggesting that we should use GetExperimentalMutableObjects to read the object for the correct result, but due to the channel being unregistered, GetObjects was used instead?

When I said "unregistered", I meant that SetErrorInternal() was called on the channel, as you said. This method then sets reader_registered to false for the channel.

For the second question, what you said is correct.

ruisearch42

Great findings!

src/ray/core_worker/core_worker.cc

kevin85421

The PR description makes sense to me, but I have a question:

Based on my observation at #46284 (comment), the actor method can get an error message like (Actor pid=2054169) check_status: Channel closed. False True. This means that:

- The shared memory channel calls close() and writes the error into has_error successfully. This also implies that the channels are unregistered.
- However, the actor can still get the error, which means it calls CheckHasError, which seems to be called only in ReadAcquire. If the channel is unregistered, it should not call ReadAcquire. There might be some time difference between setting has_error and unregistration. However, the actor always prints the log.

src/ray/core_worker/core_worker.cc

jackhumphries · 2024-06-28T18:32:43Z

The PR description makes sense to me, but I have a question:

Based on my observation at #46284 (comment), the actor method can get an error message like (Actor pid=2054169) check_status: Channel closed. False True. This means that:

- The shared memory channel calls close() and writes the error into has_error successfully. This also implies that the channels are unregistered.
- However, the actor can still get the error, which means it calls CheckHasError, which seems to be called only in ReadAcquire. If the channel is unregistered, it should not call ReadAcquire. There might be some time difference between setting has_error and unregistration. However, the actor always prints the log.

I think what's happening is the actor is calling WriteAcquire() on the same channel, which does not call WriterChannelRegistered() to check if the channel is registered. Currently, CoreWorker::Get() does check ReaderChannelRegistered(), which is why the behavior is different.

After this PR is merged, can you add a check for WriterChannelRegistered()? I think this could be a good way to get more familiar with the C++ codebase. Thanks!
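
The asymmetry suggested above can be sketched in Python as follows. This models the hypothesis in this comment, not the real C++ code: the `Channel` class, the `get` and `write_acquire` functions, and the `ChannelClosedError` exception are hypothetical, and the assumption that the write path checks the error flag directly is taken from the "Channel closed" message the actor prints.

```python
class ChannelClosedError(Exception):
    """Hypothetical error raised when a closed channel is accessed."""


class Channel:
    def __init__(self):
        self.reader_registered = True
        self.writer_registered = True
        self.has_error = False

    def set_error(self):
        # Models teardown: flag the error and reset both registration flags.
        self.has_error = True
        self.reader_registered = False
        self.writer_registered = False


def get(channel):
    # Read side: registration is checked first, so after teardown the
    # channel is never consulted and the error flag is never seen.
    if not channel.reader_registered:
        return "treated as a plain object"
    if channel.has_error:
        raise ChannelClosedError("Channel closed")
    return "read from channel"


def write_acquire(channel):
    # Write side: no registration check, so the error flag is reached
    # directly and the caller sees the "Channel closed" error.
    if channel.has_error:
        raise ChannelClosedError("Channel closed")
    return "write acquired"


ch = Channel()
ch.set_error()

read_result = get(ch)
try:
    write_acquire(ch)
    write_raised = False
except ChannelClosedError:
    write_raised = True
print(read_result, write_raised)
```

Under these assumptions the reader silently takes the plain-object path while the writer reports the closed channel, which would explain why the two sides behave differently after teardown.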

kevin85421

Nice!

python/ray/dag/tests/experimental/test_accelerated_dag.py

kevin85421 · 2024-06-28T21:44:38Z

After this PR is merged, can you add a check for WriterChannelRegistered()? I think this could be a good way to get more familiar with the C++ codebase. Thanks!

Sounds good. Thanks!

ruisearch42

Looks good to approve. Still wondering if the API can be improved.

src/ray/core_worker/core_worker.cc

python/ray/dag/tests/experimental/test_accelerated_dag.py

src/ray/core_worker/experimental_mutable_object_provider.h

stephanie-wang · 2024-06-28T22:44:00Z

Should we just not reset reader_registered and writer_registered to false for now? That seems like a simpler fix.

jackhumphries · 2024-06-28T22:59:03Z

Should we just not reset reader_registered and writer_registered to false for now? That seems like a simpler fix.

I'd prefer to keep this as is, if that's alright. When I do the channel garbage collection work, I would imagine I'm going to need to implement what is currently in this PR if we don't keep it.

jackhumphries changed the title from "Channel error fix after accelerated DAG teardown" to "Fix for incorrect channel read behavior after accelerated DAG teardown" Jun 28, 2024

jackhumphries assigned stephanie-wang, kevin85421 and ruisearch42 Jun 28, 2024

jackhumphries force-pushed the 46284-fix branch 2 times, most recently from dea62f1 to 2154d1c, June 28, 2024 08:20

ruisearch42 reviewed Jun 28, 2024

kevin85421 reviewed Jun 28, 2024

jackhumphries force-pushed the 46284-fix branch from 7c32dd3 to 0855a1e, June 28, 2024 18:20

jackhumphries mentioned this pull request Jun 28, 2024

[core][experimental] Calling ray.get() on CompiledDAGRef after dag.teardown() or actor failure hangs #46284

Closed

kevin85421 approved these changes Jun 28, 2024

kevin85421 reviewed Jun 28, 2024

jackhumphries added the "go" label (add ONLY when ready to merge, run all tests) Jun 28, 2024

ruisearch42 reviewed Jun 28, 2024

kevin85421 approved these changes Jun 29, 2024

jackhumphries added 8 commits June 29, 2024 02:16

- e542d05 Work
- 33e484b Fix
- 6152753 Fixes
- 9a0a4c7 Fix
- 8453d2e Work
- 9c18967 Fix
- a26d910 Fix
- 262be37 Fix

Each commit is signed off by Jack Humphries <1645405+jackhumphries@users.noreply.github.com>.

jackhumphries force-pushed the 46284-fix branch from ecd8cf4 to 262be37, June 29, 2024 02:17

b6ed7cb Fix (Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>)

can-anyscale merged commit 8a0d633 into ray-project:master Jul 1, 2024
5 of 6 checks passed

kevin85421 mentioned this pull request Jul 9, 2024

[core][experimental] Check whether the channel is closed for the shared memory write operation #46508

Merged

8 tasks
