Fix for incorrect channel read behavior after accelerated DAG teardown #46320
Where does
Are you suggesting that we should use
When I said "unregistered", I meant that For the second question, what you said is correct.
Great findings!
The PR description makes sense to me, but I have a question:
Based on my observation at #46284 (comment), the actor method can get an error message like `(Actor pid=2054169) check_status: Channel closed. False True`. This means that:
- The shared memory channel calls `close()` and writes the error into `has_error` successfully. This also implies that the channels are unregistered.
- However, it can still get the error, which means it calls `CheckHasError`, which seems to be called only in `ReadAcquire`. If it is unregistered, it should not call `ReadAcquire`. There might be some time difference between setting `has_error` and unregistration. However, the actor always prints the log.
I think what's happening is the actor is calling After this PR is merged, can you add a check for
Nice!
Sounds good. Thanks!
Looks good to approve. Still wondering if the API can be improved.
Should we just not reset `reader_registered` and `writer_registered` to false for now? That seems like a simpler fix.
I'd prefer to keep this as is, if that's alright. When I do the channel garbage collection work, I would imagine I'm going to need to implement what is currently in this PR if we don't keep it.
Why are these changes needed?
Prior to this PR (described in #46284), calling `ray.get()` on a `CompiledDAGRef` (i.e., a channel) after DAG teardown would return a large series of zeroes. This issue could be reproduced with this script:
This issue happened because the channel was unregistered with the mutable object manager on DAG teardown. On a subsequent access to the channel, the core worker therefore treated the channel reference as a normal immutable Ray object rather than as a channel mutable object. As a result, the core worker returned the raw underlying memory for the mutable object, and the memory buffers were sized to the total size of the underlying memory, not to the amount of data in the mutable object.
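To make the "series of zeroes" symptom concrete, here is a minimal toy model of the size mismatch. It is only a sketch: `ToyMutableObject`, `read_as_channel`, and `read_as_plain_object` are hypothetical names invented for illustration and do not reflect Ray's actual buffer layout or API.

```python
class ToyMutableObject:
    """Hypothetical stand-in for a channel's backing buffer (not Ray's real layout)."""

    def __init__(self, allocated_size: int):
        self.allocated_size = allocated_size
        self.data_size = 0  # number of bytes of the most recent write
        self.buffer = bytearray(allocated_size)  # zero-filled backing memory

    def write(self, payload: bytes) -> None:
        if len(payload) > self.allocated_size:
            raise ValueError("payload larger than allocated buffer")
        self.buffer[: len(payload)] = payload
        self.data_size = len(payload)


def read_as_channel(obj: ToyMutableObject) -> bytes:
    # Channel-aware read: return only the bytes that were actually written.
    return bytes(obj.buffer[: obj.data_size])


def read_as_plain_object(obj: ToyMutableObject) -> bytes:
    # Read that mistakes the channel for an ordinary object: return the
    # whole allocation, including the unused zero-filled tail.
    return bytes(obj.buffer)
```

In this toy model, a small payload written into a large allocation reads back correctly through the channel-aware path, while the plain-object path returns the payload followed by zero padding up to the allocated size.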
This PR fixes this issue by properly checking that a channel is either currently registered or previously registered, rather than just checking only that the channel is currently registered.
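The difference between the two checks can be sketched with a small registry model. This is an illustration under stated assumptions, not the PR's implementation: `ToyChannelRegistry` and `route_read` are hypothetical names, and the real change lives in Ray's core worker and mutable object manager.

```python
class ToyChannelRegistry:
    """Hypothetical model of channel registration state (not Ray's API)."""

    def __init__(self):
        self._registered = set()       # channels registered right now
        self._ever_registered = set()  # channels registered at any point

    def register(self, channel_id: str) -> None:
        self._registered.add(channel_id)
        self._ever_registered.add(channel_id)

    def unregister(self, channel_id: str) -> None:
        # Teardown removes the channel from the current set only;
        # the historical record is kept.
        self._registered.discard(channel_id)

    def is_registered(self, channel_id: str) -> bool:
        return channel_id in self._registered

    def was_ever_registered(self, channel_id: str) -> bool:
        return channel_id in self._ever_registered


def route_read(registry: ToyChannelRegistry, object_id: str, fixed: bool) -> str:
    # Before the fix (in this sketch): only current registration is consulted.
    # After the fix: current OR previous registration marks a channel.
    if fixed:
        is_channel = registry.was_ever_registered(object_id)
    else:
        is_channel = registry.is_registered(object_id)
    return "channel path" if is_channel else "plain-object path"
```

With this model, a read after `unregister()` is routed to the plain-object path when `fixed=False` and stays on the channel path when `fixed=True`. What the channel path does for a torn-down channel (for example, surfacing a closed-channel error) is outside the scope of this sketch.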
Related issue number
Closes #46284
Checks
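- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.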