ENH: Fix cuda storage transfer deadlock on multiple GPUs #788
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #788 +/- ##
==========================================
- Coverage 82.49% 82.46% -0.03%
==========================================
Files 1071 1071
Lines 80119 80167 +48
Branches 12202 12207 +5
==========================================
+ Hits 66094 66110 +16
- Misses 12478 12496 +18
- Partials 1547 1561 +14
Flags with carried forward coverage won't be shown.
The original transfer process is illustrated below.
[Figure: original transfer process]
The solution is to move part of the receiver's logic to the handler. When opening a writer, the process checks whether the current handler and receiver are in the same pool. If they are, the handler opens the writer and passes it to the receiver. Otherwise, the receiver requests another handler in the same pool to open the writer. The updated process is shown below.
[Figure: updated transfer process]
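The writer-opening rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Actor`, `choose_writer_opener`, and the pool labels are hypothetical names, and it assumes that "the same pool" in the fallback case means the receiver's pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    """Hypothetical stand-in for a handler or receiver actor."""
    name: str
    pool: str  # hypothetical label of the pool the actor lives in


def choose_writer_opener(current_handler, receiver, handlers):
    """Pick which handler should open the writer for `receiver`.

    If the current handler shares a pool with the receiver, it opens the
    writer itself. Otherwise a handler from the receiver's pool is used
    (assumed reading of the description above).
    """
    if current_handler.pool == receiver.pool:
        return current_handler
    for handler in handlers:
        if handler.pool == receiver.pool:
            return handler
    raise LookupError(f"no handler found in pool {receiver.pool!r}")


# Toy setup: one handler per GPU pool.
handlers = [Actor("handler-0", "gpu-0"), Actor("handler-1", "gpu-1")]
same_pool = choose_writer_opener(handlers[0], Actor("receiver", "gpu-0"), handlers)
other_pool = choose_writer_opener(handlers[0], Actor("receiver", "gpu-1"), handlers)
```

In the first call the current handler opens the writer itself; in the second, the handler living in the receiver's pool is selected instead.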
LGTM
Fix the storage transfer deadlock of CUDA storage on multiple GPUs. This work is based on #488.
The current implementation of the transfer function leads to a deadlock when executing Xorbits on multiple GPUs. The issue arises from the `StorageHandlerActor.fetch_batch` function, which invokes `SenderManagerActor.send_batch_data`, which in turn calls `StorageHandlerActor.request_quota_with_spill`. Because `StorageHandlerActor` method calls are guarded by a lock, the nested call back into the same actor waits on a lock that `fetch_batch` still holds, and a deadlock arises.
NOTEs:
`test_transfer_gpu.py` has bugs with the ucx channel, while it works with the socket channel.