ENH: Fix cuda storage transfer deadlock on multiple GPUs #788

luweizheng · 2024-07-06T00:44:14Z

Fix a storage transfer deadlock in CUDA storage on multiple GPUs. This work is based on #488.

The current implementation of the transfer function deadlocks when Xorbits runs on multiple GPUs. StorageHandlerActor.fetch_batch invokes SenderManagerActor.send_batch_data, which subsequently calls StorageHandlerActor.request_quota_with_spill. Because StorageHandlerActor method calls are guarded by a lock, the nested call waits on a lock that is still held by the outer call, and the transfer deadlocks.
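The call chain above can be reproduced in miniature with plain asyncio. The following is a toy model, not Xorbits code: `ToyHandler`, `ToySender`, and `demo` are hypothetical names, and a single non-re-entrant `asyncio.Lock` stands in for the actor's call lock (an assumption about how the locking behaves). It only illustrates why a callback into a handler that is still inside a locked method never completes.

```python
import asyncio


class ToyHandler:
    """Hypothetical stand-in for an actor whose method calls share one lock."""

    def __init__(self):
        self._lock = asyncio.Lock()  # asyncio.Lock is not re-entrant
        self.sender = None

    async def fetch_batch(self):
        async with self._lock:
            # The lock is still held while we wait on the sender.
            return await self.sender.send_batch_data()

    async def request_quota_with_spill(self):
        # Waits forever if fetch_batch on the same handler holds the lock.
        async with self._lock:
            return "quota granted"


class ToySender:
    """Hypothetical stand-in for the sender that calls back into the handler."""

    def __init__(self, handler):
        self.handler = handler

    async def send_batch_data(self):
        return await self.handler.request_quota_with_spill()


async def demo(timeout=0.2):
    handler = ToyHandler()
    handler.sender = ToySender(handler)
    try:
        # Without the timeout this await would never return.
        await asyncio.wait_for(handler.fetch_batch(), timeout)
    except asyncio.TimeoutError:
        return "deadlock"
    return "ok"


result = asyncio.run(demo())
print(result)  # prints "deadlock"
```

The timeout exists only so the sketch terminates; in the real system nothing cancels the waiting call.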

Notes:

Fixes the deadlock on GPU.
Fixes the GPU buffer size 0 issue.
Tested on TPC-H SF10, and it works. Much faster than Dask-cuDF.
Known issue: test_transfer_gpu.py has bugs with the UCX channel, while the socket channel works.
Known issue: this implementation makes too many actor calls, which may degrade performance.

Check code requirements

tests added / passed (if needed)
Ensure all linting tests pass

codecov · 2024-07-06T06:16:45Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.46%. Comparing base (5bb0211) to head (b8ddb25).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #788      +/-   ##
==========================================
- Coverage   82.49%   82.46%   -0.03%     
==========================================
  Files        1071     1071              
  Lines       80119    80167      +48     
  Branches    12202    12207       +5     
==========================================
+ Hits        66094    66110      +16     
- Misses      12478    12496      +18     
- Partials     1547     1561      +14

Flag	Coverage Δ
unittests	`82.36% <ø> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown.

hucorz · 2024-11-15T08:59:30Z

The original transfer process is illustrated below. Each StorageHandler corresponds to a Receiver. During a transfer, one handler requests the receiver to create a writer, and the receiver then asks a handler in the same pool to create the writer. On GPUs, however, there is only one handler, so the receiver's request goes back to a handler that is already busy with the transfer, which leads to a deadlock.

The solution is to move part of the receiver's logic to the handler. When opening a writer, the process checks whether the current handler and receiver are in the same pool. If they are, the handler opens the writer and passes it to the receiver. Otherwise, the receiver requests another handler in the same pool to open the writer. The updated process is shown below.
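Separately from the diagrams in this comment, the routing rule can be sketched as a small toy model. Everything here is hypothetical (`ToyHandler`, `ToyReceiver`, `open_writer`, the pool names, and the string "writers"); it is not the code in this PR, and it assumes a pool can be identified by a simple label. It only shows the same-pool check that decides who opens the writer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHandler:
    """Hypothetical storage handler bound to one pool."""
    name: str
    pool: str

    def open_writer(self, data_key: str) -> str:
        # A real writer would be a file-like object; a string is enough here.
        return f"{data_key}@{self.name}"


@dataclass
class ToyReceiver:
    """Hypothetical receiver that tracks writers for incoming data."""
    pool: str
    local_handlers: list = field(default_factory=list)
    writers: dict = field(default_factory=dict)

    def accept_writer(self, data_key: str, writer: str) -> None:
        # Same-pool path: the calling handler already opened the writer.
        self.writers[data_key] = writer

    def open_writer_locally(self, data_key: str) -> None:
        # Cross-pool path: ask a handler in the receiver's own pool.
        self.writers[data_key] = self.local_handlers[0].open_writer(data_key)


def open_writer(handler: ToyHandler, receiver: ToyReceiver, data_key: str) -> str:
    if handler.pool == receiver.pool:
        # The handler opens the writer itself and hands it over, so the
        # receiver never has to call back into a handler that is busy.
        receiver.accept_writer(data_key, handler.open_writer(data_key))
    else:
        receiver.open_writer_locally(data_key)
    return receiver.writers[data_key]


gpu0 = ToyHandler("handler-gpu0", pool="gpu0")
gpu1 = ToyHandler("handler-gpu1", pool="gpu1")
recv0 = ToyReceiver(pool="gpu0", local_handlers=[gpu0])

same_pool = open_writer(gpu0, recv0, "chunk-a")   # opened by the caller
cross_pool = open_writer(gpu1, recv0, "chunk-b")  # opened via recv0's pool
print(same_pool, cross_pool)
```

In both cases the writer ends up owned by a handler in the receiver's pool; the difference is that the same-pool case no longer requires a callback into the handler that initiated the request.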

hucorz

LGTM

UranusSeven and others added 23 commits September 25, 2023 15:07

REF: refactor transfer to avoid deadlocks

4048a5b

Debugging

be78817

Debugging

640864a

Debugging

6ab37e1

Fix

ad42bc8

Debugging

f434eab

Fix

4d80dbb

Fix

30f58a7

checkpoint

6c13816

fix

8fea60f

REF: refactor transfer to avoid deadlocks

5dafe70

Debugging

89ad935

Debugging

dd63a1c

Debugging

df23a15

Fix

072848f

Debugging

373e7bc

Fix

aa264be

Fix

010871d

checkpoint

4a629ab

fix

faa2096

Merge branch 'ref/transfer' of github.com:UranusSeven/xorbits into st…

c485cb1

…orage

test

4026b4c

remove debug log

27a8333

XprobeBot added enhancement New feature or request gpu labels Jul 6, 2024

XprobeBot added this to the v0.7.3 milestone Jul 6, 2024

luweizheng force-pushed the feat/cudf branch from ba742d5 to 27a8333 Compare July 6, 2024 05:51

Merge branch 'main' into feat/cudf

77cf425

delete logs

c26c46d

luweizheng and others added 3 commits July 7, 2024 22:15

Merge branch 'feat/cudf' of github.com:luweizheng/xorbits into feat/cudf

7ceaae8

debug log

9ba9de3

Merge branch 'main' into feat/cudf

65b37e9

XprobeBot modified the milestones: v0.7.3, v0.7.4 Aug 22, 2024

luweizheng and others added 19 commits August 23, 2024 18:53

merge main

1433869

fix test

af2fe83

Merge branch 'main' into feat/cudf

4a059f4

Merge branch 'main' into feat/cudf

44fa88a

Merge branch 'main' into feat/cudf

3330f52

options & chunking

0166a07

Merge branch 'config' into feat/cudf

dc8748a

logging

f25017c

test

99111c8

flake8

69e653d

lint

397fbe7

lint

02c7fc7

config

229998b

Merge branch 'config' into feat/cudf

5c467bb

merge main

b7ed36c

Merge branch 'main' into feat/cudf

5bfb8b7

Merge branch 'main' into feat/cudf

7330ac6

fix run on cpu

b3195b3

fix lint

5517507

remove learn currently

b8ddb25

hucorz approved these changes Nov 16, 2024

luweizheng merged commit 573ee79 into xorbitsai:main Nov 16, 2024
38 of 40 checks passed
