Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asymmetric memory #1000

Merged
merged 2 commits into from
Sep 5, 2024
Merged

Conversation

nsarka
Copy link
Collaborator

@nsarka nsarka commented Jul 12, 2024

In our UCC call, we defined two types of "asymmetric memory":

  • Weak asymmetric memory: When the src and dst mem_type's mismatch, but the mismatch is the same across ranks
  • Strong asymmetric memory: No rules for mem types at all, even across ranks. Src/dst mem_type's can mismatch on one rank but match on another

Since strong asymmetric memory would require an allreduce for every ucc_collective_init, this PR focuses on weak asymmetric memory.

Previously, what would happen is when the src/dst mismatch we can see that they mismatch on every rank, so everybody errors out saying we don't support asymmetric memory. But on rooted collectives, the root will do the same, but on non-root ranks there is only one src OR dst buffer, so there's no way to know that the root is going to error out. This caused a hang.

So, this PR will enable asymmetric memory for rooted collectives. On the root, it assumes the src buffer is the "true" mem_type, and makes a scratch allocation for the dst buffer of size dst.info.count * dt_size(dst.info.mem_type) with mem_type src.info.mem_type. After this, it will run the collective. Once the collective completes, it will copy out from the scratch allocation into the old dst buffer.

@Sergei-Lebedev My only remaining question is what should we do with non-rooted collectives? Should they still error out? Or should this apply to them too?

@nsarka nsarka force-pushed the nsarka/asymmetric-memory branch 3 times, most recently from df217dc to b4d4338 Compare July 12, 2024 17:21
@samnordmann
Copy link
Collaborator

Is it ready for review? If so, please add the label

@nsarka
Copy link
Collaborator Author

nsarka commented Jul 18, 2024 via email

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks great overall! I just left a bunch of very minor comments. A couple of further question/notes:

  1. The CI error looks relevant
  2. For my own curiosity: Is there a concrete motivation for this patch? I mean, is it a user's request, or do we only think it is a nice feature to have?
  3. Can we hope (in the future) to have a more efficient way to handle asymmetric memory, or does it need to be performed, as here, by emulating symmetric memory with a scratch buffer and copying in/out to the user's buffer?

src/coll_score/ucc_coll_score_map.c Show resolved Hide resolved
src/coll_score/ucc_coll_score_map.c Outdated Show resolved Hide resolved
src/core/ucc_coll.c Outdated Show resolved Hide resolved
src/schedule/ucc_schedule.h Outdated Show resolved Hide resolved
src/schedule/ucc_schedule.h Outdated Show resolved Hide resolved
test/gtest/asym_mem/test_asymmetric_memory.cc Outdated Show resolved Hide resolved
test/gtest/asym_mem/test_asymmetric_memory.cc Show resolved Hide resolved
test/gtest/asym_mem/test_asymmetric_memory.cc Outdated Show resolved Hide resolved
test/gtest/asym_mem/test_asymmetric_memory.cc Show resolved Hide resolved
test/gtest/asym_mem/test_asymmetric_memory.cc Show resolved Hide resolved
@nsarka nsarka force-pushed the nsarka/asymmetric-memory branch 2 times, most recently from 0bc716f to 62d54e9 Compare July 22, 2024 16:57
@nsarka
Copy link
Collaborator Author

nsarka commented Jul 22, 2024

  1. The CI error looks relevant

Thanks, I will take a look

  1. For my own curiosity: Is there a concrete motivation for this patch? I mean, is it a user's request, or do we only think it is a nice feature to have?

A hang was reported by the HPC SDK team (I think that's the name, I'm not sure, I heard it from Tommy) in the case of asymmetric memory in a rooted collective. Consider this case:
non-root ranks are UCC_MEMORY_TYPE_HOST, root rank is src=HOST dst=CUDA. Before this patch, the root will exit because the memory types are asymmetric. However, non-roots do not know that the root has exited. So there is a hang.

  1. Can we hope (in the future) to have a more efficient way to handle asymmetric memory, or does it need to be performed, as here, by emulating symmetric memory with a scratch buffer and copying in/out to the user's buffer?

I believe ucp can handle mismatched memory without any problems which is why I was wondering about implementing a copy-out mechanism like this for non-rooted collectives. I will test it and report back. For rooted collectives, I'm not sure yet if there's a better way.

@nsarka nsarka force-pushed the nsarka/asymmetric-memory branch 2 times, most recently from 6f7a658 to a446007 Compare July 25, 2024 15:27
src/schedule/ucc_schedule.h Outdated Show resolved Hide resolved
src/schedule/ucc_schedule.h Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.h Show resolved Hide resolved
src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
@nsarka
Copy link
Collaborator Author

nsarka commented Aug 20, 2024

Hi @Sergei-Lebedev , I have addressed all of your comments. For scatter and scatterv, a new src buf will be allocated and copied into. To support this as a persistent collective I moved the copy-in to ucc_collective_post. Then I added a persistent gtest that will:

  • Init with a buffer
  • In a loop:
    1. Update the src buffer with a different value
    2. ucc_collective_post
    3. Data validate
  • Finalize

@nsarka
Copy link
Collaborator Author

nsarka commented Aug 20, 2024

CI is failing with:

failed to register layer: write libcupti.so.11.3:
no space left on device
Error: Docker pull failed with exit code 1

ucc test had a jenkins issue according to Artem

@nsarka
Copy link
Collaborator Author

nsarka commented Aug 21, 2024

bot:retest

src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.c Outdated Show resolved Hide resolved
src/utils/ucc_coll_utils.c Show resolved Hide resolved
@Sergei-Lebedev
Copy link
Contributor

@nsarka LGTM, please fix commit messages to pass codestyle check

@nsarka
Copy link
Collaborator Author

nsarka commented Sep 4, 2024

The CI is failing on:

[2024-09-03T23:44:41.401Z] [----------] 24 tests from test_tl_mlx5_dm
[2024-09-03T23:44:41.401Z] [ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0
[2024-09-03T23:44:41.401Z] [swx-clx01:452  :0:452] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[2024-09-03T23:44:41.401Z] ==== backtrace (tid:    452) ====
[2024-09-03T23:44:41.401Z]  0  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f173e614564]
[2024-09-03T23:44:41.401Z]  1  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7f173e61475f]
[2024-09-03T23:44:41.401Z]  2  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7f173e614a46]
[2024-09-03T23:44:41.401Z]  3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f173dd06520]
[2024-09-03T23:44:41.401Z]  4  /usr/lib/x86_64-linux-gnu/libibverbs.so.1(ibv_dereg_mr+0x25) [0x7f173e59c115]
[2024-09-03T23:44:41.401Z]  5  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x19916c9) [0x55d6cea2e6c9]
[2024-09-03T23:44:41.401Z]  6  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x57ed91) [0x55d6cd61bd91]
[2024-09-03T23:44:41.401Z]  7  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5726e2) [0x55d6cd60f6e2]
[2024-09-03T23:44:41.401Z]  8  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x57296e) [0x55d6cd60f96e]
[2024-09-03T23:44:41.401Z]  9  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x573a49) [0x55d6cd610a49]
[2024-09-03T23:44:41.401Z] 10  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x573f48) [0x55d6cd610f48]
[2024-09-03T23:44:41.401Z] 11  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x537d02) [0x55d6cd5d4d02]
[2024-09-03T23:44:41.401Z] 12  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f173dcedd90]
[2024-09-03T23:44:41.401Z] 13  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f173dcede40]
[2024-09-03T23:44:41.401Z] 14  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x54ba95) [0x55d6cd5e8a95]
[2024-09-03T23:44:41.401Z] =================================
[2024-09-03T23:44:43.291Z] make[1]: *** [Makefile:2030: test] Segmentation fault (core dumped)
[2024-09-03T23:44:43.291Z] make[1]: Leaving directory '/opt/nvidia/src/ucc/build/test/gtest'
[2024-09-03T23:44:43.291Z] make: *** [Makefile:1001: gtest] Error 2

This seems unrelated to my changes

@Sergei-Lebedev Sergei-Lebedev merged commit 7f85fba into openucx:master Sep 5, 2024
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants