Improving communication overlap for the case of multi kernel queue usage #1308
base: main
Conversation
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
for more information, see https://pre-commit.ci
@erhoo82 Hi Sangkug, this is a PR for the launch ordering work. Could you please assign a reviewer?
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
for more information, see https://pre-commit.ci
if (comm_launch_event) {
  SETUP_LAUNCH_CONFIG_WITH_COMPLETION_EVENT(sms, warps * 32, stream, comm_launch_event);
  callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)
} else {
  SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);
  callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)
}
Same here for duplicated kernel launch code.
Suggested change:

if (comm_launch_event) {
  SETUP_LAUNCH_CONFIG_WITH_COMPLETION_EVENT(sms, warps * 32, stream, comm_launch_event);
} else {
  SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);
}
callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)
Hi @denera, the suggested coding style causes a compile error, which is why I had to duplicate the kernel launch.
Since both SETUP_LAUNCH_CONFIG and callranks_** are #define macros, there is a variable-scope issue: the setup macro declares the launch-configuration variables that the kernel-call macro references, so the compute-kernel call must expand in the same scope as (or a scope nested inside) the SETUP macro. The same issue applies to the other comments. If you have a better solution for this, please let me know.
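For illustration, here is a minimal sketch of that scoping problem (the macro bodies, dummy_kernel, and CALL_KERNEL below are hypothetical stand-ins, not the actual TE definitions): the setup macro expands into local variable declarations, and the call macro references those locals by name, so the call cannot sit in an outer scope relative to the setup.

#include <cuda_runtime.h>

template <int N>
__global__ void dummy_kernel() {}

// Hypothetical stand-in: the setup macro declares local launch-configuration variables...
#define SETUP_LAUNCH_CONFIG(sms, threads, stream) \
  dim3 grid_(sms);                                \
  dim3 block_(threads);                           \
  cudaStream_t stream_ = (stream);

// ...and the call macro references those locals by name.
#define CALL_KERNEL(n) dummy_kernel<n><<<grid_, block_, 0, stream_>>>();

void launch_ok(int sms, int warps, cudaStream_t stream) {
  // Compiles: the call expands in the same scope where grid_/block_/stream_ were declared.
  SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);
  CALL_KERNEL(2)
}

void launch_broken(int sms, int warps, cudaStream_t stream, bool use_event) {
  if (use_event) {
    SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);  // locals live only inside this branch
  } else {
    SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);  // locals live only inside this branch
  }
  // CALL_KERNEL(2)  // does not compile: grid_/block_/stream_ are out of scope here
}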
@youngeunkwon0405 The TP overlap unit tests explicitly set […]. Also, please launch the L1 tests with […].
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
for more information, see https://pre-commit.ci
Hi @denera, I have updated the […]. Also, could you please elaborate on the following in more detail? I am new to writing tests and also new to the CI process.
I have tested only the modified test case, and the following is the new result:

../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 1 connections] PASSED
========================= 6 passed in 93.35s (0:01:33) =========================
/te-ci pytorch L1
LGTM, pending rebase on latest TE/main and clean CI results.
@denera Rebased onto the latest main. Could you please let me know what the next step would be?
/te-ci pytorch L1
/te-ci pytorch L1
Description
The current TP overlap relies on a single kernel queue to configure launch ordering and thereby control the compute-communication overlap; this fails to overlap when multiple kernel queues are used.
This PR enforces launch ordering between the communication kernel and the compute kernel using the LaunchCompletionEvent feature, ensuring the overlap.
This feature is specific to Hopper and applies only to the bulk overlap cases.
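As a rough illustration of the mechanism (a minimal sketch assuming CUDA 12.3+, with made-up kernel names and launch dimensions, not the actual TE implementation): the communication kernel is launched with a launch-completion event attached, and the compute stream waits on that event, so the compute kernel cannot be scheduled ahead of the communication kernel even when the two sit on different kernel queues.

#include <cuda_runtime.h>

__global__ void comm_kernel() { /* communication work */ }
__global__ void compute_kernel() { /* GEMM-like compute work */ }

void launch_with_ordering(cudaStream_t comm_stream, cudaStream_t compute_stream) {
  cudaEvent_t comm_launched;
  cudaEventCreateWithFlags(&comm_launched, cudaEventDisableTiming);

  // Attach a launch-completion event to the communication kernel: the event is
  // recorded once the kernel has finished launching, not once it has finished running.
  cudaLaunchConfig_t cfg = {};
  cfg.gridDim = dim3(16);
  cfg.blockDim = dim3(4 * 32);
  cfg.stream = comm_stream;

  cudaLaunchAttribute attr = {};
  attr.id = cudaLaunchAttributeLaunchCompletionEvent;
  attr.val.launchCompletionEvent.event = comm_launched;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;

  cudaLaunchKernelEx(&cfg, comm_kernel);

  // The compute stream waits for the communication kernel's launch to complete,
  // so its kernel is ordered after the communication kernel at launch time.
  cudaStreamWaitEvent(compute_stream, comm_launched, 0);
  compute_kernel<<<16, 128, 0, compute_stream>>>();

  cudaEventDestroy(comm_launched);
}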
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: