-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NVIDIA GPU] LHS enhancement for multiple collective resources #19026
base: main
Are you sure you want to change the base?
[NVIDIA GPU] LHS enhancement for multiple collective resources #19026
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Sorry, I don't follow why we need to expose ReadySetLt
to implement this functionality.
ReadySetLt
can take the IsResourceConstrained
as a function type to it's constructor (similarly as early_target_scheduling_rule_
) and just use this. It will be less code and it will be more coherent with how we do things in this codebase currently.
f839d6e
to
7ba2915
Compare
Thanks for the suggestion @golechwierowicz! I've refactored the code to follow the existing paradigm, could you take another look? |
7ba2915
to
09ce310
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…rces Imported from GitHub PR #19026 With #17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310 by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 FUTURE_COPYBARA_INTEGRATE_REVIEW=#19026 from terryysun:terryysun/overlapping_collectives 09ce310 PiperOrigin-RevId: 696020313
…rces Imported from GitHub PR openxla/xla#19026 With openxla/xla#17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310cc14f763ec8a0a711b1787f7798122d4c by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19026 from terryysun:terryysun/overlapping_collectives 09ce310cc14f763ec8a0a711b1787f7798122d4c PiperOrigin-RevId: 696020313
…rces Imported from GitHub PR #19026 With #17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310 by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 FUTURE_COPYBARA_INTEGRATE_REVIEW=#19026 from terryysun:terryysun/overlapping_collectives 09ce310 PiperOrigin-RevId: 696020313
…rces Imported from GitHub PR openxla/xla#19026 With openxla/xla#17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310cc14f763ec8a0a711b1787f7798122d4c by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19026 from terryysun:terryysun/overlapping_collectives 09ce310cc14f763ec8a0a711b1787f7798122d4c PiperOrigin-RevId: 696020313
…rces Imported from GitHub PR #19026 With #17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310 by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 FUTURE_COPYBARA_INTEGRATE_REVIEW=#19026 from terryysun:terryysun/overlapping_collectives 09ce310 PiperOrigin-RevId: 696020313
I believe none of the |
…rces Imported from GitHub PR openxla/xla#19026 With openxla/xla#17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310cc14f763ec8a0a711b1787f7798122d4c by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 Reverts 2a9cc1c FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19026 from terryysun:terryysun/overlapping_collectives 09ce310cc14f763ec8a0a711b1787f7798122d4c PiperOrigin-RevId: 696020313
…rces Imported from GitHub PR #19026 With #17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310 by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 FUTURE_COPYBARA_INTEGRATE_REVIEW=#19026 from terryysun:terryysun/overlapping_collectives 09ce310 PiperOrigin-RevId: 696020313
…rces Imported from GitHub PR openxla/xla#19026 With openxla/xla#17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend. Copybara import of the project: -- 09ce310cc14f763ec8a0a711b1787f7798122d4c by Terry Sun <tesun@nvidia.com>: LHS deadlock avoidance Merging this change closes #19026 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19026 from terryysun:terryysun/overlapping_collectives 09ce310cc14f763ec8a0a711b1787f7798122d4c PiperOrigin-RevId: 696020313
09ce310
to
14362ea
Compare
Thanks for the suggestion @seherellis! I refactored the code with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant you don't need any of the changes in the latency_hiding_scheduler.cc/h (neither the past ones nor the current ones). If an instruction is resource constrained, the existing scheduling_node_crosses_overlap_limit
will already trigger to be true and that node will be skipped, so ReadySetLt will not be called on that node.
I don't understand why you need to change the existing rules to enforce that some resources can have at most one instance/occupier in flight. It is already there. Just set the limit to 1 so that scheduling_node_crosses_overlap_limit
can take care of it. It seems like you are trying the reinvent the wheel.
The goal of the changes in this PR is to avoid deadlocks when we allow multiple collectives to overlap, it's not blocking all cases when # available resources > 1, just some cases where we know there can be deadlocks. |
I understand you don't want to overlap certain collectives together, is this correct?
Therefore, you don't need to change anything in the existing rules. We can set up a meeting if it is still not clear. |
The resource limit is 2 https://github.com/openxla/xla/pull/19026/files#diff-4d02a4c828453f0e66c4280dc9a054f0536a1f3fefcd41d298af0d6c182b5a8eR423 |
Why aren't you setting the limit to 1 if you don't want them to overlap? |
We want multiple collectives to be able to overlap. The unit test is a case when such overlapping can lead to deadlock so we want to prevent it. In practice we can expect more cases where they should be allowed to overlap. |
I see. Can you
Then you can set the limit on each rank to be 1 and the existing rules will work correctly without any modification. |
No, fundamentally it's not a constrain on any single GPU. For example, say we have two collective with replica groups as blow: collective_0: {0, 1}, collective_1: {0, 1} -- NOT OK to overlap and then another two collectives with replica groups below: collective_2: {1, 2}, collective_3: {1, 2} -- NOT OK to overlap There's no way to control the behavior with limit here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, now I understand what you are doing. I am okay with the latest changes in latency_hiding_scheduler.h/cc.
@golechwierowicz I leave the rest of the review to you. Please check my other comment about style preferences.
// Number of overlapping ranks between this occupier and candidate | ||
size_t overlapping_count = CountOverlappingRanks( | ||
curr_start_inst->replica_groups(), occupier->replica_groups()); | ||
// VLOG(0) << "overlapping_count: " << overlapping_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be forgotten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! fixed
} | ||
} | ||
|
||
if (node->GetResources().size() == 0) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we always prefer the following style:
if (condition) {
return false;
}
@golechwierowicz I'm not sure how you check and communicate our style preferences to external contributors. I leave this you. I'm also seeing different integer types (int, int64_t, size_t), just FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a linter/formatter I can run to get some warnings on these style preference? I'd love to follow the preferred style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw in case this can save people a minute, I looked into https://google.github.io/styleguide/cppguide.html and here's what I found:
regarding curly braces in branching statements:
// OK - braces are optional in this case.
if (x == kFoo) return new Foo();
// OK - condition fits on one line, body fits on another.
if (x == kBar)
Bar(arg1, arg2, arg3);
Google C++ Style Guide allows omitting curly braces for short statements. Nonetheless, I updated it to using curly braces as above are said to be exceptions.
regarding integer types:
The standard library header <cstdint> defines types like int16_t, uint32_t, int64_t, etc. You should always use those in preference to short, unsigned long long and the like, when you need a guarantee on the size of an integer. Prefer to omit the std:: prefix for these types, as the extra 5 characters do not merit the added clutter. Of the built-in integer types, only int should be used. When appropriate, you are welcome to use standard type aliases like size_t and ptrdiff_t.
We use int very often, for integers we know are not going to be too big, e.g., loop counters. Use plain old int for such things. You should assume that an int is at least 32 bits, but don't assume that it has more than 32 bits. If you need a 64-bit integer type, use int64_t or uint64_t.
I believe the usages of int/int64_t/size_t in this PR comply with the style guide, but I'm open to suggestions in case you guys have higher standard internally.
With #17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend.