[NVIDIA GPU] LHS enhancement for multiple collective resources #19026

Open

terryysun wants to merge 2 commits into main from terryysun:terryysun/overlapping_collectives

Conversation

terryysun
Contributor

With #17749, we can let LHS schedule multiple collective resources. There are some cases in which two collectives cannot be overlapped: when two collectives on different streams share at least 2 ranks, they can form a cyclic dependency, because the execution order of the NCCL kernels can differ across ranks. This PR refactors LHS to expose the comparator to the backend and enforces the above constraint for the GPU backend.
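The constraint is easy to state in isolation. Below is a minimal sketch, assuming replica groups reduced to plain rank lists; the helper name CountOverlappingRanks mirrors the one in this PR's diff, everything else is illustrative rather than the actual scheduler code.

#include <cstddef>
#include <set>
#include <vector>

// Sketch only: replica groups are simplified to plain vectors of ranks, and
// the helper name mirrors CountOverlappingRanks from this PR's diff.
size_t CountOverlappingRanks(const std::vector<std::vector<int>>& groups_a,
                             const std::vector<std::vector<int>>& groups_b) {
  std::set<int> ranks_a;
  for (const auto& group : groups_a) {
    ranks_a.insert(group.begin(), group.end());
  }
  std::set<int> shared;
  for (const auto& group : groups_b) {
    for (int rank : group) {
      if (ranks_a.count(rank) > 0) {
        shared.insert(rank);
      }
    }
  }
  return shared.size();
}

// Two collectives may overlap on different streams only if they share fewer
// than two ranks; otherwise the per-rank NCCL launch order can differ and
// form a cyclic dependency (deadlock).
bool MayOverlap(const std::vector<std::vector<int>>& a,
                const std::vector<std::vector<int>>& b) {
  return CountOverlappingRanks(a, b) < 2;
}

Under this rule, two collectives over ranks {0, 1} must not overlap, while one over {0, 1} and one over {0} may.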

Member

@golechwierowicz golechwierowicz left a comment


Thanks for the PR!

Sorry, I don't follow why we need to expose ReadySetLt to implement this functionality.

ReadySetLt can take IsResourceConstrained as a function type in its constructor (similarly to early_target_scheduling_rule_) and just use that. It will be less code and more coherent with how we do things in this codebase currently.
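For illustration only, the injection pattern being suggested might look like the sketch below; HloGraphNode and the constructor signature here are placeholders, not the actual ReadySetLt interface.

#include <functional>
#include <utility>

// Hypothetical stand-in for the scheduler's graph node type.
struct HloGraphNode {
  int id = 0;
};

class ReadySetLt {
 public:
  using IsResourceConstrainedFn = std::function<bool(const HloGraphNode&)>;

  // The backend injects its constraint check through the constructor, the
  // same way early_target_scheduling_rule_ is supplied, instead of exposing
  // the comparator class itself.
  explicit ReadySetLt(IsResourceConstrainedFn is_resource_constrained)
      : is_resource_constrained_(std::move(is_resource_constrained)) {}

  bool operator()(const HloGraphNode& a, const HloGraphNode& b) const {
    // The existing ordering rules would go here, consulting
    // is_resource_constrained_(a) / is_resource_constrained_(b) where the
    // backend-specific constraint matters.
    return a.id < b.id;  // Placeholder ordering.
  }

 private:
  IsResourceConstrainedFn is_resource_constrained_;
};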

@terryysun terryysun force-pushed the terryysun/overlapping_collectives branch from f839d6e to 7ba2915 on November 10, 2024 23:11
@terryysun
Contributor Author

Thanks for the suggestion @golechwierowicz! I've refactored the code to follow the existing paradigm, could you take another look?

Member

@golechwierowicz golechwierowicz left a comment


LGTM!

copybara-service bot pushed a commit that referenced this pull request Nov 13, 2024
[NVIDIA GPU] LHS enhancement for multiple collective resources

Imported from GitHub PR #19026

With #17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend.
Copybara import of the project:

--
09ce310 by Terry Sun <tesun@nvidia.com>:

LHS deadlock avoidance

Merging this change closes #19026

FUTURE_COPYBARA_INTEGRATE_REVIEW=#19026 from terryysun:terryysun/overlapping_collectives 09ce310
PiperOrigin-RevId: 696020313
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 13, 2024
[NVIDIA GPU] LHS enhancement for multiple collective resources

Imported from GitHub PR openxla/xla#19026

With openxla/xla#17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend.
Copybara import of the project:

--
09ce310cc14f763ec8a0a711b1787f7798122d4c by Terry Sun <tesun@nvidia.com>:

LHS deadlock avoidance

Merging this change closes #19026

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19026 from terryysun:terryysun/overlapping_collectives 09ce310cc14f763ec8a0a711b1787f7798122d4c
PiperOrigin-RevId: 696020313
@seherellis seherellis self-requested a review November 13, 2024 21:50
@seherellis
Member

I believe none of the is_resource_constrained additions in the existing rules are needed. Why can't you use scheduling_instruction_crosses_overlap_limit instead (to prevent those instructions from being scheduled)?

@terryysun terryysun force-pushed the terryysun/overlapping_collectives branch from 09ce310 to 14362ea on November 14, 2024 03:06
@terryysun
Contributor Author

I believe none of the is_resource_constrained additions in the existing rules are needed. Why can't you use scheduling_instruction_crosses_overlap_limit instead (to prevent those instructions from being scheduled)?

Thanks for the suggestion @seherellis! I refactored the code with scheduling_instruction_crosses_overlap_limit and it's cleaner now. @golechwierowicz could you help take another look? thanks!

Member

@seherellis seherellis left a comment


I meant that you don't need any of the changes in latency_hiding_scheduler.cc/h (neither the past ones nor the current ones). If an instruction is resource constrained, the existing scheduling_node_crosses_overlap_limit will already be true and that node will be skipped, so ReadySetLt will not be called on that node.

I don't understand why you need to change the existing rules to enforce that some resources can have at most one instance/occupier in flight. That capability is already there. Just set the limit to 1 so that scheduling_node_crosses_overlap_limit can take care of it. It seems like you are trying to reinvent the wheel.
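As a rough sketch of the mechanism being described (hypothetical names and a simplified bookkeeping structure, not the scheduler's actual data structures):

#include <map>
#include <string>

// Simplified bookkeeping: the scheduler tracks how many occupiers of each
// resource are in flight, and a node whose resources are already at their
// limit is skipped before the comparator ever runs.
struct ResourceTracker {
  std::map<std::string, int> in_flight;  // resource -> open occupiers
  std::map<std::string, int> limit;      // resource -> max concurrent occupiers

  // True if scheduling one more occupier of `resource` would exceed its limit.
  bool CrossesOverlapLimit(const std::string& resource) const {
    auto it = in_flight.find(resource);
    const int current = (it == in_flight.end()) ? 0 : it->second;
    return current + 1 > limit.at(resource);
  }
};

With limit["collective"] set to 1, a second collective start crosses the limit while the first one is still open, which is the behavior described above.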

@terryysun
Contributor Author

I meant that you don't need any of the changes in latency_hiding_scheduler.cc/h (neither the past ones nor the current ones). If an instruction is resource constrained, the existing scheduling_node_crosses_overlap_limit will already be true and that node will be skipped, so ReadySetLt will not be called on that node.

I don't understand why you need to change the existing rules to enforce that some resources can have at most one instance/occupier in flight. That capability is already there. Just set the limit to 1 so that scheduling_node_crosses_overlap_limit can take care of it. It seems like you are trying to reinvent the wheel.

The goal of the changes in this PR is to avoid deadlocks when we allow multiple collectives to overlap; it does not block all cases where the number of available resources is > 1, only the cases where we know a deadlock can occur.

@seherellis
Member

I understand that you don't want to overlap certain collectives, is this correct?

scheduling_instruction_crosses_overlap_limit already should not allow scheduling the second collective's done when the first overlap is STILL open, IF you define those two collectives to use a common resource whose LIMIT is 1.

Therefore, you don't need to change anything in the existing rules.

We can set up a meeting if it is still not clear.

@terryysun
Contributor Author

I understand that you don't want to overlap certain collectives, is this correct?

scheduling_instruction_crosses_overlap_limit already should not allow scheduling the second collective's done when the first overlap is STILL open, IF you define those two collectives to use a common resource whose LIMIT is 1.

Therefore, you don't need to change anything in the existing rules.

We can set up a meeting if it is still not clear.

The resource limit is 2 https://github.com/openxla/xla/pull/19026/files#diff-4d02a4c828453f0e66c4280dc9a054f0536a1f3fefcd41d298af0d6c182b5a8eR423

@seherellis
Member

Why aren't you setting the limit to 1 if you don't want them to overlap?

@terryysun
Contributor Author

Why aren't you setting the limit to 1 if you don't want them to overlap?

We want multiple collectives to be able to overlap. The unit test is a case where such overlapping can lead to a deadlock, so we want to prevent it. In practice we expect more cases where they should be allowed to overlap.

@seherellis
Member

I see. Can you

  • define each rank as a target-defined resource for GPU and
  • define those collectives as occupiers/releasers for those resources?

Then you can set the limit on each rank to be 1 and the existing rules will work correctly without any modification.
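For concreteness, a sketch of that encoding (hypothetical names; each rank becomes a target-defined resource with limit 1):

#include <map>
#include <set>
#include <string>
#include <vector>

// Sketch of the proposed encoding: each rank becomes a target-defined
// resource, and a collective occupies the resource of every rank in its
// replica groups.
std::vector<std::string> OccupiedRankResources(const std::set<int>& ranks) {
  std::vector<std::string> resources;
  for (int rank : ranks) {
    resources.push_back("rank_" + std::to_string(rank));
  }
  return resources;
}

std::map<std::string, int> PerRankLimits(int num_ranks) {
  std::map<std::string, int> limits;
  for (int rank = 0; rank < num_ranks; ++rank) {
    limits["rank_" + std::to_string(rank)] = 1;  // At most one occupier per rank.
  }
  return limits;
}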

@terryysun
Contributor Author

I see. Can you

  • define each rank as a target-defined resource for GPU and
  • define those collectives as occupiers/releasers for those resources?

Then you can set the limit on each rank to be 1 and the existing rules will work correctly without any modification.

No, fundamentally it's not a constraint on any single GPU. For example, say we have two collectives with replica groups as below:

collective_0: {0, 1}, collective_1: {0, 1} -- NOT OK to overlap
collective_0: {0, 1}, collective_1: {0} -- OK to overlap

and then another two collectives with replica groups below:

collective_2: {1, 2}, collective_3: {1, 2} -- NOT OK to overlap
collective_2: {1, 2}, collective_3: {1} -- OK to overlap

There's no way to control this behavior with a limit here.
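To make the distinction concrete, here is a tiny self-contained check over those example groups (hypothetical helper; replica groups flattened to rank sets):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

// Illustrative only: each collective is reduced to the flat set of ranks it
// touches; the check is pairwise, not a per-rank resource limit.
size_t SharedRanks(const std::set<int>& a, const std::set<int>& b) {
  std::vector<int> shared;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(shared));
  return shared.size();
}

int main() {
  // collective_0 {0, 1} vs collective_1 {0, 1}: 2 shared ranks -> NOT OK.
  std::cout << (SharedRanks({0, 1}, {0, 1}) < 2) << "\n";  // prints 0
  // collective_0 {0, 1} vs collective_1 {0}: 1 shared rank -> OK.
  std::cout << (SharedRanks({0, 1}, {0}) < 2) << "\n";     // prints 1
  // collective_2 {1, 2} vs collective_3 {1, 2}: 2 shared ranks -> NOT OK.
  std::cout << (SharedRanks({1, 2}, {1, 2}) < 2) << "\n";  // prints 0
  // collective_2 {1, 2} vs collective_3 {1}: 1 shared rank -> OK.
  std::cout << (SharedRanks({1, 2}, {1}) < 2) << "\n";     // prints 1
  return 0;
}

The decision depends on how many ranks a given pair of collectives shares, which is why a fixed per-rank or per-resource limit cannot express it.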

Member

@seherellis seherellis left a comment


Okay, now I understand what you are doing. I am okay with the latest changes in latency_hiding_scheduler.h/cc.

@golechwierowicz I leave the rest of the review to you. Please check my other comment about style preferences.

// Number of overlapping ranks between this occupier and candidate
size_t overlapping_count = CountOverlappingRanks(
    curr_start_inst->replica_groups(), occupier->replica_groups());
// VLOG(0) << "overlapping_count: " << overlapping_count;
Member


Seems to be forgotten.

Contributor Author

@terryysun terryysun Nov 15, 2024


good catch! fixed

}
}

if (node->GetResources().size() == 0) return false;
Member


I think we always prefer the following style:

if (condition) {
  return false;
}

@golechwierowicz I'm not sure how you check and communicate our style preferences to external contributors. I leave this to you. I'm also seeing different integer types (int, int64_t, size_t), just FYI.

Contributor Author

@terryysun terryysun Nov 15, 2024


Is there a linter/formatter I can run to get warnings about these style preferences? I'd love to follow the preferred style.

Contributor Author


btw in case this can save people a minute, I looked into https://google.github.io/styleguide/cppguide.html and here's what I found:

regarding curly braces in branching statements:

// OK - braces are optional in this case.
if (x == kFoo) return new Foo();

// OK - condition fits on one line, body fits on another.
if (x == kBar)
  Bar(arg1, arg2, arg3);

The Google C++ Style Guide allows omitting curly braces for short statements. Nonetheless, I updated the code to use curly braces, since the cases above are described as exceptions.

regarding integer types:

The standard library header <cstdint> defines types like int16_t, uint32_t, int64_t, etc. You should always use those in preference to short, unsigned long long and the like, when you need a guarantee on the size of an integer. Prefer to omit the std:: prefix for these types, as the extra 5 characters do not merit the added clutter. Of the built-in integer types, only int should be used. When appropriate, you are welcome to use standard type aliases like size_t and ptrdiff_t.

We use int very often, for integers we know are not going to be too big, e.g., loop counters. Use plain old int for such things. You should assume that an int is at least 32 bits, but don't assume that it has more than 32 bits. If you need a 64-bit integer type, use int64_t or uint64_t.

I believe the usages of int/int64_t/size_t in this PR comply with the style guide, but I'm open to suggestions in case you have a higher standard internally.
