[NVIDIA GPU] LHS enhancement for multiple collective resources #19026

Open

terryysun wants to merge 2 commits into main from terryysun:terryysun/overlapping_collectives

Conversation

terryysun
Contributor

With #17749, we can let LHS schedule multiple collective resources. There are some cases in which two collectives cannot be overlapped: when two collectives on different streams share at least 2 ranks, they can form a cyclic dependency, because the execution order of the NCCL kernels can differ across ranks. This PR refactors LHS to expose the comparator to the backend and enforces the above constraint for the GPU backend.
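The constraint is easy to state in isolation. Below is a minimal sketch, assuming replica groups reduced to plain rank lists; the helper name CountOverlappingRanks mirrors the one in this PR's diff, everything else is illustrative rather than the actual scheduler code.

#include <cstddef>
#include <set>
#include <vector>

// Sketch only: replica groups are simplified to plain vectors of ranks, and
// the helper name mirrors CountOverlappingRanks from this PR's diff.
size_t CountOverlappingRanks(const std::vector<std::vector<int>>& groups_a,
                             const std::vector<std::vector<int>>& groups_b) {
  std::set<int> ranks_a;
  for (const auto& group : groups_a) {
    ranks_a.insert(group.begin(), group.end());
  }
  std::set<int> shared;
  for (const auto& group : groups_b) {
    for (int rank : group) {
      if (ranks_a.count(rank) > 0) {
        shared.insert(rank);
      }
    }
  }
  return shared.size();
}

// Two collectives may overlap on different streams only if they share fewer
// than two ranks; otherwise the per-rank NCCL launch order can differ and
// form a cyclic dependency (deadlock).
bool MayOverlap(const std::vector<std::vector<int>>& a,
                const std::vector<std::vector<int>>& b) {
  return CountOverlappingRanks(a, b) < 2;
}

Under this rule, two collectives over ranks {0, 1} must not overlap, while one over {0, 1} and one over {0} may.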

Member

@golechwierowicz golechwierowicz left a comment


Thanks for the PR!

Sorry, I don't follow why we need to expose ReadySetLt to implement this functionality.

ReadySetLt can take IsResourceConstrained as a function type in its constructor (similarly to early_target_scheduling_rule_) and just use that. It will be less code and more coherent with how we do things in this codebase currently.
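For illustration only, the injection pattern being suggested might look like the sketch below; HloGraphNode and the constructor signature here are placeholders, not the actual ReadySetLt interface.

#include <functional>
#include <utility>

// Hypothetical stand-in for the scheduler's graph node type.
struct HloGraphNode {
  int id = 0;
};

class ReadySetLt {
 public:
  using IsResourceConstrainedFn = std::function<bool(const HloGraphNode&)>;

  // The backend injects its constraint check through the constructor, the
  // same way early_target_scheduling_rule_ is supplied, instead of exposing
  // the comparator class itself.
  explicit ReadySetLt(IsResourceConstrainedFn is_resource_constrained)
      : is_resource_constrained_(std::move(is_resource_constrained)) {}

  bool operator()(const HloGraphNode& a, const HloGraphNode& b) const {
    // The existing ordering rules would go here, consulting
    // is_resource_constrained_(a) / is_resource_constrained_(b) where the
    // backend-specific constraint matters.
    return a.id < b.id;  // Placeholder ordering.
  }

 private:
  IsResourceConstrainedFn is_resource_constrained_;
};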

@terryysun terryysun force-pushed the terryysun/overlapping_collectives branch from f839d6e to 7ba2915 on November 10, 2024 23:11
@terryysun
Contributor Author

Thanks for the suggestion @golechwierowicz! I've refactored the code to follow the existing paradigm, could you take another look?

Member

@golechwierowicz golechwierowicz left a comment


LGTM!

copybara-service bot pushed a commit that referenced this pull request Nov 13, 2024
[NVIDIA GPU] LHS enhancement for multiple collective resources

Imported from GitHub PR #19026

With #17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend.
Copybara import of the project:

--
09ce310 by Terry Sun <tesun@nvidia.com>:

LHS deadlock avoidance

Merging this change closes #19026

FUTURE_COPYBARA_INTEGRATE_REVIEW=#19026 from terryysun:terryysun/overlapping_collectives 09ce310
PiperOrigin-RevId: 696020313
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 13, 2024
[NVIDIA GPU] LHS enhancement for multiple collective resources

Imported from GitHub PR openxla/xla#19026

With openxla/xla#17749, we can let LHS schedule for multiple collective resources. There are some cases that two collectives cannot be overlapped. When two collectives on different stream share at least 2 ranks, they can form cyclic dependency because the execution order of NCCL kernels can be different on each rank. This PR refactored LHS to expose the comparator to backend, and enforced above constraint for GPU backend.
Copybara import of the project:

--
09ce310cc14f763ec8a0a711b1787f7798122d4c by Terry Sun <tesun@nvidia.com>:

LHS deadlock avoidance

Merging this change closes #19026

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19026 from terryysun:terryysun/overlapping_collectives 09ce310cc14f763ec8a0a711b1787f7798122d4c
PiperOrigin-RevId: 696020313
@seherellis seherellis self-requested a review November 13, 2024 21:50
@seherellis
Member

I believe none of the is_resource_constrained additions in the existing rules are needed. Why can't you use scheduling_instruction_crosses_overlap_limit instead (to prevent those instructions from being scheduled)?

@terryysun terryysun force-pushed the terryysun/overlapping_collectives branch from 09ce310 to 14362ea on November 14, 2024 03:06
@terryysun
Contributor Author

I believe none of the is_resource_constrained additions in the existing rules are needed. Why can't you use scheduling_instruction_crosses_overlap_limit instead (to prevent those instructions from being scheduled)?

Thanks for the suggestion @seherellis! I refactored the code with scheduling_instruction_crosses_overlap_limit and it's cleaner now. @golechwierowicz could you help take another look? thanks!

Member

@seherellis seherellis left a comment


I meant that you don't need any of the changes in latency_hiding_scheduler.cc/h (neither the past ones nor the current ones). If an instruction is resource constrained, the existing scheduling_node_crosses_overlap_limit will already be true and that node will be skipped, so ReadySetLt will not be called on that node.

I don't understand why you need to change the existing rules to enforce that some resources can have at most one instance/occupier in flight. That capability is already there. Just set the limit to 1 so that scheduling_node_crosses_overlap_limit can take care of it. It seems like you are trying to reinvent the wheel.
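As a rough sketch of the mechanism being described (hypothetical names and a simplified bookkeeping structure, not the scheduler's actual data structures):

#include <map>
#include <string>

// Simplified bookkeeping: the scheduler tracks how many occupiers of each
// resource are in flight, and a node whose resources are already at their
// limit is skipped before the comparator ever runs.
struct ResourceTracker {
  std::map<std::string, int> in_flight;  // resource -> open occupiers
  std::map<std::string, int> limit;      // resource -> max concurrent occupiers

  // True if scheduling one more occupier of `resource` would exceed its limit.
  bool CrossesOverlapLimit(const std::string& resource) const {
    auto it = in_flight.find(resource);
    const int current = (it == in_flight.end()) ? 0 : it->second;
    return current + 1 > limit.at(resource);
  }
};

With limit["collective"] set to 1, a second collective start crosses the limit while the first one is still open, which is the behavior described above.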

@terryysun
Contributor Author

I meant that you don't need any of the changes in latency_hiding_scheduler.cc/h (neither the past ones nor the current ones). If an instruction is resource constrained, the existing scheduling_node_crosses_overlap_limit will already be true and that node will be skipped, so ReadySetLt will not be called on that node.

I don't understand why you need to change the existing rules to enforce that some resources can have at most one instance/occupier in flight. That capability is already there. Just set the limit to 1 so that scheduling_node_crosses_overlap_limit can take care of it. It seems like you are trying to reinvent the wheel.

The goal of the changes in this PR is to avoid deadlocks when we allow multiple collectives to overlap; it does not block all cases where the number of available resources is > 1, only the cases where we know a deadlock can occur.

@seherellis
Member

I understand that you don't want to overlap certain collectives, is this correct?

scheduling_instruction_crosses_overlap_limit already should not allow scheduling the second collective's done when the first overlap is STILL open, IF you define those two collectives to use a common resource whose LIMIT is 1.

Therefore, you don't need to change anything in the existing rules.

We can set up a meeting if it is still not clear.

@terryysun
Contributor Author

I understand that you don't want to overlap certain collectives, is this correct?

scheduling_instruction_crosses_overlap_limit already should not allow scheduling the second collective's done when the first overlap is STILL open, IF you define those two collectives to use a common resource whose LIMIT is 1.

Therefore, you don't need to change anything in the existing rules.

We can set up a meeting if it is still not clear.

The resource limit is 2 https://github.com/openxla/xla/pull/19026/files#diff-4d02a4c828453f0e66c4280dc9a054f0536a1f3fefcd41d298af0d6c182b5a8eR423

@seherellis
Member

Why aren't you setting the limit to 1 if you don't want them to overlap?

@terryysun
Contributor Author

Why aren't you setting the limit to 1 if you don't want them to overlap?

We want multiple collectives to be able to overlap. The unit test is a case where such overlapping can lead to a deadlock, so we want to prevent it. In practice we expect more cases where they should be allowed to overlap.

@seherellis
Member

I see. Can you

  • define each rank as a target-defined resource for GPU and
  • define those collectives as occupiers/releasers for those resources?

Then you can set the limit on each rank to be 1 and the existing rules will work correctly without any modification.
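For concreteness, a sketch of that encoding (hypothetical names; each rank becomes a target-defined resource with limit 1):

#include <map>
#include <set>
#include <string>
#include <vector>

// Sketch of the proposed encoding: each rank becomes a target-defined
// resource, and a collective occupies the resource of every rank in its
// replica groups.
std::vector<std::string> OccupiedRankResources(const std::set<int>& ranks) {
  std::vector<std::string> resources;
  for (int rank : ranks) {
    resources.push_back("rank_" + std::to_string(rank));
  }
  return resources;
}

std::map<std::string, int> PerRankLimits(int num_ranks) {
  std::map<std::string, int> limits;
  for (int rank = 0; rank < num_ranks; ++rank) {
    limits["rank_" + std::to_string(rank)] = 1;  // At most one occupier per rank.
  }
  return limits;
}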

@terryysun
Contributor Author

I see. Can you

  • define each rank as a target-defined resource for GPU and
  • define those collectives as occupiers/releasers for those resources?

Then you can set the limit on each rank to be 1 and the existing rules will work correctly without any modification.

No, fundamentally it's not a constraint on any single GPU. For example, say we have two collectives with replica groups as below:

collective_0: {0, 1}, collective_1: {0, 1} -- NOT OK to overlap
collective_0: {0, 1}, collective_1: {0} -- OK to overlap

and then another two collectives with replica groups below:

collective_2: {1, 2}, collective_3: {1, 2} -- NOT OK to overlap
collective_2: {1, 2}, collective_3: {1} -- OK to overlap

There's no way to control this behavior with a limit here.
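To make the distinction concrete, here is a tiny self-contained check over those example groups (hypothetical helper; replica groups flattened to rank sets):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

// Illustrative only: each collective is reduced to the flat set of ranks it
// touches; the check is pairwise, not a per-rank resource limit.
size_t SharedRanks(const std::set<int>& a, const std::set<int>& b) {
  std::vector<int> shared;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(shared));
  return shared.size();
}

int main() {
  // collective_0 {0, 1} vs collective_1 {0, 1}: 2 shared ranks -> NOT OK.
  std::cout << (SharedRanks({0, 1}, {0, 1}) < 2) << "\n";  // prints 0
  // collective_0 {0, 1} vs collective_1 {0}: 1 shared rank -> OK.
  std::cout << (SharedRanks({0, 1}, {0}) < 2) << "\n";     // prints 1
  // collective_2 {1, 2} vs collective_3 {1, 2}: 2 shared ranks -> NOT OK.
  std::cout << (SharedRanks({1, 2}, {1, 2}) < 2) << "\n";  // prints 0
  // collective_2 {1, 2} vs collective_3 {1}: 1 shared rank -> OK.
  std::cout << (SharedRanks({1, 2}, {1}) < 2) << "\n";     // prints 1
  return 0;
}

The decision depends on how many ranks a given pair of collectives shares, which is why a fixed per-rank or per-resource limit cannot express it.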

Member

@seherellis seherellis left a comment


Okay, now I understand what you are doing. I am okay with the latest changes in latency_hiding_scheduler.h/cc.

@golechwierowicz I leave the rest of the review to you. Please check my other comment about style preferences.

// Number of overlapping ranks between this occupier and candidate
size_t overlapping_count = CountOverlappingRanks(
    curr_start_inst->replica_groups(), occupier->replica_groups());
// VLOG(0) << "overlapping_count: " << overlapping_count;
Member


Seems to be forgotten.

Contributor Author

@terryysun terryysun Nov 15, 2024


good catch! fixed

}
}

if (node->GetResources().size() == 0) return false;
Member


I think we always prefer the following style:

if (condition) {
  return false;
}

@golechwierowicz I'm not sure how you check and communicate our style preferences to external contributors. I leave this to you. I'm also seeing different integer types (int, int64_t, size_t), just FYI.

Contributor Author

@terryysun terryysun Nov 15, 2024


Is there a linter/formatter I can run to get warnings about these style preferences? I'd love to follow the preferred style.

Contributor Author


btw in case this can save people a minute, I looked into https://google.github.io/styleguide/cppguide.html and here's what I found:

regarding curly braces in branching statements:

// OK - braces are optional in this case.
if (x == kFoo) return new Foo();

// OK - condition fits on one line, body fits on another.
if (x == kBar)
  Bar(arg1, arg2, arg3);

The Google C++ Style Guide allows omitting curly braces for short statements. Nonetheless, I updated the code to use curly braces, since the cases above are described as exceptions.

regarding integer types:

The standard library header <cstdint> defines types like int16_t, uint32_t, int64_t, etc. You should always use those in preference to short, unsigned long long and the like, when you need a guarantee on the size of an integer. Prefer to omit the std:: prefix for these types, as the extra 5 characters do not merit the added clutter. Of the built-in integer types, only int should be used. When appropriate, you are welcome to use standard type aliases like size_t and ptrdiff_t.

We use int very often, for integers we know are not going to be too big, e.g., loop counters. Use plain old int for such things. You should assume that an int is at least 32 bits, but don't assume that it has more than 32 bits. If you need a 64-bit integer type, use int64_t or uint64_t.

I believe the usages of int/int64_t/size_t in this PR comply with the style guide, but I'm open to suggestions in case you have a higher standard internally.
