
[JAX] Support Ring Attention (Context Parallelism) #1059

Merged: 2 commits into NVIDIA:main from mingh/ring_attn_primitive on Nov 11, 2024

Conversation

@mingxu1067 (Collaborator) commented on Jul 30, 2024

Description

Add ring attention as an additional context parallel strategy to the JAX fused attention API.
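
For intuition, here is a conceptual, unmasked sketch of the ring exchange that this context parallel strategy builds on, written against public JAX APIs (lax.ppermute, lax.scan). It is not the PR's implementation and omits causal masking, the combined-KV format, and load balancing; the function name and the [batch, seq_chunk, heads, head_dim] layout are assumptions for illustration.

import jax.numpy as jnp
from jax import lax

def ring_attention_sketch(q, k, v, cp_size, axis_name="cp"):
    # Conceptual sketch only: unmasked ring attention over sequence-sharded
    # q/k/v of shape [batch, seq_chunk, heads, head_dim], intended to run
    # under shard_map/pmap with a mesh axis named `axis_name`.
    scale = q.shape[-1] ** -0.5
    perm = [(i, (i + 1) % cp_size) for i in range(cp_size)]
    batch, seq, heads, head_dim = q.shape
    acc = jnp.zeros((batch, heads, seq, head_dim), dtype=jnp.float32)
    row_max = jnp.full((batch, heads, seq), -jnp.inf, dtype=jnp.float32)
    row_sum = jnp.zeros((batch, heads, seq), dtype=jnp.float32)

    def body(carry, _):
        k_chunk, v_chunk, acc, row_max, row_sum = carry
        # Attend to the KV chunk currently held by this rank, keeping running
        # softmax statistics (online-softmax style) so partials can be merged.
        scores = jnp.einsum("bqhd,bkhd->bhqk", q, k_chunk).astype(jnp.float32) * scale
        new_max = jnp.maximum(row_max, scores.max(axis=-1))
        correction = jnp.exp(row_max - new_max)
        p = jnp.exp(scores - new_max[..., None])
        acc = acc * correction[..., None] + jnp.einsum(
            "bhqk,bkhd->bhqd", p, v_chunk.astype(jnp.float32)
        )
        row_sum = row_sum * correction + p.sum(axis=-1)
        # Rotate the KV chunk to the next rank in the ring.
        k_chunk = lax.ppermute(k_chunk, axis_name, perm=perm)
        v_chunk = lax.ppermute(v_chunk, axis_name, perm=perm)
        return (k_chunk, v_chunk, acc, new_max, row_sum), None

    (_, _, acc, _, row_sum), _ = lax.scan(
        body, (k, v, acc, row_max, row_sum), None, length=cp_size
    )
    out = acc / row_sum[..., None]
    return out.transpose(0, 2, 1, 3).astype(q.dtype)  # back to [b, s, h, d]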

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Adds ring attention as a new context parallel strategy, along with unit tests.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@mingxu1067 marked this pull request as draft on July 30, 2024 17:57
@mingxu1067 force-pushed the mingh/ring_attn_primitive branch 2 times, most recently from a42f6a2 to 7a147c8 on July 30, 2024 20:24
@mgoldfarb-nvidia force-pushed the mingh/ring_attn_primitive branch 7 times, most recently from f50f6f0 to 9fb3dc3 on November 5, 2024 22:07
@mgoldfarb-nvidia marked this pull request as ready for review on November 5, 2024 22:09
@mgoldfarb-nvidia (Collaborator) commented:

/te-ci jax L1


@staticmethod
@cache
def use_scanloop():

This is a bit ugly, but it is a necessity of using a scan loop, which is preferred because it gives XLA more control over unrolling. We don't prevent disabling scan, but we do warn the user to update their XLA flags. NVTE_FUSED_RING_ATTENTION_USE_SCAN defaults to 1 when undefined.

Hopefully --xla_experimental_ignore_channel_id becomes default in XLA:GPU at some point.
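
For illustration, a minimal sketch of what such a cached environment check might look like, assuming it only reads the variable named above; the class name, the XLA_FLAGS check, and the warning text are assumptions rather than the PR's actual code.

import os
import warnings
from functools import cache

class RingAttentionConfig:  # hypothetical holder for illustration only
    @staticmethod
    @cache
    def use_scanloop() -> bool:
        # NVTE_FUSED_RING_ATTENTION_USE_SCAN defaults to 1 when undefined.
        use_scan = bool(int(os.getenv("NVTE_FUSED_RING_ATTENTION_USE_SCAN", "1")))
        if use_scan and "--xla_experimental_ignore_channel_id" not in os.getenv("XLA_FLAGS", ""):
            # Assumption: the warning nudges users toward the XLA flag mentioned above.
            warnings.warn(
                "Scan-based ring attention works best with "
                "--xla_experimental_ignore_channel_id in XLA_FLAGS."
            )
        return use_scan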


# Combine KV tensors if separate for better permute scheduling and performance.
# Eventually XLA should perform this automatically.
kv = helper.stack_kv(k, v)

XLA:GPU does a poor job of scheduling two or more collective permutes with overlap, so we manually combine K and V into our combined KV format. There is a plan to add functionality to XLA that performs this fusion automatically.
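
As a rough illustration (not the PR's actual helper), stacking K and V along a new leading axis lets a single collective permute move both tensors on every ring step; the helper names and axis choice here are assumptions.

import jax.numpy as jnp

def stack_kv(k, v):
    # Stack K and V along a new leading axis so one collective permute
    # moves both tensors per ring step.
    return jnp.stack([k, v], axis=0)

def unstack_kv(kv):
    # Split the combined tensor back into K and V before calling the kernel.
    return kv[0], kv[1]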

@zlsh80826 (Collaborator) left a comment:

Thank you for the impressive work! I’ll need a few days to complete my review of the partition details within the ring attention.

Resolved (now outdated) review threads on: transformer_engine/jax/cpp_extensions/attention.py, transformer_engine/jax/cpp_extensions/misc.py
@mgoldfarb-nvidia force-pushed the mingh/ring_attn_primitive branch 5 times, most recently from e35fd0d to 51922dc on November 7, 2024 23:28
@mgoldfarb-nvidia (Collaborator) commented:

/te-ci jax L1

Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Signed-off-by: Ming Huang <mingh@nvidia.com>
@mgoldfarb-nvidia (Collaborator) commented:

/te-ci jax L1

@zlsh80826 (Collaborator) left a comment:

LGTM!

    return lax.cond((idx <= cp_rank), no_mask_compute, skip_compute)

output_per_step, softmax_aux_per_step = lax.cond(
    idx == 0, causal_mask_compute, jax_cond_wrap
)

Could you clarify why causal computation is applied to idx == 0? I initially thought that the causal mask would be needed for the trailing block instead.


Here is a picture of what happens in both the unbalanced and load-balanced cases:

(figure: per-step Q/KV block schedule for unbalanced vs. load-balanced ring attention)

Each GPU_i starts with Q_i and KV_i; these are the causal-masked blocks along the diagonal. All subsequent iterations are either skipped or unmasked partial computations. In the load-balanced case this is where the "half KV" and "half Q" cases come from.
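
A condensed sketch of the per-step dispatch being discussed, with the operand plumbing stripped out; the wrapper function name and argument list are simplified for illustration, and only the branch structure mirrors the quoted code.

from jax import lax

def select_step_compute(idx, cp_rank, causal_mask_compute, no_mask_compute, skip_compute):
    # Step 0 operates on this rank's own (Q_i, KV_i) diagonal block, the only
    # block that needs the causal mask. Later steps hold KV received from
    # other ranks: they run an unmasked partial computation when the chunk
    # lies in this rank's causal past (idx <= cp_rank here), and skip it
    # otherwise.
    def jax_cond_wrap():
        return lax.cond(idx <= cp_rank, no_mask_compute, skip_compute)

    return lax.cond(idx == 0, causal_mask_compute, jax_cond_wrap)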

for idx in range(cp_size):
    output = output + output_per_steps[idx].astype(jnp.float32) * jnp.exp(
        softmax_aux_per_steps[idx] - softmax_aux
    ).transpose(0, 2, 1, 3)

Would it be feasible to transpose softmax_aux inside scan_block to allow for pipelining the transpose?


If it doesn't affect performance much, we can keep the transpose here.
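
For context, the correction being discussed can be read as a standalone combine step: each per-step partial output is rescaled by exp(step aux - global aux) before accumulation. The wrapper function below is illustrative; only the loop body is taken from the quoted snippet, and the aux layout (hence the transpose) is assumed to follow it.

import jax.numpy as jnp

def combine_ring_outputs(output_per_steps, softmax_aux_per_steps, softmax_aux, cp_size):
    # Each step produced a partial output using only its local softmax
    # statistics; rescaling by exp(step_aux - global_aux) and accumulating in
    # float32 recovers the globally normalized attention output.
    output = jnp.zeros(output_per_steps[0].shape, dtype=jnp.float32)
    for idx in range(cp_size):
        output = output + output_per_steps[idx].astype(jnp.float32) * jnp.exp(
            softmax_aux_per_steps[idx] - softmax_aux
        ).transpose(0, 2, 1, 3)
    return output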

@mgoldfarb-nvidia merged commit bfddb48 into NVIDIA:main on Nov 11, 2024
21 checks passed