How to use var_len attention in context parallel? #1080

liuzhaowen1218 · 2024-08-06T09:39:00Z

I am running a 128k sequence length SFT task whose training tokens include many pad tokens, which may lead to abnormal loss because of attention to irrelevant tokens. Without Context Parallel, this problem can be fixed by passing custom cu_seqlens_q and cu_seqlens_kv to flash_attn_varlen_func. How can I get the same result when using Context Parallel?

TransformerEngine/transformer_engine/pytorch/attention.py

Line 2504 in 27c6342

or attn_mask_type in ["padding", "padding_causal"]

xrennvidia · 2024-08-06T17:28:53Z

Hi @liuzhaowen1218

You can refer to our TE unit test.

You need to make sure each individual sequence length is divisible by CPx2 (twice the context parallel size). If your data does not meet this requirement, you need to pad a few tokens onto each sequence so that its length is divisible by CPx2. Then you can split your input across GPUs like this.
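For illustration only, here is a minimal pure-Python sketch of the padding rule above. The split it shows is an assumption: it uses the load-balanced scheme in which each padded sequence is cut into 2 x CP equal chunks and rank r takes chunk r plus the mirrored chunk 2 x CP - 1 - r. The helper names `pad_to_multiple` and `local_token_indices` are hypothetical and are not TE APIs; check the linked unit test for the actual code.

```python
def pad_to_multiple(seq_len, cp_size):
    """Round seq_len up to the next multiple of 2 * cp_size."""
    multiple = 2 * cp_size
    return -(-seq_len // multiple) * multiple  # ceiling division, then scale


def local_token_indices(padded_len, cp_size, rank):
    """Token positions of one padded sequence kept by a given CP rank.

    Assumption: load-balanced split, where the sequence is cut into
    2 * cp_size equal chunks and rank r keeps chunk r and chunk
    (2 * cp_size - 1 - r).
    """
    if padded_len % (2 * cp_size) != 0:
        raise ValueError("padded_len must be divisible by 2 * cp_size")
    chunk = padded_len // (2 * cp_size)
    mirror = 2 * cp_size - 1 - rank
    first = range(rank * chunk, (rank + 1) * chunk)
    second = range(mirror * chunk, (mirror + 1) * chunk)
    return list(first) + list(second)


# Example: a 10-token sequence with CP=2 is padded to 12 tokens.
padded = pad_to_multiple(10, cp_size=2)
per_rank = [local_token_indices(padded, 2, r) for r in range(2)]
```

With a causal mask, pairing an early chunk with a late one gives every rank a similar amount of attention work, which is why this scheme needs 2 x CP chunks rather than CP.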

Since each sequence may need some padding, you may end up with padded tokens between sequences. Hence you need to handle not only cu_seqlens but also cu_seqlens_padded (refer here). Padding between sequences is only supported with FusedAttention, not with FlashAttention.
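As a rough sketch of the difference between the two arrays, assuming cu_seqlens holds cumulative offsets of the real (unpadded) lengths and cu_seqlens_padded holds cumulative offsets after each sequence is padded to a multiple of CPx2: the helper names below are hypothetical, and how the results are passed into the attention call is not shown here.

```python
from itertools import accumulate


def cumulative_offsets(lengths):
    """Return [0, l0, l0 + l1, ...] for a list of sequence lengths."""
    return [0] + list(accumulate(lengths))


def build_cu_seqlens(seq_lens, cp_size):
    """Build (cu_seqlens, cu_seqlens_padded) for packed sequences.

    Assumption: every sequence is padded individually up to the next
    multiple of 2 * cp_size, so padding sits between sequences.
    """
    multiple = 2 * cp_size
    padded_lens = [-(-n // multiple) * multiple for n in seq_lens]
    return cumulative_offsets(seq_lens), cumulative_offsets(padded_lens)


# Three packed sequences of 5, 8 and 3 tokens with CP=2 (multiple of 4).
cu_seqlens, cu_seqlens_padded = build_cu_seqlens([5, 8, 3], cp_size=2)
print(cu_seqlens)         # [0, 5, 13, 16]
print(cu_seqlens_padded)  # [0, 8, 16, 20]
```

In real code these would be int32 tensors on the GPU; plain lists are used here only to keep the arithmetic visible.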

ptrendx assigned cyanguwa Aug 6, 2024
