[PyTorch] fused CUDNN attention kernel and sliding window attention #1197

Closed · Marks101 opened this issue Sep 23, 2024 · 3 comments · Fixed by #1212

@Marks101 (Contributor)

Hello team,

We have been noticing some fairly large deviations between the attention output of flash/unfused attention and the fused attention kernels when sliding window attention is active. The following sample illustrates this:

import torch

from transformer_engine.pytorch.attention import FlashAttention, FusedAttention, UnfusedDotProductAttention, get_swa_mask
from transformer_engine.common.recipe import DelayedScaling
import transformer_engine_torch as tex

# Sliding window: 1024 positions to the left, 0 to the right (causal-style)
window_size = (1024, 0)
seqlen, num_heads, kv_channels = 2048, 64, 64

# Random q/k/v in sbhd layout (sequence, batch, heads, head dim), batch size 1
q, k, v = [torch.randn(seqlen, 1, num_heads, kv_channels, dtype=torch.float16, device="cuda") for _ in range(3)]

flash_attn = FlashAttention(1.0)
fused_attn = FusedAttention(1.0)
unfused_attn = UnfusedDotProductAttention(1.0)

output_flash = flash_attn(q, k, v, "sbhd_sbhd_sbhd", window_size=window_size)
output_fused = fused_attn(q, k, v, "sbhd_sbhd_sbhd",
                          fused_attention_backend=tex.NVTE_Fused_Attn_Backend.NVTE_F16_arbitrary_seqlen,
                          window_size=window_size,
                          fp8_meta=dict(recipe=DelayedScaling()))

# The unfused path takes an explicit boolean mask, so build the sliding-window mask for it
attention_mask = torch.ones(1, 1, seqlen, seqlen, dtype=torch.bool, device="cuda")
attn_mask_type, attention_mask = get_swa_mask(window_size, seqlen, seqlen, "causal", attention_mask)
output_unfused = unfused_attn(q, k, v, attn_mask_type=attn_mask_type, attention_mask=attention_mask)

print("diff flash vs unfused:", torch.max(torch.abs(output_flash - output_unfused)).item())
print("diff fused vs unfused:", torch.max(torch.abs(output_fused - output_unfused)).item())

The output we see on H100 and CUDA 12.5 with CUDNN 9.2.1 is:

diff flash vs unfused: 0.03076171875
diff fused vs unfused: 4.8828125

The latter seems rather large. Can you reproduce these results?

@ksivaman (Member)

@cyanguwa Do you know what could be causing this?

@cyanguwa (Collaborator)

Hi @Marks101 ,

Thanks for raising this issue. I seem to have overlooked the different window_size definition in cuDNN. cuDNN supports a sliding window of (i - window_size_left, i], i.e. exclusive of the i - window_size_left element, whereas the original paper, flash-attn, and TE unfused DPA use the definition [i - window_size_left, i + window_size_right], which is inclusive of the boundary elements. Please give #1212 a try and let me know if there are still any issues. Thanks!
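
To make the two conventions concrete, here is a minimal standalone sketch (plain PyTorch, not the TE or cuDNN API) that builds both sliding-window masks for a causal case and shows they differ exactly on the j == i - window_size_left diagonal:

import torch

# Position i may attend to position j wherever the mask is True
seqlen, window_left = 8, 3
i = torch.arange(seqlen).unsqueeze(1)  # query positions, shape (seqlen, 1)
j = torch.arange(seqlen).unsqueeze(0)  # key positions, shape (1, seqlen)

# Inclusive convention (flash-attn / TE unfused DPA): j in [i - window_left, i]
mask_inclusive = (j >= i - window_left) & (j <= i)

# Exclusive-left convention (cuDNN): j in (i - window_left, i]
mask_exclusive = (j > i - window_left) & (j <= i)

# The masks disagree only where j == i - window_left, so the same
# window_size value produces different attention patterns
print((mask_inclusive ^ mask_exclusive).nonzero())

Under these definitions, an exclusive-left window of window_size_left + 1 covers the same positions as an inclusive window of window_size_left (with window_size_right = 0).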

Results:

diff flash vs unfused: 0.0330810546875
diff fused vs unfused: 0.033203125
diff flash vs   fused: 0.001953125

@Marks101 (Contributor, Author)

Hello Charlene,
Oh, I see, that makes sense. I just tested your fix in our training environment and I can confirm that the issue is fixed.
Thanks for looking into this so quickly and sharing the details 🥳
