Hello team,

we have been noticing some pretty large deviations between the attention output of flash/unfused attention versus the fused attention kernels when sliding window attention is active. The following sample illustrates this:

The output we see on H100 and CUDA 12.5 with cuDNN 9.2.1 is:

The latter one seems rather large. Can you reproduce these results?
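The comparison in the sample is essentially of the following shape (a simplified sketch in plain PyTorch rather than TE, with made-up shapes, dtype and window size; the actual sample compares TE's flash, fused and unfused backends directly):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16

# Illustrative sizes and sliding window; not the values from the actual sample.
batch, heads, seqlen, head_dim = 2, 8, 512, 64
window_left = 128  # each query attends to itself and the previous 128 positions

q, k, v = (torch.randn(batch, heads, seqlen, head_dim, device=device, dtype=dtype)
           for _ in range(3))

# Causal sliding-window mask, inclusive of the left boundary:
# query i may attend to keys j with i - window_left <= j <= i.
pos = torch.arange(seqlen, device=device)
allowed = (pos[None, :] <= pos[:, None]) & (pos[None, :] >= pos[:, None] - window_left)

# Kernel-based path: PyTorch SDPA applied with the explicit sliding-window mask.
out_kernel = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)

# Unfused reference: explicit matmul + softmax in fp32, then cast back.
scores = (q.float() @ k.float().transpose(-2, -1)) / head_dim ** 0.5
scores = scores.masked_fill(~allowed, float("-inf"))
out_ref = (scores.softmax(dim=-1) @ v.float()).to(dtype)

print("diff kernel vs unfused:", (out_kernel - out_ref).abs().max().item())
```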
Thanks for raising this issue. I seem to have overlooked the different window_size definition in cuDNN. cuDNN supports a sliding window of (i - window_size_left, i], exclusive of the i - window_size_left element, whereas the original paper, flash-attn and TE unfused DPA use the definition [i - window_size_left, i + window_size_right], which is inclusive of the boundary elements. Please give #1212 a try and let me know if there are still any issues. Thanks!
Results:
diff flash vs unfused: 0.0330810546875
diff fused vs unfused: 0.033203125
diff flash vs fused: 0.001953125
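For concreteness, the two definitions differ by exactly one key per query row once the window lies fully inside the sequence. A tiny plain-PyTorch sketch of the two masks (illustrative only, causal case, not the actual TE or cuDNN code):

```python
import torch

# Causal case (window_size_right = 0) for simplicity; tiny sizes for illustration.
seqlen, window_left = 6, 3
i = torch.arange(seqlen)[:, None]  # query positions (rows)
j = torch.arange(seqlen)[None, :]  # key positions (columns)

# flash-attn / TE unfused DPA convention: inclusive window [i - window_left, i]
mask_inclusive = (j >= i - window_left) & (j <= i)

# cuDNN convention: left boundary excluded, (i - window_left, i]
mask_exclusive = (j > i - window_left) & (j <= i)

print(mask_inclusive.int())
print(mask_exclusive.int())
# For rows with i >= window_left, the masks differ exactly at j == i - window_left:
# the inclusive window [i - w, i] covers the same keys as the exclusive window
# (i - (w + 1), i], which is where the off-by-one comes from.
```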
Hello Charlene,
Oh, I see, that makes sense. I just tested your fix in our training environment and I can confirm that the issue is fixed.
Thanks for looking into this so quickly and sharing the details. 🥳