[QUESTION] Does TP overlap support variable sequence length? #1303
Hi, thank you for the great work.
I'd like to ask: does TP overlap support variable sequence lengths?

Comments
TP overlap currently requires sequence parallelism and has no attention layout/format restrictions, except that the sequence length must be constant and evenly divisible by the TP size. Since you asked about the format: we do not currently support comm+GEMM overlap in the attention mechanism, so TP overlap is restricted to the GEMM layers.
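To make the shape constraint concrete, here is a minimal sketch of the check described above (plain Python with a hypothetical helper name, not Transformer Engine's actual API):

```python
def check_tp_overlap_shape(seq_len: int, tp_size: int) -> int:
    """Validate the static-shape requirement for TP overlap.

    Sequence parallelism shards the sequence dimension across the
    tensor-parallel group, so every rank must receive an identical,
    whole chunk of a fixed-length sequence.
    """
    if seq_len % tp_size != 0:
        raise ValueError(
            f"seq_len={seq_len} must be evenly divisible by tp_size={tp_size}"
        )
    # The communication kernels assume this chunk size is constant
    # across every step, which is why variable sequence lengths
    # are not supported.
    return seq_len // tp_size

# Example: seq_len=4096 with tp_size=8 gives a per-rank chunk of 512.
assert check_tp_overlap_shape(4096, 8) == 512
```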
Thank you very much.
Unfortunately, the current implementation does not support variable sequence lengths, so you would have to pad your sequences up to a static maximum. Theoretically, there is no reason why it couldn't be done, but the custom communication kernels we use for TP overlap have far too many hard-coded assumptions about buffer and work chunk sizes to strip out easily in practice. We do plan to support this in the near future, after we migrate the TP overlap functionality to the latest cuBlasMp v0.3.0 release, which introduced support for collective GEMM with overlapped communication (these are NVSHMEM-based re-implementations of the same TP overlap algorithms in Transformer Engine).
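As a workaround under the current constraint, variable-length inputs can be right-padded to a static maximum, rounded up so it stays divisible by the TP size. A minimal PyTorch sketch (hypothetical helper, not Transformer Engine code):

```python
import torch

def pad_to_static_max(seqs, max_len, tp_size, pad_id=0):
    """Right-pad variable-length sequences to a fixed length that is
    divisible by tp_size, and return the matching padding mask."""
    # Round the static maximum up to the next multiple of tp_size.
    padded_len = ((max_len + tp_size - 1) // tp_size) * tp_size
    batch = torch.full((len(seqs), padded_len), pad_id, dtype=torch.long)
    pad_mask = torch.ones(len(seqs), padded_len, dtype=torch.bool)  # True = padding
    for i, s in enumerate(seqs):
        batch[i, : s.numel()] = s
        pad_mask[i, : s.numel()] = False
    return batch, pad_mask

# Example: two ragged sequences padded to a static length of 8 (tp_size=4).
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6, 7, 8])]
batch, mask = pad_to_static_max(seqs, max_len=6, tp_size=4)
assert batch.shape == (2, 8)
```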
Thanks again for your great work.
I hope to integrate cuBlasMp into TE by mid-December at the latest. There's a chance this might support variable sequence lengths out of the box, but otherwise it would have to wait until at least January, if not later, depending on where this feature lands on our list of priorities.