Enable AMD BF16 Grouped Gemm #3526
Conversation
This pull request was exported from Phabricator. Differential Revision: D67261862
Force-pushed from 6ae3c7a to 26ab8e8
Summary: Pull Request resolved: pytorch#3526. X-link: facebookresearch/FBGEMM#608. Implementation of CK-based BF16 grouped GEMM. Currently performance is quite poor :(
Reviewed By: zjing14
Differential Revision: D67261862
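For context, here is a minimal pure-PyTorch sketch of the semantics a grouped GEMM provides: one independent matmul per group, where each group may have its own problem size. This is only a reference for what the fused CK kernel computes in a single launch; the function name, the `[N, K]` weight layout, and the shapes are assumptions for illustration, not the FBGEMM API.

```python
import torch

def grouped_gemm_reference(xs, ws):
    # One independent matmul per group; each group may have its own M.
    # A fused grouped GEMM kernel (like the CK-based one in this PR)
    # performs all of these in a single launch instead of a Python loop.
    return [x @ w.t() for x, w in zip(xs, ws)]

# Three groups with different M, shared K=256 / N=512, in bf16.
xs = [torch.randn(m, 256, dtype=torch.bfloat16) for m in (16, 32, 64)]
ws = [torch.randn(512, 256, dtype=torch.bfloat16) for _ in range(3)]
outs = grouped_gemm_reference(xs, ws)
print([tuple(o.shape) for o in outs])  # [(16, 512), (32, 512), (64, 512)]
```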
Summary: This diff cleans up some of the APIs for FBGEMM grouped GEMM and updates the CUTLASS BF16 grouped GEMM to use a single kernel launch to initialize the GEMM arguments, which should help reduce overhead a bit. The only notable API change exposed to the user is that all grouped GEMM functions now return lists of outputs, where BF16 previously returned a single tensor blob. This does mean that in some cases we'll have to do an extra `torch.stack` to unify the groups; see the sketch below. If that turns out to be costly, we can instead have two grouped GEMM implementations: one for dynamic shapes, which returns a single tensor, and one for static shapes, which returns a list of tensors.
Reviewed By: jianyuh
Differential Revision: D67423469
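A minimal sketch of what that API change means on the caller side, assuming the grouped GEMM now returns a list with one output tensor per group. The helper name and shapes are hypothetical; the only confirmed piece is the `torch.stack` needed to recover the old single-blob layout when group shapes match.

```python
import torch

def consume_grouped_outputs(outputs):
    # The grouped GEMM is assumed to return one output tensor per group.
    # When every group has the same output shape, torch.stack recovers
    # the single-blob layout the BF16 path returned before this change.
    if all(o.shape == outputs[0].shape for o in outputs):
        return torch.stack(outputs)  # shape: [num_groups, M, N]
    return outputs  # ragged groups stay as a list

# Hypothetical outputs from a 3-group BF16 grouped GEMM with uniform shapes.
outputs = [torch.randn(16, 512, dtype=torch.bfloat16) for _ in range(3)]
blob = consume_grouped_outputs(outputs)
print(blob.shape)  # torch.Size([3, 16, 512])
```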
Force-pushed from 26ab8e8 to f2823db
Force-pushed from f2823db to df2c145
Force-pushed from df2c145 to 825c2cd
Force-pushed from 825c2cd to fe36ce2
Force-pushed from fe36ce2 to 92e2946
This pull request has been merged in 4c0d4f7.