
Improve performance of prefill mode FP8 Grouped Gemm #3522

Closed

Conversation

jwfromm
Contributor

@jwfromm jwfromm commented Dec 20, 2024

Summary: I previously assumed that using hipMemcpy would be more efficient than launching many kernels that directly set GPU memory. This assumption is apparently (and very surprisingly) untrue: the multi-kernel-launch approach reduces overhead considerably, giving a 10% speedup.

Differential Revision: D67531231

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D67531231


netlify bot commented Dec 20, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 47b6396
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67733c6fe95d5b0008e12fdc
😎 Deploy Preview https://deploy-preview-3522--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Dec 20, 2024
Summary:

X-link: facebookresearch/FBGEMM#603


Differential Revision: D67531231

jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Dec 22, 2024
Summary:
Pull Request resolved: pytorch#3522

X-link: facebookresearch/FBGEMM#603


Differential Revision: D67531231
Summary:

X-link: facebookresearch/FBGEMM#603

It turns out that setting up the grouped GEMM kernel arguments can be a significant overhead. This diff more carefully checks the number of groups and dispatches to either a hipMemcpy-based approach, which works well when there are 16 or more groups, or a series of kernels that directly set the GPU memory for each group. For smaller numbers of groups, this approach provides a pretty substantial speedup.

Reviewed By: jianyuh

Differential Revision: D67531231

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 8a1ca16.
