[PyTorch] Reduce the CPU overheads of GroupedLinear
#1072
Conversation
@timmoon10 Can you help review this PR?
/te-ci pytorch
Overall LGTM. What kinds of speedups are you seeing? It would be helpful to see which optimizations had the biggest impact so we can apply them to other PyTorch modules.
Here're some numbers for
/te-ci pytorch
Description
Try to reduce the CPU overheads of GroupedLinear by:

1. Using fused_multi_cast_transpose instead of iterating cast_transpose_fused in a for-loop (see the first sketch below).
   a. Changed the API of fused_multi_cast_transpose to avoid index-select ops in PyTorch.
   b. Allocated output tensors in C++.
2. Using at::cuda::current_device(), which has a cache, to get the current device id, avoiding cudaGetDriverEntryPoint calls (see the device-id sketch below).
3. Reducing torch.Tensor() calls (see the allocation sketch below).

Fix: grad_bias in fused_cast_transpose_bgrad when the input is empty (see the last sketch below).
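To illustrate item 1, here is a minimal sketch of the call pattern being replaced. The function names mirror the ones in the description, but the signatures and bodies are simplified stand-ins, not Transformer Engine's actual API; real TE kernels produce FP8 outputs, while plain dtypes are used here to keep the sketch runnable.

```python
import torch

# Stand-in for the per-tensor fused cast+transpose kernel: returns the
# casted tensor and its transpose for one 2-D input.
def cast_transpose_fused(x: torch.Tensor):
    return x.to(torch.float16), x.t().contiguous().to(torch.float16)

# Before: one Python-level call (plus kernel launch and output allocation)
# per grouped-GEMM input, paid on the CPU for every tensor in the group.
def cast_transpose_loop(inputs):
    outs = [cast_transpose_fused(x) for x in inputs]
    return [c for c, _ in outs], [t for _, t in outs]

# After (sketch of the idea): a single call takes the whole list, so the
# loop, the index-select ops, and the output allocations all move into one
# C++ entry point instead of round-tripping through Python per tensor.
def fused_multi_cast_transpose(inputs):
    return cast_transpose_loop(inputs)  # the real version is one batched op

# Usage: one call for the whole group of inputs.
casts, transposes = fused_multi_cast_transpose([torch.randn(4, 8) for _ in range(3)])
```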
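Item 2 relies on at::cuda::current_device() caching the device id on the C++ side, so hot-path code stops re-entering the driver. A rough Python analogue of the same caching idea, with a hypothetical helper name not taken from the PR:

```python
import torch

_cached_device_id = None  # simplification: assumes the active device never changes

def current_device_cached() -> int:
    # Hypothetical analogue of at::cuda::current_device()'s cache: query the
    # CUDA runtime once, then serve the id from Python on subsequent calls
    # instead of paying a driver-entry-point lookup each time.
    global _cached_device_id
    if _cached_device_id is None:
        _cached_device_id = torch.cuda.current_device()
    return _cached_device_id
```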
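For item 3, the description does not spell out how the torch.Tensor() calls were reduced; this sketch only illustrates why repeated placeholder construction costs CPU time, with a reuse strategy as one plausible shape of the fix:

```python
import torch

# Before: a fresh empty placeholder is constructed on every forward/backward
# call, paying tensor-construction overhead each time.
def placeholder_per_call():
    return torch.Tensor()

# After (sketch): build the placeholder once at import time and reuse it.
_EMPTY = torch.Tensor()

def placeholder_reused():
    return _EMPTY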
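The grad_bias fix guards the empty-input case. The actual patch is inside the fused_cast_transpose_bgrad kernel; this Python sketch is only a guess at the invariant being restored (bias gradient as a reduction over the token dimension), with simplified 2-D shapes:

```python
import torch

def bgrad(grad_output: torch.Tensor) -> torch.Tensor:
    # grad_bias is the reduction of grad_output over the token dimension.
    # When a group receives zero tokens, return zeros of the right shape
    # explicitly rather than reducing an empty tensor.
    if grad_output.shape[0] == 0:
        return torch.zeros(
            grad_output.shape[1],
            dtype=grad_output.dtype,
            device=grad_output.device,
        )
    return grad_output.sum(dim=0)
```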