linear_int4_kernel for XPU #1130
Reset to bfdbaf4 --------- Co-authored-by: mengfei25 <mengfei.li@Intel.com> Co-authored-by: LuFengqing <fengqing.lu@intel.com> Co-authored-by: Ratnam Parikh <114774508+ratnampa@users.noreply.github.com> Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
The biggest question is why we need post-op fusion here. Does PyTorch have it with CUDA?
@liangan1 CC
@sunjiweiswift for the perf benchmarking, please include other configs except M=1. This would serve as a reference for the final decision. I expect that large M would have worse perf, but that's fine; we still need to know the numbers.
#### Bugfix
- [add lazy init for empty_xpu](#1115)
- [nan propagation for soft_shrink](https://github.com/intel/torch-xpu-ops/pull/1116/files#diff-b7cb5876d000db957286c8b0e72badb2b7502402c8955334f1cc21c34c98a5b9)
---------
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>
faa79b7
to
5a08d2e
generally LGTM. good job:)
for (int i = 0; i < k; i += GroupK * Unroll) {
#pragma unroll
  for (int iu = 0; iu < Unroll; iu++) {
    uint8_t tmps8[TileK / 2];
Maybe we can use a small template trick to simplify this piece of logic: have one template that handles all scenarios, then pass the corresponding args at the call site.
template <typename scalar_t, int SgSize, int TileK, int Unroll>
void tinygemm_kernel(...);

if (k % (SgSize * 32 * Unroll) == 0) {
  // use tinygemm_kernel<...>
} else {
  // use tinygemm_kernel<...>
}
Not a must-have, just a small trick.
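A minimal sketch of the dispatch idea suggested above, written as plain host-side C++ rather than SYCL. The names `tinygemm_kernel_sketch` and `dispatch_tinygemm`, the values `SgSize = 16` and `TileK = 32`, and the choice of unroll factors 2 and 1 are all assumptions made for illustration; they are not taken from the PR. The stand-in kernel only reports which `Unroll` was selected, so the runtime-to-compile-time mapping is visible.

```cpp
#include <cassert>

// Hypothetical stand-in for the real kernel launcher. It only checks the
// divisibility precondition and returns the Unroll it was instantiated with.
template <typename scalar_t, int SgSize, int TileK, int Unroll>
int tinygemm_kernel_sketch(int k) {
  assert(k % (SgSize * TileK * Unroll) == 0);
  return Unroll;
}

// One entry point handles every case: the runtime value of k selects
// which compile-time instantiation is called.
template <typename scalar_t>
int dispatch_tinygemm(int k) {
  constexpr int SgSize = 16;  // assumed sub-group size
  constexpr int TileK = 32;   // assumed K elements per work-item
  if (k % (SgSize * TileK * 2) == 0) {
    // k covers whole unrolled steps: take the unrolled instantiation.
    return tinygemm_kernel_sketch<scalar_t, SgSize, TileK, 2>(k);
  }
  // Fallback: same template, no unrolling.
  return tinygemm_kernel_sketch<scalar_t, SgSize, TileK, 1>(k);
}
```

With these assumed constants, `dispatch_tinygemm<float>(1024)` takes the unrolled path and `dispatch_tinygemm<float>(512)` takes the fallback. The loop body exists once, and only the template arguments differ per branch.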
@xytintel CI has failed not only for this PR but for the last several runs. Could you check?
@EikanWang @liangan1 thoughts?
Pure SYCL path for int4 gemm
Benchmark results on PVC-1100. The remaining gap comes from not using 2D loads.
Besides PVC, the kernel can achieve:
- 92.7% bandwidth usage on MTL
- 84.7% bandwidth usage on A750
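For readers who want to reproduce a bandwidth-usage figure like the ones above, here is a rough sketch of the arithmetic. It assumes the metric is bytes moved divided by kernel time, relative to the device's peak memory bandwidth; the PR does not say exactly how its percentages were computed. The helper names are hypothetical, and the byte count covers only the packed int4 weights (two values per byte), ignoring scales, zero points and activations.

```cpp
#include <cassert>
#include <cstdint>

// Bytes occupied by an N x K int4 weight matrix, two 4-bit values per byte.
// Assumes N * K is even.
std::int64_t int4_weight_bytes(std::int64_t n, std::int64_t k) {
  assert((n * k) % 2 == 0);
  return n * k / 2;
}

// Fraction of peak bandwidth achieved: (bytes / seconds) / peak.
// peak_bytes_per_second must come from the device specification.
double bandwidth_utilization(std::int64_t bytes_moved,
                             double seconds,
                             double peak_bytes_per_second) {
  assert(seconds > 0.0 && peak_bytes_per_second > 0.0);
  return static_cast<double>(bytes_moved) / seconds / peak_bytes_per_second;
}
```

For M=1 the weight traffic dominates, so `bandwidth_utilization(int4_weight_bytes(n, k), t, peak)` is a reasonable first approximation. A more careful estimate would add the scale and activation bytes.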