linear_int4_kernel for XPU #1130

sunjiweiswift · 2024-11-29T09:29:31Z

Pure SYCL path for int4 GEMM.

Benchmark results on PVC-1100. The remaining gap is due to not using 2D loads.

| M | K | N | SrcT | WeiT | DstT | Bandwidth usage |
|---|---|---|---|---|---|---|
| 1 | 4096 | 4096 | float16 | float16 | float16 | 53.7% |
| 1 | 4096 | 11008 | float16 | float16 | float16 | 57.4% |
| 1 | 4096 | 16384 | float16 | float16 | float16 | 59.7% |
| 1 | 12288 | 4096 | float16 | float16 | float16 | 77.3% |
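
The PR does not say how the bandwidth-usage figures were derived. A minimal sketch of one plausible accounting, assuming fp16 activations, outputs, scales and zero points, two int4 weights packed per byte, and one scale/zero-point pair per quantization group along K. The names `GemmShape`, `bytes_moved` and `bandwidth_usage`, the group size, and the peak-bandwidth input are all hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of one benchmark configuration.
struct GemmShape {
  std::int64_t m, k, n;
  std::int64_t group_size;  // assumed quantization group size along K
};

// Bytes read and written for one call. The accounting below is an
// assumption for illustration, not taken from the PR.
inline double bytes_moved(const GemmShape& s) {
  assert(s.group_size > 0 && s.k % s.group_size == 0);
  const double fp16 = 2.0;                                // bytes per fp16 value
  const double act = double(s.m) * s.k * fp16;            // activations
  const double wei = double(s.k) * s.n / 2.0;             // two int4 values per byte
  const double groups = double(s.k / s.group_size);
  const double scale_zp = 2.0 * groups * s.n * fp16;      // scales + zero points
  const double out = double(s.m) * s.n * fp16;            // result
  return act + wei + scale_zp + out;
}

// Fraction of peak bandwidth achieved, given a measured kernel time.
inline double bandwidth_usage(const GemmShape& s, double seconds,
                              double peak_bytes_per_s) {
  assert(seconds > 0.0 && peak_bytes_per_s > 0.0);
  return bytes_moved(s) / seconds / peak_bytes_per_s;
}
```

With M=1 the weight term dominates, which is why these shapes are treated as bandwidth-bound; the measured time and the device's peak bandwidth have to be supplied by the benchmark.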

Besides PVC, the kernel achieves:

- 92.7% bandwidth usage on MTL
- 84.7% bandwidth usage on A750

Reset to bfdbaf4

---------

Co-authored-by: mengfei25 <mengfei.li@Intel.com>
Co-authored-by: LuFengqing <fengqing.lu@intel.com>
Co-authored-by: Ratnam Parikh <114774508+ratnampa@users.noreply.github.com>
Co-authored-by: Feng Yuan <feng1.yuan@intel.com>

mingfeima

The biggest question is why we need post-op fusion here. Does PyTorch have it with CUDA?

src/ATen/native/xpu/sycl/LinearInt4.cpp

test/xpu/test_int4_linear.py

mingfeima · 2024-12-02T02:11:04Z

@liangan1 CC

mingfeima · 2024-12-02T02:18:00Z

@sunjiweiswift for the perf benchmarking, please include configs other than M=1. These will serve as a reference for the final decision. I expect large M to perform worse, but that's fine; we still need to know the numbers.

#### Bugfix

- [add lazy init for empty_xpu](#1115)
- [nan propagation for soft_shrink](https://github.com/intel/torch-xpu-ops/pull/1116/files#diff-b7cb5876d000db957286c8b0e72badb2b7502402c8955334f1cc21c34c98a5b9)

---------

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>

Resolve: pytorch/pytorch#142102

test/xpu/test_linalg_xpu.py

src/ATen/native/xpu/sycl/LinearInt4.cpp

mingfeima

Generally LGTM. Good job :)

mingfeima · 2024-12-23T03:04:38Z

src/ATen/native/xpu/sycl/LinearInt4.cpp

+      for (int i = 0; i < k; i += GroupK * Unroll) {
+#pragma unroll
+        for (int iu = 0; iu < Unroll; iu++) {
+          uint8_t tmps8[TileK / 2];


Maybe we can use a little template trick to simplify this piece of logic: have one template that handles all scenarios, and pass the corresponding template arguments at the call site.

template <typename scalar_t, int SgSize, int TileK, int Unroll>
void tinygemm_kernel(...)

if (k % (SgSize * 32 * Unroll) == 0) {
  // use tinygemm_kernel<...>
} else {
  // use tinygemm_kernel<...>
}

Not a must-have, just a little trick.
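
A host-side sketch of that suggestion, assuming the two cases differ only in the `Unroll` factor. It uses a plain dot product as a stand-in for the SYCL kernel body; `dot_kernel`, `dispatch_dot` and the constants are hypothetical and do not reproduce the PR's code:

```cpp
#include <cassert>

// One template covers every case; the caller picks Unroll.
template <typename scalar_t, int SgSize, int TileK, int Unroll>
scalar_t dot_kernel(const scalar_t* a, const scalar_t* b, int k) {
  constexpr int Chunk = SgSize * TileK;   // elements per unrolled step
  constexpr int Step = Chunk * Unroll;    // elements per outer iteration
  assert(k % Step == 0);                  // guaranteed by the dispatcher
  scalar_t acc = 0;
  for (int i = 0; i < k; i += Step) {
    for (int iu = 0; iu < Unroll; ++iu) {
      const int base = i + iu * Chunk;
      for (int j = 0; j < Chunk; ++j) {
        acc += a[base + j] * b[base + j];
      }
    }
  }
  return acc;
}

// Runtime check selects the instantiation, as in the review comment.
template <typename scalar_t>
scalar_t dispatch_dot(const scalar_t* a, const scalar_t* b, int k) {
  constexpr int SgSize = 16, TileK = 32, Unroll = 2;  // hypothetical values
  if (k % (SgSize * TileK * Unroll) == 0) {
    return dot_kernel<scalar_t, SgSize, TileK, Unroll>(a, b, k);
  } else {
    // Fallback without unrolling; still assumes k is a multiple of one chunk.
    assert(k % (SgSize * TileK) == 0);
    return dot_kernel<scalar_t, SgSize, TileK, 1>(a, b, k);
  }
}
```

The loop body is written once, and the branch on `k` only decides which template arguments are passed, so the unrolled and non-unrolled paths cannot drift apart.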

airMeng · 2024-12-23T03:46:42Z

@xytintel it's not only this PR; the latest several CI runs have all failed. Could you check?

2024-12-23T02:49:25.9617774Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp: In function ‘bool sdp::can_use_mem_efficient_attention(sdp::sdp_params, bool)’:
2024-12-23T02:49:25.9624103Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:37:7: error: ‘array_of’ was not declared in this scope; did you mean ‘c10::array_of’?
2024-12-23T02:49:25.9625568Z    37 |       array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9626012Z       |       ^~~~~~~~
2024-12-23T02:49:25.9626381Z       |       c10::array_of
2024-12-23T02:49:25.9627538Z In file included from /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:2:
2024-12-23T02:49:25.9629144Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/c10/util/Array.h:14:23: note: ‘c10::array_of’ declared here
2024-12-23T02:49:25.9630059Z    14 | inline constexpr auto array_of(T&&... t) -> std::array<V, sizeof...(T)> {
2024-12-23T02:49:25.9630604Z       |                       ^~~~~~~~
2024-12-23T02:49:25.9631919Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:37:16: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9633177Z    37 |       array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9633623Z       |                ^~~~
2024-12-23T02:49:25.9635453Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:50:18: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9636852Z    50 |         array_of<bool (*)(sdp_params const&, bool)>(
2024-12-23T02:49:25.9637304Z       |                  ^~~~
2024-12-23T02:49:25.9638691Z /home/sdp/actions-runner-1/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/ATen/native/transformers/SDPUtils.cpp:63:18: error: expected primary-expression before ‘bool’
2024-12-23T02:49:25.9639958Z    63 |         array_of<bool (*)(sdp_params const&, bool)>(
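
The compiler's own hint points at the likely cause: `array_of` lives in namespace `c10`, while the call site is in namespace `sdp` and uses the name unqualified. A self-contained sketch of the pattern and of the qualified call, using stand-in namespaces `util` and `sdp_demo` rather than the real headers; this illustrates the hint and is not necessarily the fix that was applied upstream:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

namespace util {  // stand-in for c10; not the real header
template <typename V, typename... T>
inline constexpr auto array_of(T&&... t) -> std::array<V, sizeof...(T)> {
  return {{static_cast<V>(std::forward<T>(t))...}};
}
}  // namespace util

namespace sdp_demo {  // stand-in for the sdp namespace in the log
struct params {
  bool requires_grad;
};
using check_fn = bool (*)(params const&, bool);

inline bool check_a(params const&, bool) { return true; }
inline bool check_b(params const& p, bool) { return !p.requires_grad; }

inline bool can_use(params const& p, bool debug) {
  // An unqualified `array_of<check_fn>(...)` is not visible from this
  // namespace; qualifying the name resolves the lookup.
  constexpr auto constraints = util::array_of<check_fn>(check_a, check_b);
  static_assert(constraints.size() == 2, "two checks registered");
  for (auto fn : constraints) {
    assert(fn != nullptr);
    if (!fn(p, debug)) {
      return false;
    }
  }
  return true;
}
}  // namespace sdp_demo
```

A `using util::array_of;` declaration inside `sdp_demo` would work as well; qualifying the call keeps the change local to the failing lines.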

mingfeima · 2025-01-02T06:39:28Z

@EikanWang @liangan1 thoughts?

sunjiweiswift changed the title ~~Fp zp~~ linear_int4_kernel for XPU Nov 29, 2024

mingfeima requested changes Dec 2, 2024

xytintel and others added 2 commits December 3, 2024 14:55

[Release-2.6] Capture rrelu_with_noise noise mutation in compile (#1145)

7ecb0b1

Resolve: pytorch/pytorch#142102

sunjiweiswift force-pushed the fp_zp branch 2 times, most recently from faa79b7 to 5a08d2e on December 9, 2024 05:25

airMeng and others added 15 commits December 11, 2024 09:07

contiguous layout for sycl int4 kernel

5410f51

push without compile

e9311a3

update linearkernel

e3eaffa

fix some comiple error(not all)

2a664af

add sycl_ker_config_convention

0156ba5

reg kernel for pytorch

a58afec

add yaml for int4mm

f487b20

update yaml file

ce1c894

Modified some review comments

d61b198

modify fun name

d76a0ce

autogen: _weight_int4pack_mm_with_scales_and_zeros.out

870a3b5

param int->int64_t(python int is int64)

a9627f6

use AT_DISPATCH_FLOATING_TYPES_AND

952ead9

Keep the same name as pytorch's _weight_int4pack_mm

93804f9

modify UT for int4

9e50b68

sunjiweiswift force-pushed the fp_zp branch from 2424d54 to 4dfd8bd on December 12, 2024 07:13

sync UT with pytoch UT(linalg)

81a72f1

sunjiweiswift force-pushed the fp_zp branch from 4dfd8bd to 81a72f1 on December 12, 2024 07:15

sunjiweiswift added 3 commits December 12, 2024 07:23

col-major

a70df0a

UT pass for B ones

c08382c

update gemv

14bb4e0

sunjiweiswift added 3 commits December 17, 2024 03:10

fix scale and zp address

70a3e13

fix K large than 1024 UT

a590ad6

bug fix for FP16(BF16 maybe incorrect)

d6a2f3a

sunjiweiswift force-pushed the fp_zp branch from 78433cb to d6a2f3a on December 18, 2024 09:07

sunjiweiswift and others added 2 commits December 20, 2024 05:29

save

27f18c2

Merge branch 'main' into fp_zp

7f94b9b

airMeng requested a review from mingfeima December 20, 2024 07:09

airMeng reviewed Dec 20, 2024

test/xpu/test_linalg_xpu.py (outdated, resolved)

bugfix for Big Endian

42c18e9

airMeng reviewed Dec 20, 2024

src/ATen/native/xpu/sycl/LinearInt4.cpp (outdated, resolved)

Unify BF16 and FP16 Funtion

d832050

sunjiweiswift force-pushed the fp_zp branch from 6ecfa50 to d832050 on December 20, 2024 09:27

sunjiweiswift requested a review from airMeng December 20, 2024 09:33

fix compile warning

8385f7e

airMeng reviewed Dec 20, 2024

src/ATen/native/xpu/sycl/LinearInt4.cpp (outdated, resolved)

airMeng reviewed Dec 20, 2024

src/ATen/native/xpu/sycl/LinearInt4.cpp (outdated, resolved)

modify by review

f44ed70

mingfeima approved these changes Dec 23, 2024

sunjiweiswift added 3 commits December 24, 2024 16:58

Merge branch 'main' into fp_zp

09696b1

Merge branch 'main' into fp_zp

ebe8c7c

Merge branch 'main' into fp_zp

ce6c16b

mingfeima requested review from EikanWang and liangan1 January 2, 2025 06:39

Merge branch 'main' into fp_zp

dacf3b9
