Peeling last loop iteration (K), changing kPerBlock for the last iteration #1681

dhernandez0 · 2024-10-14T15:10:14Z

In this PR we peel the last iteration of the K loop and use a different kPerBlock for that last iteration. Note that we don't change the underlying mfma/wmma instruction. This is implemented for GemmAccel only.
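To make the idea concrete, here is a minimal pure-Python sketch of peeling the K tail of a blocked GEMM. It is not the code in this PR: `gemm_peeled_k`, `k_granule` (standing in for the K size the unchanged mfma/wmma instruction consumes), and the choice to peel only when K is not a multiple of `k_per_block` are all assumptions made for illustration.

```python
def gemm_peeled_k(A, B, k_per_block, k_granule):
    """Blocked C = A x B where the K tail is peeled out of the main loop.

    Hypothetical sketch: k_granule models the K size required by the
    (unchanged) accelerator instruction; the tail block is rounded up to it.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]

    def accumulate(k_start, k_len, check_bounds):
        for k in range(k_start, k_start + k_len):
            # Only the peeled iteration pays for the bounds check;
            # out-of-range positions behave like materialized zeros.
            if check_bounds and k >= K:
                continue
            for i in range(M):
                for j in range(N):
                    C[i][j] += A[i][k] * B[k][j]

    full_iters, tail = divmod(K, k_per_block)

    # Main loop: every block is fully in range, so no padding checks.
    for it in range(full_iters):
        accumulate(it * k_per_block, k_per_block, check_bounds=False)

    # Peeled last iteration: a smaller kPerBlock that just covers the tail.
    tail_block = 0
    if tail:
        tail_block = -(-tail // k_granule) * k_granule  # round up to granule
        accumulate(full_iters * k_per_block, tail_block, check_bounds=True)

    return C, tail_block
```

With K = 5, `k_per_block = 4` and `k_granule = 2`, the main loop runs once over 4 elements and the peeled iteration uses a block of 2 instead of a second, mostly padded block of 4.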

MI200 performance:

I didn't see any advantage on the sdxl-conv-configs, sdxl-gemm-configs, or attention-configs performance files.
However, if I modify sdxl-conv-configs so that the input channels are increased by one (for example, 961 instead of 960), we get some speedup in some cases:

config #	baseline TFlops	PR TFlops	speedup
0	37.05134	37.8174273	1.02
1	33.91353	34.8504553	1.03
2	37.50849	39.0126063	1.04
3	36.58647	37.8949534	1.04
4	31.26327	33.0823337	1.06
5	33.69692	34.9966659	1.04
6	25.69312	26.2322019	1.02
7	34.68515	36.5979827	1.06
8	35.8654	36.5057035	1.02
9	58.28759	58.2399655	1.00
10	62.06203	62.0663362	1.00
11	50.07341	49.0323258	0.98
12	60.10395	58.5503965	0.97
13	51.28793	50.0925666	0.98
14	60.34764	60.8887701	1.01
15	40.26413	40.7533634	1.01
16	33.21385	32.2993437	0.97
17	2.409298	2.44047801	1.01
18	49.87407	49.4921623	0.99
19	20.63765	21.1103794	1.02
20	40.52892	39.8692318	0.98
21	57.04472	55.8216153	0.98
22	53.64128	53.9505246	1.01
23	49.71588	50.1225875	1.01
24	43.99072	44.5737981	1.01
25	40.6225	40.990176	1.01
26	49.51871	47.8835196	0.97

MI300 performance (resnet50):

The ratios are almost all 1.0 for the unet files. The same holds for resnet50:

mxr file	baseline	peelk	ratio
mlir_convolution_512x512x7x7s25088x1x3584x512_2048x512x1x1.mxr	5775.49	5677.44	0.983023
mlir_convolution_add_add_relu_512x128x28x28s100352x1x3584x128_512x128x1x1.mxr	2359.32	2449.32	1.038147
mlir_convolution_add_add_relu_512x256x14x14s50176x1x3584x256_1024x256x1x1.mxr	3273.65	3250.07	0.992797
mlir_convolution_add_add_relu_512x512x7x7s25088x1x3584x512_2048x512x1x1.mxr	4943.25	4892.53	0.98974
mlir_convolution_add_add_relu_512x64x56x56s200704x1x3584x64_256x64x1x1.mxr	1418.99	1419.2	1.000148
mlir_convolution_add_relu_512x1024x14x14s200704x1x14336x1024_256x1024x1x1.mxr	5363.94	5294.66	0.987084
mlir_convolution_add_relu_512x1024x14x14s200704x1x14336x1024_512x1024x1x1.mxr	3103.35	3090.83	0.995966
mlir_convolution_add_relu_512x128x56x56s401408x1x7168x128_256x128x1x1.mxr	1738.02	1739.69	1.000961
mlir_convolution_add_relu_512x1536x7x7s75264x1x10752x1536_2048x1536x1x1.mxr	2362.74	2383.75	1.008892
mlir_convolution_add_relu_512x2048x7x7s100352x1x14336x2048_512x2048x1x1.mxr	5687.08	5817.23	1.022885
mlir_convolution_add_relu_512x256x56x56s802816x1x14336x256_128x256x1x1.mxr	1873.7	1883.33	1.00514
mlir_convolution_add_relu_512x256x56x56s802816x1x14336x256_64x256x1x1.mxr	2667.9	2710.82	1.016088
mlir_convolution_add_relu_512x384x28x28s301056x1x10752x384_512x384x1x1.mxr	1848.73	1812.72	0.980522
mlir_convolution_add_relu_512x3x224x224s150528x1x672x3_64x3x7x7s147x1x21x3.mxr	979.562	981.472	1.00195
mlir_convolution_add_relu_512x512x28x28s401408x1x14336x512_128x512x1x1.mxr	3909.07	4007.9	1.025282
mlir_convolution_add_relu_512x512x28x28s401408x1x14336x512_256x512x1x1.mxr	2344.05	2360.19	1.006886
mlir_convolution_add_relu_512x64x56x56s200704x1x3584x64_64x64x1x1.mxr	6782.85	6958.76	1.025935
mlir_convolution_add_relu_512x768x14x14s150528x1x10752x768_1024x768x1x1.mxr	2189.44	2178.83	0.995154

codecov · 2024-10-15T16:34:30Z

Codecov Report

Attention: Patch coverage is 88.93130% with 58 lines in your changes missing coverage. Please review.

Project coverage is 77.95%. Comparing base (5f51701) to head (8fbc8e0).
Report is 5 commits behind head on develop.

Files with missing lines	Patch %	Lines
...ialect/Rock/Transforms/GridwiseGemmToBlockwise.cpp	88.20%	25 Missing and 25 partials ⚠️
...lir/lib/Dialect/Rock/utility/transformMapUtils.cpp	53.84%	3 Missing and 3 partials ⚠️
mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp	83.33%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1681      +/-   ##
===========================================
+ Coverage    77.76%   77.95%   +0.18%     
===========================================
  Files          100      100              
  Lines        27866    28147     +281     
  Branches      4063     4115      +52     
===========================================
+ Hits         21671    21942     +271     
  Misses        4540     4540              
- Partials      1655     1665      +10

Flag	Coverage Δ
mfma	`77.95% <88.93%> (+0.18%)`	⬆️


pfultz2 · 2024-10-15T21:42:53Z

How does this improve the performance for the resnet50 configs?

manupak · 2024-10-16T13:03:41Z

I'm not actively reviewing the PR right now, as we discussed.

My understanding is that, if things are performant, we'll do an initial lowering of:

rock.gridwise_gemm C = A x B

to

rock.gridwise_partial_gemm C += A[ : , 0:k1] x B[0:k1 , :]
rock.gridwise_partial_gemm C += A[ : , k1:K] x B[k1:K , :]
rock.threadwise_write_all C

where the code here will be placed under the lowering of rock.gridwise_partial_gemm.
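The proposed split relies on the fact that a GEMM can be accumulated over disjoint K ranges. A minimal pure-Python sketch of that identity follows; the function names mirror the op names in the comment above, but they are illustrative stand-ins and not the Rock dialect's API.

```python
def gridwise_partial_gemm(C, A, B, k_lo, k_hi):
    """C += A[:, k_lo:k_hi] x B[k_lo:k_hi, :], accumulating in place."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] += sum(A[i][k] * B[k][j] for k in range(k_lo, k_hi))


def gridwise_gemm_split(A, B, k1):
    """Compute C = A x B as two partial GEMMs split at k1 (hypothetical)."""
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]
    gridwise_partial_gemm(acc, A, B, 0, k1)   # main part of K
    gridwise_partial_gemm(acc, A, B, k1, K)   # peeled remainder of K
    # Copying the accumulator out stands in for rock.threadwise_write_all.
    return [row[:] for row in acc]
```

For any 0 <= k1 <= K the result matches a single full GEMM, which is what lets the two partial ops use different kPerBlock values.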

dhernandez0 self-assigned this Oct 14, 2024

dhernandez0 requested review from jerryyin and sjw36 as code owners October 14, 2024 15:10

dhernandez0 force-pushed the 1545-try-peeling-last-loop-iteration-eliding-gemm-k-padding-checks branch from d022835 to a333d6b Compare October 15, 2024 07:17

dhernandez0 changed the title ~~Try peeling last loop iteration eliding gemm k padding checks~~ Peeling last loop iteration (K), changing kPerBlock for the last iteration Oct 15, 2024

dhernandez0 requested review from krzysz00 and manupak October 15, 2024 09:43

dhernandez0 force-pushed the 1545-try-peeling-last-loop-iteration-eliding-gemm-k-padding-checks branch 4 times, most recently from 4199151 to 5776ff2 Compare October 28, 2024 16:07

dhernandez0 added 2 commits October 29, 2024 14:54

Peel last iteration of k loop

3006f68

No need to pad for LDS write, we want to materialize the zeros

8fbc8e0

dhernandez0 force-pushed the 1545-try-peeling-last-loop-iteration-eliding-gemm-k-padding-checks branch from 6f43e33 to 8fbc8e0 Compare October 29, 2024 14:54
