Peeling last loop iteration (K), changing kPerBlock for the last iteration #1681

Open
wants to merge 2 commits into
base: develop

dhernandez0
Contributor

@dhernandez0 dhernandez0 commented Oct 14, 2024

In this PR we peel the last iteration of the K loop and use a different kPerBlock for that last iteration. Note that we don't change the underlying mfma/wmma instruction. This is implemented for GemmAccel only.
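As a rough illustration of the idea, here is a minimal Python sketch of a GEMM whose K loop runs every iteration except the last at a fixed `k_per_block`, with the last iteration peeled out and run at a different width. The function name `gemm_peeled_k` and the choice of making the tail exactly as wide as the remaining K elements are assumptions for illustration; they are not taken from the GemmAccel lowering in this PR.

```python
def gemm_peeled_k(a, b, k_per_block):
    """Compute C = A x B with the last K-loop iteration peeled.

    Hypothetical sketch: the peeled iteration uses a width equal to the
    remaining K elements instead of the regular k_per_block.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0] * n for _ in range(m)]
    if k == 0:
        return c

    def accumulate(k_start, k_width):
        # One K-loop step: C += A[:, k_start:k_start+k_width] x B[k_start:k_start+k_width, :]
        for i in range(m):
            for j in range(n):
                c[i][j] += sum(
                    a[i][kk] * b[kk][j] for kk in range(k_start, k_start + k_width)
                )

    num_iters = -(-k // k_per_block)  # ceil(k / k_per_block)

    # Main loop: every iteration except the last, each k_per_block wide.
    for it in range(num_iters - 1):
        accumulate(it * k_per_block, k_per_block)

    # Peeled last iteration: its width may differ from k_per_block.
    last_start = (num_iters - 1) * k_per_block
    accumulate(last_start, k - last_start)
    return c
```

In this sketch the main loop never touches out-of-range K indices, so only the peeled step has to deal with a K extent that is not a multiple of `k_per_block`.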

MI200 performance:

I didn't see any advantage for the sdxl-conv-configs, sdxl-gemm-configs or attention-configs performance files.
However, if I modify sdxl-conv-configs to have input channels +1 (for example, 961 instead of 960), we get some speed-up in some cases:

| | baseline TFlops | PR TFlops | speed up |
|---|---|---|---|
| 0 | 37.05134 | 37.8174273 | 1.02 |
| 1 | 33.91353 | 34.8504553 | 1.03 |
| 2 | 37.50849 | 39.0126063 | 1.04 |
| 3 | 36.58647 | 37.8949534 | 1.04 |
| 4 | 31.26327 | 33.0823337 | 1.06 |
| 5 | 33.69692 | 34.9966659 | 1.04 |
| 6 | 25.69312 | 26.2322019 | 1.02 |
| 7 | 34.68515 | 36.5979827 | 1.06 |
| 8 | 35.8654 | 36.5057035 | 1.02 |
| 9 | 58.28759 | 58.2399655 | 1.00 |
| 10 | 62.06203 | 62.0663362 | 1.00 |
| 11 | 50.07341 | 49.0323258 | 0.98 |
| 12 | 60.10395 | 58.5503965 | 0.97 |
| 13 | 51.28793 | 50.0925666 | 0.98 |
| 14 | 60.34764 | 60.8887701 | 1.01 |
| 15 | 40.26413 | 40.7533634 | 1.01 |
| 16 | 33.21385 | 32.2993437 | 0.97 |
| 17 | 2.409298 | 2.44047801 | 1.01 |
| 18 | 49.87407 | 49.4921623 | 0.99 |
| 19 | 20.63765 | 21.1103794 | 1.02 |
| 20 | 40.52892 | 39.8692318 | 0.98 |
| 21 | 57.04472 | 55.8216153 | 0.98 |
| 22 | 53.64128 | 53.9505246 | 1.01 |
| 23 | 49.71588 | 50.1225875 | 1.01 |
| 24 | 43.99072 | 44.5737981 | 1.01 |
| 25 | 40.6225 | 40.990176 | 1.01 |
| 26 | 49.51871 | 47.8835196 | 0.97 |
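To make the 960 vs. 961 comparison concrete, the following sketch shows how a K extent splits into full `kPerBlock` iterations plus a remainder, and how large K would become if it were padded up to a multiple of `kPerBlock` instead. The `kPerBlock` values 8, 16 and 32 are hypothetical, and the sketch treats gemm K as equal to the input channel count, which is a simplifying assumption rather than something stated in the PR.

```python
def split_k(k, k_per_block):
    """Return (full_iterations, tail, padded_k) for a K extent.

    tail is the number of K elements left over after the full iterations;
    padded_k is K rounded up to the next multiple of k_per_block.
    """
    full, tail = divmod(k, k_per_block)
    padded_k = k if tail == 0 else (full + 1) * k_per_block
    return full, tail, padded_k


# Hypothetical kPerBlock values; K is assumed to equal the channel count.
for k in (960, 961):
    for k_per_block in (8, 16, 32):
        full, tail, padded_k = split_k(k, k_per_block)
        print(f"K={k} kPerBlock={k_per_block}: "
              f"{full} full iterations, tail={tail}, padded K={padded_k}")
```

With these assumed values, K=960 always has a tail of 0, while K=961 always leaves a tail of 1, which is the situation where a differently sized last iteration could matter.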

MI300 performance (resnet50):

The ratios are almost all 1.0 for the unet files. The same holds for resnet50:

| mxr file | baseline | peelk | ratio |
|---|---|---|---|
| mlir_convolution_512x512x7x7s25088x1x3584x512_2048x512x1x1.mxr | 5775.49 | 5677.44 | 0.983023 |
| mlir_convolution_add_add_relu_512x128x28x28s100352x1x3584x128_512x128x1x1.mxr | 2359.32 | 2449.32 | 1.038147 |
| mlir_convolution_add_add_relu_512x256x14x14s50176x1x3584x256_1024x256x1x1.mxr | 3273.65 | 3250.07 | 0.992797 |
| mlir_convolution_add_add_relu_512x512x7x7s25088x1x3584x512_2048x512x1x1.mxr | 4943.25 | 4892.53 | 0.98974 |
| mlir_convolution_add_add_relu_512x64x56x56s200704x1x3584x64_256x64x1x1.mxr | 1418.99 | 1419.2 | 1.000148 |
| mlir_convolution_add_relu_512x1024x14x14s200704x1x14336x1024_256x1024x1x1.mxr | 5363.94 | 5294.66 | 0.987084 |
| mlir_convolution_add_relu_512x1024x14x14s200704x1x14336x1024_512x1024x1x1.mxr | 3103.35 | 3090.83 | 0.995966 |
| mlir_convolution_add_relu_512x128x56x56s401408x1x7168x128_256x128x1x1.mxr | 1738.02 | 1739.69 | 1.000961 |
| mlir_convolution_add_relu_512x1536x7x7s75264x1x10752x1536_2048x1536x1x1.mxr | 2362.74 | 2383.75 | 1.008892 |
| mlir_convolution_add_relu_512x2048x7x7s100352x1x14336x2048_512x2048x1x1.mxr | 5687.08 | 5817.23 | 1.022885 |
| mlir_convolution_add_relu_512x256x56x56s802816x1x14336x256_128x256x1x1.mxr | 1873.7 | 1883.33 | 1.00514 |
| mlir_convolution_add_relu_512x256x56x56s802816x1x14336x256_64x256x1x1.mxr | 2667.9 | 2710.82 | 1.016088 |
| mlir_convolution_add_relu_512x384x28x28s301056x1x10752x384_512x384x1x1.mxr | 1848.73 | 1812.72 | 0.980522 |
| mlir_convolution_add_relu_512x3x224x224s150528x1x672x3_64x3x7x7s147x1x21x3.mxr | 979.562 | 981.472 | 1.00195 |
| mlir_convolution_add_relu_512x512x28x28s401408x1x14336x512_128x512x1x1.mxr | 3909.07 | 4007.9 | 1.025282 |
| mlir_convolution_add_relu_512x512x28x28s401408x1x14336x512_256x512x1x1.mxr | 2344.05 | 2360.19 | 1.006886 |
| mlir_convolution_add_relu_512x64x56x56s200704x1x3584x64_64x64x1x1.mxr | 6782.85 | 6958.76 | 1.025935 |
| mlir_convolution_add_relu_512x768x14x14s150528x1x10752x768_1024x768x1x1.mxr | 2189.44 | 2178.83 | 0.995154 |

@dhernandez0 dhernandez0 self-assigned this Oct 14, 2024
@dhernandez0 dhernandez0 force-pushed the 1545-try-peeling-last-loop-iteration-eliding-gemm-k-padding-checks branch from d022835 to a333d6b Compare October 15, 2024 07:17
@dhernandez0 dhernandez0 changed the title Try peeling last loop iteration eliding gemm k padding checks Peeling last loop iteration (K), changing kPerBlock for the last iteration Oct 15, 2024

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 88.93130% with 58 lines in your changes missing coverage. Please review.

Project coverage is 77.95%. Comparing base (5f51701) to head (8fbc8e0).
Report is 5 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ialect/Rock/Transforms/GridwiseGemmToBlockwise.cpp | 88.20% | 25 Missing and 25 partials ⚠️ |
| ...lir/lib/Dialect/Rock/utility/transformMapUtils.cpp | 53.84% | 3 Missing and 3 partials ⚠️ |
| mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp | 83.33% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1681      +/-   ##
===========================================
+ Coverage    77.76%   77.95%   +0.18%     
===========================================
  Files          100      100              
  Lines        27866    28147     +281     
  Branches      4063     4115      +52     
===========================================
+ Hits         21671    21942     +271     
  Misses        4540     4540              
- Partials      1655     1665      +10     
| Flag | Coverage Δ |
|---|---|
| mfma | 77.95% <88.93%> (+0.18%) ⬆️ |


@pfultz2

pfultz2 commented Oct 15, 2024

How does this improve the performance for the resnet50 configs?

@manupak
Contributor

manupak commented Oct 16, 2024

I'm not actively reviewing the PR right now, as we discussed.

My understanding is that, if things are performant, we'll do an initial lowering of:

rock.gridwise_gemm C = A x B

to

rock.gridwise_partial_gemm C += A[ : , 0:k1] x B[0:k1 , :]
rock.gridwise_partial_gemm C += A[ : , k1:K] x B[k1:K , :]
rock.threadwise_write_all C 

where the code in this PR will be put under the lowering of rock.gridwise_partial_gemm
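The proposed split relies on the fact that a GEMM can be accumulated over disjoint K ranges. A minimal Python sketch of that decomposition follows; `partial_gemm` and `gemm_as_two_partials` are hypothetical stand-ins for illustration only, not the semantics of the actual `rock.gridwise_partial_gemm` op, and the final write-out of C is modelled simply as returning the accumulator.

```python
def partial_gemm(c, a, b, k_begin, k_end):
    """Accumulate C += A[:, k_begin:k_end] x B[k_begin:k_end, :] in place."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            for kk in range(k_begin, k_end):
                c[i][j] += a[i][kk] * b[kk][j]


def gemm_as_two_partials(a, b, k1):
    """Compute C = A x B as two partial GEMMs split at K index k1."""
    k = len(a[0])
    c = [[0] * len(b[0]) for _ in a]  # accumulator starts at zero
    partial_gemm(c, a, b, 0, k1)      # C += A[:, 0:k1] x B[0:k1, :]
    partial_gemm(c, a, b, k1, k)      # C += A[:, k1:K] x B[k1:K, :]
    return c                          # stands in for the final write of C
```

Any split point `0 <= k1 <= K` gives the same result as the unsplit GEMM in exact arithmetic; with floating-point inputs the accumulation order could change the rounding slightly.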

@dhernandez0 dhernandez0 force-pushed the 1545-try-peeling-last-loop-iteration-eliding-gemm-k-padding-checks branch 4 times, most recently from 4199151 to 5776ff2 Compare October 28, 2024 16:07
@dhernandez0 dhernandez0 force-pushed the 1545-try-peeling-last-loop-iteration-eliding-gemm-k-padding-checks branch from 6f43e33 to 8fbc8e0 Compare October 29, 2024 14:54
3 participants