Peeling last loop iteration (K), changing kPerBlock for the last iteration #1681
Codecov Report
Coverage diff (develop vs. #1681):
            develop    #1681    +/-
Coverage    77.76%     77.95%   +0.18%
Files       100        100
Lines       27866      28147    +281
Branches    4063       4115     +52
Hits        21671      21942    +271
Misses      4540       4540
Partials    1655       1665     +10
How does this improve the performance for the resnet50 configs?
I'm not actively reviewing the PR now, as we discussed. My understanding is that, if things are performant, we'll do an initial lowering of:
rock.gridwise_gemm C = A x B
to
rock.gridwise_partial_gemm C += A[:, 0:k1] x B[0:k1, :]
rock.gridwise_partial_gemm C += A[:, k1:K] x B[k1:K, :]
rock.threadwise_write_all C
where the code here will be put under lowering of
In this PR we peel the last iteration of the K loop and use a different kPerBlock for that last iteration. Note that we don't change the underlying mfma/wmma instruction. This is implemented for GemmAccel only.
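As a rough illustration of the idea (not the actual rocMLIR lowering), the sketch below runs a blocked GEMM over K in plain Python: the main loop consumes full blocks of `k_per_block`, and a peeled final iteration handles whatever remains with a smaller block size. The names `split_k` and `gemm_peeled` are hypothetical, and taking the tail size as `K mod k_per_block` is an assumption; the PR does not state how the last iteration's kPerBlock is chosen.

```python
def split_k(K, k_per_block):
    """Return (number of full K blocks, size of the peeled tail block)."""
    full_blocks, tail = divmod(K, k_per_block)
    return full_blocks, tail


def gemm_peeled(A, B, k_per_block):
    """Compute C = A x B, accumulating over K in blocks.

    A is M x K and B is K x N, both as lists of lists.
    Assumption: the peeled last iteration uses a block of size K % k_per_block.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]

    def accumulate(k_begin, k_end):
        # C += A[:, k_begin:k_end] x B[k_begin:k_end, :]
        for m in range(M):
            for n in range(N):
                C[m][n] += sum(A[m][k] * B[k][n] for k in range(k_begin, k_end))

    full_blocks, tail = split_k(K, k_per_block)

    # Main loop: every iteration uses the same k_per_block.
    for i in range(full_blocks):
        accumulate(i * k_per_block, (i + 1) * k_per_block)

    # Peeled last iteration: same accumulation, smaller block size.
    if tail:
        accumulate(full_blocks * k_per_block, K)

    return C
```

For example, with a hypothetical `k_per_block` of 64, `split_k(960, 64)` gives 15 full blocks and no tail, while `split_k(961, 64)` gives 15 full blocks plus a tail of size 1. Treating K as 961 here is only for illustration; in a real convolution the GEMM K dimension also depends on the filter size.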
MI200 performance:
I didn't see any advantage for the sdxl-conv-configs, sdxl-gemm-configs, or attention-configs performance files.
However, if I modify sdxl-conv-configs to have one extra input channel (for example, 961 instead of 960), we get some speedup in some cases:
MI300 performance (resnet50):
Almost all ratios are 1 (no change) for the unet files. The same holds for resnet50: