[GPU] Consider small kInput fusions with concatenations in the horizontal loop fusion pass. #19372
Do we know this is always strictly better? Do you have an HloModule with benchmark results?
The pass only fuses small fusions, whose runtime is dominated by roughly constant per-kernel overheads. Fusing these horizontally should always be better than using the special emitter for them. Yes, this change is motivated by a benchmark. Below I'll paste data from a synthetic one, which uses the maximum output tensor size for this kind of fusion:
input:
loop:
fused:
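To illustrate the overhead argument, here is a minimal toy cost model, not the XLA cost model and not a measurement. It assumes each kernel pays one fixed launch overhead and that the useful work of the fused kernels simply adds up. The function names and all numbers are hypothetical.

```python
def unfused_time_us(work_us, launch_overhead_us):
    """Each small fusion runs as its own kernel and pays the overhead once."""
    return sum(launch_overhead_us + w for w in work_us)


def horizontally_fused_time_us(work_us, launch_overhead_us):
    """All fusions share a single kernel launch; useful work is assumed additive."""
    return launch_overhead_us + sum(work_us)


# Hypothetical numbers for illustration only (not benchmark results).
small_work_us = [0.5, 0.5, 0.5, 0.5]  # four tiny fusions
overhead_us = 5.0                     # assumed constant per-kernel overhead

separate = unfused_time_us(small_work_us, overhead_us)
fused = horizontally_fused_time_us(small_work_us, overhead_us)
print(f"separate kernels: {separate} us, one fused kernel: {fused} us")

# With large per-fusion work the same overhead saving becomes relatively minor.
large_work_us = [1000.0] * 4
print(unfused_time_us(large_work_us, overhead_us),
      horizontally_fused_time_us(large_work_us, overhead_us))
```

Under these assumptions the saving is (n - 1) times the per-kernel overhead, which dominates only when the individual fusions are small. That is consistent with restricting the pass to small fusions.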
I realized the benchmark can be improved by 1) adding some elementwise ops so that the special path for copy is not used, and 2) testing the two paths of the horizontal loop fusion pass separately, for same and different output shapes, as they have different thresholds. With the same shapes, because of the higher threshold, the fusions can be more substantial, and the effect of this change is still positive but less pronounced. So, to be safe, I should probably restrict this change to small fusions only. Different shapes:
Same shapes:
different_fused.txt
9f7ffe2 to 9990047
I updated the change; please take another look.
…ntal loop fusion pass.
No description provided.