
[GPU] Consider small kInput fusions with concatenations in the horizontal loop fusion pass. #19372

Open · wants to merge 1 commit into main
Conversation

sergachev
Contributor

No description provided.

@akuegel
Member

akuegel commented Nov 15, 2024

Do we know this is always strictly better? Do you have a HloModule with benchmark results?
The input fusions with concatenate root use the dedicated concat emitter which can be quite a bit faster than emitting the concat as part of a loop fusion.

@sergachev
Contributor Author

The pass only fuses small fusions, whose runtime is dominated by approximately constant per-kernel overheads. Fusing these horizontally should always be better than using the special emitter for them.
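The per-kernel-overhead argument can be sketched with a toy cost model (all numbers here are illustrative assumptions, not measurements from the benchmark below): N separate small kernels pay the launch/dispatch overhead N times, while one horizontally fused kernel pays it once.

```python
# Toy cost model for horizontal fusion of small kernels.
# All numbers are illustrative assumptions, not measurements.

KERNEL_OVERHEAD_US = 5.0  # assumed ~constant per-kernel overhead

def separate_kernels_us(work_us_per_kernel, n_kernels):
    """N independent kernels: each one pays the overhead."""
    return n_kernels * (KERNEL_OVERHEAD_US + work_us_per_kernel)

def fused_kernel_us(work_us_per_kernel, n_kernels):
    """One horizontally fused kernel: the overhead is paid once."""
    return KERNEL_OVERHEAD_US + n_kernels * work_us_per_kernel

# For small fusions, where the useful work is comparable to the
# overhead, fusing wins; as the work grows, the advantage shrinks.
small_work = 2.0
print(separate_kernels_us(small_work, 2))  # 14.0
print(fused_kernel_us(small_work, 2))      # 9.0
```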

Yes, this change is motivated by a benchmark. Below I'll paste data from a synthetic one, which uses the maximum output tensor size for this kind of fusion:

       name  duration [us]
0  f1_input      2528
1  f2_input      2432
2   f1_loop      2560
3   f2_loop      2400
4     fused      2976
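For reference, the speedup implied by this table (the two input fusions run separately vs. the single horizontally fused kernel) works out to:

```python
# Speedup from the table above: two separate input fusions
# vs. the single horizontally fused kernel.
f1_input_us = 2528
f2_input_us = 2432
fused_us = 2976

separate_us = f1_input_us + f2_input_us  # 4960
speedup = separate_us / fused_us
print(round(speedup, 2))  # 1.67
```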

input:

f1 {
  p = f16[32,2048] parameter(0)
  q = f16[96,2048] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={0}
}

f2 {
  p = f16[128,1536] parameter(0)
  q = f16[128,512] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={1}
}

e {
  p = f16[32,2048] constant({...})
  q = f16[96,2048] constant({...})
  f1_input = f16[128,2048] fusion(p, q), kind=kInput, calls=f1
  r = f16[128,1536] constant({...})
  s = f16[128,512] constant({...})
  f2_input = f16[128,2048] fusion(r, s), kind=kInput, calls=f2
  ROOT t = tuple(f1_input, f2_input)
}

loop:

f1 {
  p = f16[32,2048] parameter(0)
  q = f16[96,2048] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={0}
}

f2 {
  p = f16[128,1536] parameter(0)
  q = f16[128,512] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={1}
}

e {
  p = f16[32,2048] constant({...})
  q = f16[96,2048] constant({...})
  f1_loop = f16[128,2048] fusion(p, q), kind=kLoop, calls=f1
  r = f16[128,1536] constant({...})
  s = f16[128,512] constant({...})
  f2_loop = f16[128,2048] fusion(r, s), kind=kLoop, calls=f2
  ROOT t = tuple(f1_loop, f2_loop)
}

fused:

horizontally_fused_computation {
  param_0_0 = f16[32,2048]{1,0} parameter(0)
  param_0_1 = f16[96,2048]{1,0} parameter(1)
  c.2 = f16[128,2048]{1,0} concatenate(param_0_0, param_0_1), dimensions={0}
  param_1_0 = f16[128,1536]{1,0} parameter(2)
  param_1_1 = f16[128,512]{1,0} parameter(3)
  c.3 = f16[128,2048]{1,0} concatenate(param_1_0, param_1_1), dimensions={1}
  ROOT tuple = (f16[128,2048]{1,0}, f16[128,2048]{1,0}) tuple(c.2, c.3)
}

e {
  s = f16[128,512]{1,0} constant({...})
  r = f16[128,1536]{1,0} constant({...})
  q.2 = f16[96,2048]{1,0} constant({...})
  p.2 = f16[32,2048]{1,0} constant({...})
  fused = (f16[128,2048]{1,0}, f16[128,2048]{1,0}) fusion(p.2, q.2, r, s), kind=kLoop, calls=horizontally_fused_computation
  get-tuple-element.1 = f16[128,2048]{1,0} get-tuple-element(fused), index=1
  get-tuple-element = f16[128,2048]{1,0} get-tuple-element(fused), index=0
  bitcast = f16[128,2048]{1,0} bitcast(get-tuple-element)
  bitcast.1 = f16[128,2048]{1,0} bitcast(get-tuple-element.1)
  ROOT t = (f16[128,2048]{1,0}, f16[128,2048]{1,0}) tuple(bitcast, bitcast.1)
}

@sergachev
Contributor Author

I realized the benchmark can be improved by 1) adding some elementwise ops so that the special path for copies is not taken, and 2) testing the two paths of the horizontal loop fusion pass (same and different output shapes) separately, as they have different thresholds. With the same shapes, because of the higher threshold, the fusions can be larger; the effect of this change is still positive there, but less significant. So, to be safe, I should probably restrict this change to small fusions only.

Different shapes:

       name  duration [us]
0  f1_input      2720
1  f2_input      2400
2   f1_loop      2560
3   f2_loop      2368
4     fused      4096

Same shapes:

       name  duration [us]
0  f1_input     85319
1  f2_input     93799
2   f1_loop     85414
3   f2_loop     94280
4     fused    174286
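Summing the input-fusion rows of the two tables above, the fused kernel is still faster in both cases, but the margin for the same-shapes path is much smaller, which supports restricting the change to small fusions:

```python
# Speedups implied by the two tables above (f1_input + f2_input
# run separately vs. the single horizontally fused kernel).
diff_separate = 2720 + 2400    # different shapes
diff_fused = 4096
same_separate = 85319 + 93799  # same shapes
same_fused = 174286

print(round(diff_separate / diff_fused, 2))  # 1.25
print(round(same_separate / same_fused, 2))  # 1.03
```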

different_fused.txt
different_input.txt
different_loop.txt
same_fused.txt
same_input.txt
same_loop.txt

@sergachev sergachev force-pushed the horizontal_fusion_concat branch 2 times, most recently from 9f7ffe2 to 9990047, on November 15, 2024 at 14:48
@sergachev sergachev changed the title [GPU] Consider also kInput fusions with concatenations in the horizontal loop fusion pass. [GPU] Consider small kInput fusions with concatenations in the horizontal loop fusion pass. Nov 15, 2024
@sergachev
Contributor Author

I updated the change, please take another look.
