
[GPU] Consider small kInput fusions with concatenations in the horizontal loop fusion pass. #19372

Open · wants to merge 1 commit into main
Conversation

sergachev
Contributor

No description provided.

@akuegel
Member

akuegel commented Nov 15, 2024

Do we know this is always strictly better? Do you have a HloModule with benchmark results?
The input fusions with concatenate root use the dedicated concat emitter which can be quite a bit faster than emitting the concat as part of a loop fusion.

@sergachev
Contributor Author

The pass only fuses small fusions, whose runtime is dominated by approximately constant per-kernel overheads. Fusing these horizontally should always be better than using the special emitter for them.
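The per-kernel-overhead argument can be sketched with a toy cost model (all numbers here are illustrative assumptions, not measurements from the benchmark below): N separate small kernels pay the launch/dispatch overhead N times, while one horizontally fused kernel pays it once.

```python
# Toy cost model for horizontal fusion of small kernels.
# All numbers are illustrative assumptions, not measurements.

KERNEL_OVERHEAD_US = 5.0  # assumed ~constant per-kernel overhead

def separate_kernels_us(work_us_per_kernel, n_kernels):
    """N independent kernels: each one pays the overhead."""
    return n_kernels * (KERNEL_OVERHEAD_US + work_us_per_kernel)

def fused_kernel_us(work_us_per_kernel, n_kernels):
    """One horizontally fused kernel: the overhead is paid once."""
    return KERNEL_OVERHEAD_US + n_kernels * work_us_per_kernel

# For small fusions, where the useful work is comparable to the
# overhead, fusing wins; as the work grows, the advantage shrinks.
small_work = 2.0
print(separate_kernels_us(small_work, 2))  # 14.0
print(fused_kernel_us(small_work, 2))      # 9.0
```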

Yes, this change is motivated by a benchmark. Below I'll paste data from a synthetic one, which uses the maximum output tensor size for this kind of fusion:

       name  duration [us]
0  f1_input      2528
1  f2_input      2432
2   f1_loop      2560
3   f2_loop      2400
4     fused      2976
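For reference, the speedup implied by this table (the two input fusions run separately vs. the single horizontally fused kernel) works out to:

```python
# Speedup from the table above: two separate input fusions
# vs. the single horizontally fused kernel.
f1_input_us = 2528
f2_input_us = 2432
fused_us = 2976

separate_us = f1_input_us + f2_input_us  # 4960
speedup = separate_us / fused_us
print(round(speedup, 2))  # 1.67
```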

input:

f1 {
  p = f16[32,2048] parameter(0)
  q = f16[96,2048] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={0}
}

f2 {
  p = f16[128,1536] parameter(0)
  q = f16[128,512] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={1}
}

e {
  p = f16[32,2048] constant({...})
  q = f16[96,2048] constant({...})
  f1_input = f16[128,2048] fusion(p, q), kind=kInput, calls=f1
  r = f16[128,1536] constant({...})
  s = f16[128,512] constant({...})
  f2_input = f16[128,2048] fusion(r, s), kind=kInput, calls=f2
  ROOT t = tuple(f1_input, f2_input)
}

loop:

f1 {
  p = f16[32,2048] parameter(0)
  q = f16[96,2048] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={0}
}

f2 {
  p = f16[128,1536] parameter(0)
  q = f16[128,512] parameter(1)
  ROOT c = f16[128,2048] concatenate(p, q), dimensions={1}
}

e {
  p = f16[32,2048] constant({...})
  q = f16[96,2048] constant({...})
  f1_loop = f16[128,2048] fusion(p, q), kind=kLoop, calls=f1
  r = f16[128,1536] constant({...})
  s = f16[128,512] constant({...})
  f2_loop = f16[128,2048] fusion(r, s), kind=kLoop, calls=f2
  ROOT t = tuple(f1_loop, f2_loop)
}

fused:

horizontally_fused_computation {
  param_0_0 = f16[32,2048]{1,0} parameter(0)
  param_0_1 = f16[96,2048]{1,0} parameter(1)
  c.2 = f16[128,2048]{1,0} concatenate(param_0_0, param_0_1), dimensions={0}
  param_1_0 = f16[128,1536]{1,0} parameter(2)
  param_1_1 = f16[128,512]{1,0} parameter(3)
  c.3 = f16[128,2048]{1,0} concatenate(param_1_0, param_1_1), dimensions={1}
  ROOT tuple = (f16[128,2048]{1,0}, f16[128,2048]{1,0}) tuple(c.2, c.3)
}

e {
  s = f16[128,512]{1,0} constant({...})
  r = f16[128,1536]{1,0} constant({...})
  q.2 = f16[96,2048]{1,0} constant({...})
  p.2 = f16[32,2048]{1,0} constant({...})
  fused = (f16[128,2048]{1,0}, f16[128,2048]{1,0}) fusion(p.2, q.2, r, s), kind=kLoop, calls=horizontally_fused_computation
  get-tuple-element.1 = f16[128,2048]{1,0} get-tuple-element(fused), index=1
  get-tuple-element = f16[128,2048]{1,0} get-tuple-element(fused), index=0
  bitcast = f16[128,2048]{1,0} bitcast(get-tuple-element)
  bitcast.1 = f16[128,2048]{1,0} bitcast(get-tuple-element.1)
  ROOT t = (f16[128,2048]{1,0}, f16[128,2048]{1,0}) tuple(bitcast, bitcast.1)
}

@sergachev
Contributor Author

I realized the benchmark can be improved by 1) adding some elementwise ops so that the special path for copies is not taken, and 2) testing the two paths of the horizontal loop fusion pass (same and different output shapes) separately, as they have different thresholds. With the same shapes, because of the higher threshold, the fusions can be larger; the effect of this change is still positive there, but less significant. So, to be safe, I should probably restrict this change to small fusions only.

Different shapes:

       name  duration [us]
0  f1_input      2720
1  f2_input      2400
2   f1_loop      2560
3   f2_loop      2368
4     fused      4096

Same shapes:

       name  duration [us]
0  f1_input     85319
1  f2_input     93799
2   f1_loop     85414
3   f2_loop     94280
4     fused    174286
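Summing the input-fusion rows of the two tables above, the fused kernel is still faster in both cases, but the margin for the same-shapes path is much smaller, which supports restricting the change to small fusions:

```python
# Speedups implied by the two tables above (f1_input + f2_input
# run separately vs. the single horizontally fused kernel).
diff_separate = 2720 + 2400    # different shapes
diff_fused = 4096
same_separate = 85319 + 93799  # same shapes
same_fused = 174286

print(round(diff_separate / diff_fused, 2))  # 1.25
print(round(same_separate / same_fused, 2))  # 1.03
```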

different_fused.txt
different_input.txt
different_loop.txt
same_fused.txt
same_input.txt
same_loop.txt

@sergachev sergachev force-pushed the horizontal_fusion_concat branch 2 times, most recently from 9f7ffe2 to 9990047, on November 15, 2024 at 14:48
@sergachev sergachev changed the title [GPU] Consider also kInput fusions with concatenations in the horizontal loop fusion pass. [GPU] Consider small kInput fusions with concatenations in the horizontal loop fusion pass. Nov 15, 2024
@sergachev
Contributor Author

I updated the change, please take another look.
