JIT: Remove fallthrough restrictions in `Compiler::fgOptimizeBranch` #105625

amanasifkhalid · 2024-07-29T14:16:18Z

While investigating perf regressions from flowgraph changes, I noticed some instances where Compiler::fgOptimizeBranch was no longer kicking in due to conditional blocks not matching the implicit fallthrough constraint we previously removed (example). Removing this constraint is simple enough, and fixes the subpar block layout (and the edge likelihoods of the cloned condition) in the linked example:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             0.91       91 [000..019)-> BB03(1)                 (always)                     i IBC hascall
BB03 [0014]  2       BB01,BB10             9.09      909 [018..019)-> BB05(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB05 [0015]  2       BB03,BB08            90.91     9091 [018..019)-> BB07(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB07 [0016]  2       BB05,BB07            1000.   100000 [018..019)-> BB07(0.909),BB08(0.0909)  ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB08 [0063]  1       BB07                100.00    10000 [018..019)-> BB05(0.909),BB10(0.0909)  ( cond )                     i IBC bwd
BB10 [0064]  1       BB08                 10.00     1000 [018..019)-> BB03(0.909),BB12(0.0909)  ( cond )                     i IBC bwd
BB12 [0065]  1       BB10                  1.00      100 [018..06D)-> BB14(0.994),BB28(0.00595)   ( cond )                     i IBC bwd
BB14 [0049]  2       BB12,BB26           153.66    15366 [020..024)-> BB17(1)                 (always)                     i IBC loophead bwd
BB17 [0002]  2       BB14,BB24            1741.   174141 [024..029)-> BB19(1)                 (always)                     i IBC loophead bwd bwd-target
BB19 [0003]  2       BB17,BB22           17492.  1749182 [029..02E)-> BB21(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB21 [0004]  2       BB19,BB21             194k 19378019 [02E..050)-> BB21(0.91),BB22(0.0903) ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB22 [0059]  1       BB21                19227.  1922741 [050..05C)-> BB19(0.909),BB24(0.0906)  ( cond )                     i IBC bwd
BB24 [0060]  1       BB22                 1915.   191483 [05C..065)-> BB17(0.92),BB26(0.0802) ( cond )                     i IBC bwd
BB26 [0061]  1       BB24                167.07    16707 [065..06A)-> BB14(0.994),BB28(0.00595)   ( cond )                     i IBC bwd
BB28 [0062]  2       BB12,BB26             0.91       91 [06D..06E)-> BB31(1)                 (always)                     i IBC
BB31 [0025]  2       BB28,BB39             9.09      909 [06D..06E)-> BB33(1)                 (always)                     i IBC loophead bwd bwd-target
BB33 [0026]  2       BB31,BB37            90.91     9091 [06D..06E)-> BB35(1)                 (always)                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB35 [0027]  2       BB33,BB36            1000.   100000 [06D..06E)-> BB36(1),BB43(0)         ( cond )                     i IBC loophead mdidxlen bwd bwd-target mdarr
BB36 [0029]  1       BB35                 1000.   100000 [06D..06E)-> BB35(0.909),BB37(0.0909)  ( cond )                     i IBC bwd
BB37 [0056]  1       BB36                100.00    10000 [06D..06E)-> BB33(0.909),BB39(0.0909)  ( cond )                     i IBC bwd
BB39 [0057]  1       BB37                 10.00     1000 [06D..06E)-> BB31(0.909),BB41(0.0909)  ( cond )                     i IBC bwd
BB41 [0058]  1       BB39                  1.00      100 [06D..06E)-> BB44(1)                 (always)                     i IBC
BB44 [0036]  2       BB41,BB43             1.00      100 [06D..077)                           (return)                     i IBC
BB43 [0028]  1       BB35                  0           0 [06D..06E)-> BB44(1)                 (always)                     i IBC rare
BB45 [0066]  0                             0             [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

I was hoping this would improve the layout of some of the regressed methods in #103972, though we seem to hit the cost limit for cloning quite frequently. For example, from System.Linq.Tests.Perf_Enumerable.Aggregate_Seed(input: IEnumerable):

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      19852 [000..003)-> BB03(1),BB02(0)         ( cond )                     i IBC
BB02 [0001]  1       BB01                  0          0 [003..00A)                           (throw )                     i IBC rare hascall gcsafe
BB03 [0002]  1       BB01                  1      19852 [00A..00D)-> BB05(1),BB04(0)         ( cond )                     i IBC
BB04 [0003]  1       BB03                  0          0 [00D..013)                           (throw )                     i IBC rare hascall gcsafe
BB05 [0004]  1       BB03                  1      19852 [013..015)-> BB07(0),BB06(1)         ( cond )                     i IBC
BB06 [0016]  1       BB05                  1      19852 [???..???)-> BB08(1)                 (always)                     i IBC internal hascall gcsafe
BB07 [0017]  1       BB05                  0          0 [???..???)-> BB08(1)                 (always)                     i IBC rare internal hascall gcsafe
BB08 [0015]  2       BB06,BB07             1      19852 [015..01C)-> BB09(1)                 (always)                     i IBC internal
BB09 [0005]  2  0    BB08,BB17           100.54 1995916 [01C..02E)-> BB21(0),BB18(1)         ( cond ) T0      try {       i IBC keep bwd
BB10 [0006]  2  0    BB19,BB21            99.54 1976064 [01E..01E)-> BB14(0),BB11(1)         ( cond ) T0                  i IBC bwd bwd-target
BB11 [0019]  1  0    BB10                 99.54 1976064 [01E..01F)-> BB13(1),BB12(0)         ( cond ) T0                  i IBC bwd
BB12 [0032]  1  0    BB11                  0          0 [01E..01F)                           (throw ) T0                  i IBC rare hascall gcsafe bwd
BB13 [0033]  1  0    BB11                 99.54 1976064 [01E..025)-> BB16(0),BB15(1)         ( cond ) T0                  i IBC idxlen bwd
BB14 [0020]  1  0    BB10                  0          0 [???..???)-> BB16(1)                 (always) T0                  i IBC rare internal hascall gcsafe bwd
BB15 [0022]  1  0    BB13                 99.54 1976064 [???..???)-> BB17(1)                 (always) T0                  i IBC internal bwd
BB16 [0023]  2  0    BB13,BB14             0          0 [???..???)-> BB17(1)                 (always) T0                  i IBC rare internal hascall gcsafe bwd
BB17 [0021]  2  0    BB15,BB16            99.54 1976064 [025..02E)-> BB09(1)                 (always) T0                  i IBC internal bwd
BB18 [0025]  1  0    BB09                100.54 1995916 [02E..02F)-> BB20(0.00998),BB19(0.99)  ( cond ) T0                  i IBC bwd
BB19 [0037]  1  0    BB18                 99.54 1975990 [02E..02F)-> BB10(1)                 (always) T0                  i IBC bwd
BB20 [0038]  1  0    BB18                  1.00   19926 [02E..02F)-> BB23(1)                 (always) T0                  i IBC bwd
BB21 [0026]  1  0    BB09                  0          0 [02E..036)-> BB10(0.99),BB23(0.00995)  ( cond ) T0      }           i IBC rare internal hascall bwd
BB23 [0041]  2       BB20,BB21             1.00   19772 [038..03B)-> BB27(1),BB26(0)         ( cond )                     i IBC keep cfb
BB26 [0044]  1       BB23                  0          0 [???..???)-> BB27(1)                 (always)                     i IBC rare internal hascall gcsafe
BB27 [0045]  2       BB23,BB26             1.00   19852 [041..044)                           (return)                     i IBC
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB28 [0009]  1     0                       0.00       0 [038..03B)-> BB32(0.00402),BB29(0.996)   ( cond )    H0 F fault {     i IBC keep flet
BB29 [0010]  1     0 BB28                  0.00       0 [03B..???)-> BB32(1),BB31(0)         ( cond )    H0               i IBC
BB31 [0029]  1     0 BB29                  0          0 [???..???)-> BB32(1)                 (always)    H0               i IBC rare internal hascall gcsafe
BB32 [0027]  3     0 BB28,BB29,BB31        0.00       0 [041..042)                           (falret)    H0   }           i IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------


Duplication of the conditional block BB09 (always branch from BB17) not done, because the cost of duplication (17) is greater than 6, validProfileWeights = true

I wonder if we should increase the cost limit of fgOptimizeBranch to match that of loop inversion, since they seem to be performing similar transformations, though we can consider that for a follow-up PR.

amanasifkhalid · 2024-07-29T16:28:58Z

It's worth noting that we only call fgOptimizeBranch from fgReorderBlocks on the old layout path, and that only runs in optOptimizeFlow, where we aren't actually reordering blocks. So if we decide to get rid of premature calls to fgReorderBlocks (like in #104213), we'll want to keep running fgOptimizeBranch -- it seems relevant for enabling fallthrough into/out of loops.

amanasifkhalid · 2024-07-29T18:23:40Z

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs. The largest size increases are due to more loop cloning (especially in libraries_tests), whereas the smaller sporadic size increases are from duplicating conditions -- this seems worthwhile, since we can reduce loop iteration logic to one backward conditional jump, rather than a backward unconditional jump to another conditional jump.

amanasifkhalid · 2024-08-15T14:42:46Z

@AndyAyersMS updated diffs look roughly the same. Is it alright if we start taking flowgraph changes for .NET 10?

AndyAyersMS · 2024-08-15T16:55:22Z

@AndyAyersMS updated diffs look roughly the same. Is it alright if we start taking flowgraph changes for .NET 10?

I think we may want to wait a little while longer? We can discuss this when we meet later today.

amanasifkhalid · 2024-09-10T20:31:39Z

It seems suboptimal that we run fgOptimizeBranch before layout, and the optimization is dependent on knowing if the BBJ_ALWAYS block falls into one of the BBJ_COND block's successors. It would be convenient if we could run this as an intermediate step of the new block layout, but since we intend to run the latter after lowering/LSRA, and fgOptimizeBranch operates on HIR and clones code, that doesn't seem like a possibility. Instead, I wonder if it would be worthwhile to iterate the blocklist in RPO during a late flow opt pass, and run fgOptimizeBranch for BBJ_ALWAYS candidates that aren't immediately succeeded by their BBJ_COND target, implying that cloning the condition would enable more fallthrough. The RPO traversal probably isn't resistant to changes in flow, but I don't think that's a problem for fgOptimizeBranch since we would only need to visit each candidate BBJ_ALWAYS once.

Generalize fgOptimizeBranch

0407f6c

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 29, 2024

dotnet-policy-service bot assigned amanasifkhalid Jul 29, 2024

amanasifkhalid added this to the 10.0.0 milestone Aug 12, 2024

Merge branch 'main' into cond-branch-opt

18fa8fb

Merge branch 'main' into cond-branch-opt

57db653

Merge branch 'main' into cond-branch-opt

b0638f8

build-analysis bot mentioned this pull request Nov 29, 2024

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Remove fallthrough restrictions in `Compiler::fgOptimizeBranch` #105625

JIT: Remove fallthrough restrictions in `Compiler::fgOptimizeBranch` #105625

amanasifkhalid commented Jul 29, 2024

amanasifkhalid commented Jul 29, 2024 •

edited

Loading

amanasifkhalid commented Jul 29, 2024 •

edited

Loading

amanasifkhalid commented Aug 15, 2024

AndyAyersMS commented Aug 15, 2024

amanasifkhalid commented Sep 10, 2024

JIT: Remove fallthrough restrictions in Compiler::fgOptimizeBranch #105625

Are you sure you want to change the base?

JIT: Remove fallthrough restrictions in Compiler::fgOptimizeBranch #105625

Conversation

amanasifkhalid commented Jul 29, 2024

amanasifkhalid commented Jul 29, 2024 • edited Loading

amanasifkhalid commented Jul 29, 2024 • edited Loading

amanasifkhalid commented Aug 15, 2024

AndyAyersMS commented Aug 15, 2024

amanasifkhalid commented Sep 10, 2024

JIT: Remove fallthrough restrictions in `Compiler::fgOptimizeBranch` #105625

JIT: Remove fallthrough restrictions in `Compiler::fgOptimizeBranch` #105625

amanasifkhalid commented Jul 29, 2024 •

edited

Loading

amanasifkhalid commented Jul 29, 2024 •

edited

Loading