JIT: Implement greedy RPO-based block layout #101473

Merged · 17 commits · May 2, 2024

Conversation

@amanasifkhalid (Member) commented Apr 24, 2024:

Part of #93020. Compiler::fgDoReversePostOrderLayout reorders blocks based on an RPO of the flowgraph's successor edges. When reordering based on the RPO, we only reorder blocks within the same EH region to avoid breaking their contiguity. After establishing an RPO-based layout, we do another pass, fgMoveColdBlocks, to move cold blocks to the ends of their regions.

The "greedy" part of this layout isn't all that greedy just yet. For now, we use edge likelihoods to make placement decisions only for BBJ_COND blocks' successors. I plan to extend this greediness to other multi-successor block kinds (BBJ_SWITCH, etc.) in a follow-up so we can evaluate its value independently.

This new layout is disabled by default for now.
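
To make the mechanism concrete, here is a minimal, self-contained sketch of the core idea (toy types and an illustrative graph, not the JIT's actual BasicBlock/FlowEdge code): a post-order DFS that visits less-likely successors first, so that reversing the post-order places each block's likeliest successor immediately after it.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Block;

struct Edge
{
    Block* target;
    double likelihood; // profile-derived probability of taking this edge
};

struct Block
{
    int               num;
    std::vector<Edge> succs;
    bool              visited = false;
};

// Visit the LESS likely successors first: they get earlier post-order
// positions, which become LATER positions once the order is reversed.
static void dfs(Block* block, std::vector<Block*>& postOrder)
{
    block->visited = true;
    std::vector<Edge> ordered = block->succs;
    std::sort(ordered.begin(), ordered.end(),
              [](const Edge& a, const Edge& b) { return a.likelihood < b.likelihood; });
    for (const Edge& e : ordered)
    {
        if (!e.target->visited)
        {
            dfs(e.target, postOrder);
        }
    }
    postOrder.push_back(block);
}

int main()
{
    // BB02 branches to BB04 (10% likely) or BB03 (90% likely); both reach BB05.
    Block bb5{5, {}};
    Block bb4{4, {{&bb5, 1.0}}};
    Block bb3{3, {{&bb5, 1.0}}};
    Block bb2{2, {{&bb4, 0.1}, {&bb3, 0.9}}};

    std::vector<Block*> postOrder;
    dfs(&bb2, postOrder);

    // Prints "BB02 BB03 BB04 BB05": the likely successor BB03 falls
    // directly after BB02 in the reverse post-order.
    for (auto it = postOrder.rbegin(); it != postOrder.rend(); ++it)
    {
        printf("BB%02d ", (*it)->num);
    }
    printf("\n");
    return 0;
}
```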

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label Apr 24, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid marked this pull request as ready for review April 26, 2024 22:32
@amanasifkhalid (Member Author):

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs are big, though libraries_tests seems to be inflating the numbers. Improvements and regressions seem to be universally larger on x64 than on arm64 -- I assume this is because many short branches become long on x64, many long branches become short, etc. Also, some of the arm64 example diffs suggest we ought to extend our emitter-level branch removal optimization beyond x64: I see some branches we should be able to get rid of.

In my previous iterations of this PR, I wanted the block order to mirror the RPO as closely as possible, and to repair try regions as necessary afterwards. I got this approach working when using funclets, but Windows x86's EH model complicated this by allowing handlers in the main method body too. I tried iterating on my initial implementation to handle the non-funclet EH case, but I was approaching two separate solutions, which didn't seem ideal -- plus, all of these local fixups are exactly what we're trying to get away from in the old layout approach. The diffs don't seem any more or less extreme when reordering is restricted to within EH regions, and the code is quite a bit easier to follow.

@AndyAyersMS (Member) commented Apr 27, 2024:

Seems likely that some significant part of the size improvement is from not moving cold blocks out of line (not splitting, per se, just separating within the method or region). I think we may want this part to be implemented before we start running in the lab. It shouldn't be too hard to implement something like this: starting at the end of each EH region, move sufficiently cold blocks to the end, keeping them in relative RPO order.

I'm ok taking that as a follow up, if you think it makes sense. Though if it is indeed that simple it would not complicate this PR too much, and this PR is delightfully simple compared to the code it will replace.

You may recall that in the example code I referred you to as a model, there are some exceptions to this rule, and perhaps some thresholds to consider -- for instance, sometimes we might not want to move very small cold blocks that have hot pred and succ blocks, if we're fairly confident in the data, assuming we'd end up with denser code by keeping the branch short even if that means having a (well-predicted) taken branch on the hot path.
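
A minimal sketch of the pass suggested above, under the assumption that a region's blocks are already in RPO order and each carries a normalized weight (the threshold value is illustrative, not the JIT's):

```cpp
#include <algorithm>
#include <vector>

struct Block
{
    int    num;
    double weight; // normalized execution count
};

// Sink "sufficiently cold" blocks to the end of the region's sequence.
// std::stable_partition preserves relative order within both the hot and
// cold groups, which is exactly the "keep them in relative RPO order"
// requirement.
void moveColdBlocksToEnd(std::vector<Block>& region, double coldThreshold = 0.01)
{
    std::stable_partition(region.begin(), region.end(),
                          [=](const Block& b) { return b.weight >= coldThreshold; });
}
```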

@jakobbotsch (Member) commented Apr 27, 2024:

Can you explain a bit further why you want the regular successor DFS? It saddens me a bit to see it return after the work I did last year to move everything to the faithful DFS. When I did that work I didn't find any justified uses of the old DFS.

For this case, why can't we run the faithful DFS (say, with your modifications to change the visit order) and then partition the RPO into multiple orders by EH region? I don't quite understand why we need to give up on reordering all EH regions (or regions in the main body that are dominated by an EH region).

@amanasifkhalid (Member Author):

> Can you explain a bit further why you want the regular successor DFS?

My initial intent was to do a DFS of all successors without worrying about EH regions at all, and then evaluate how involved the needed EH fixups would be. Allowing the funclets to be broken up by the RPO complicated the post-ordering adjustments: we'd have to take care not to make handler regions non-contiguous while fixing try regions, and vice versa. It seemed easier, and not too consequential for perf, to just skip EH successors altogether, though the lack of funclets on Windows x86 made this problem unavoidable. My attempts to appease win-x86 EH semantics were a distraction from the fact that I could just reorder within EH regions and not break anything up. I've pushed an implementation that tries that out using a DFS of all successors.

> Seems likely that some significant part of the size improvement is from not moving cold blocks out of line (not splitting, per se, just separating within the method or region). I think we may want this part to be implemented before we start running in the lab.

I'll give that a try in my next push. I'll wait for the latest CI run to finish so we can see the updated diffs.

@amanasifkhalid (Member Author):

@AndyAyersMS I tried renumbering the blocks after finalizing the layout locally, and the diffs are nontrivial, though they don't lean overwhelmingly in either direction. Here they are on win-x64:

Overall (-123,813 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
| --- | ---: | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 8,641,509 | -95 | +0.03% |
| benchmarks.run_pgo.windows.x64.checked.mch | 33,740,443 | -26,237 | -0.17% |
| benchmarks.run_tiered.windows.x64.checked.mch | 12,234,066 | -2,303 | +0.07% |
| coreclr_tests.run.windows.x64.checked.mch | 402,587,216 | +41,538 | +0.06% |
| libraries.crossgen2.windows.x64.checked.mch | 45,403,126 | -353 | +0.06% |
| libraries.pmi.windows.x64.checked.mch | 62,248,989 | -17,288 | +0.14% |
| libraries_tests.run.windows.x64.Release.mch | 280,210,097 | -104,800 | -0.08% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 136,799,480 | -11,192 | -0.03% |
| realworld.run.windows.x64.checked.mch | 13,554,143 | -3,905 | +0.08% |
| smoke_tests.nativeaot.windows.x64.checked.mch | 5,016,576 | +822 | +0.38% |
FullOpts (-123,813 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
| --- | ---: | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 8,641,148 | -95 | +0.03% |
| benchmarks.run_pgo.windows.x64.checked.mch | 20,302,793 | -26,237 | -0.17% |
| benchmarks.run_tiered.windows.x64.checked.mch | 3,063,938 | -2,303 | +0.07% |
| coreclr_tests.run.windows.x64.checked.mch | 122,068,398 | +41,538 | +0.06% |
| libraries.crossgen2.windows.x64.checked.mch | 45,401,740 | -353 | +0.06% |
| libraries.pmi.windows.x64.checked.mch | 62,135,488 | -17,288 | +0.14% |
| libraries_tests.run.windows.x64.Release.mch | 105,590,714 | -104,800 | -0.08% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 126,496,507 | -11,192 | -0.03% |
| realworld.run.windows.x64.checked.mch | 13,148,501 | -3,905 | +0.08% |
| smoke_tests.nativeaot.windows.x64.checked.mch | 5,015,585 | +822 | +0.38% |
Details

Size improvements/regressions per collection

| Collection | Contexts with diffs | Improvements | Regressions | Same size | Improvements (bytes) | Regressions (bytes) |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 8,277 | 2,678 | 2,051 | 3,548 | -19,743 | +19,648 |
| benchmarks.run_pgo.windows.x64.checked.mch | 14,470 | 6,546 | 3,062 | 4,862 | -48,223 | +21,986 |
| benchmarks.run_tiered.windows.x64.checked.mch | 3,989 | 1,164 | 725 | 2,100 | -7,177 | +4,874 |
| coreclr_tests.run.windows.x64.checked.mch | 50,530 | 15,285 | 25,649 | 9,596 | -101,366 | +142,904 |
| libraries.crossgen2.windows.x64.checked.mch | 32,542 | 8,501 | 8,938 | 15,103 | -55,666 | +55,313 |
| libraries.pmi.windows.x64.checked.mch | 47,037 | 15,221 | 13,968 | 17,848 | -115,545 | +98,257 |
| libraries_tests.run.windows.x64.Release.mch | 64,836 | 23,951 | 16,870 | 24,015 | -264,568 | +159,768 |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 66,421 | 19,152 | 22,477 | 24,792 | -153,112 | +141,920 |
| realworld.run.windows.x64.checked.mch | 9,577 | 3,105 | 3,240 | 3,232 | -31,797 | +27,892 |
| smoke_tests.nativeaot.windows.x64.checked.mch | 5,544 | 1,829 | 1,816 | 1,899 | -12,544 | +13,366 |
| Total | 303,223 | 97,432 | 98,796 | 106,995 | -809,741 | +685,928 |

PerfScore improvements/regressions per collection

| Collection | Contexts with diffs | Improvements | Regressions | Same PerfScore | Improvements (PerfScore) | Regressions (PerfScore) | PerfScore Overall in FullOpts |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 8,277 | 1,563 | 1,646 | 5,068 | -1.61% | +1.69% | +0.0082% |
| benchmarks.run_pgo.windows.x64.checked.mch | 14,470 | 3,354 | 1,841 | 9,275 | -1.37% | +1.20% | -0.0518% |
| benchmarks.run_tiered.windows.x64.checked.mch | 3,989 | 577 | 493 | 2,919 | -1.69% | +2.56% | +0.0163% |
| coreclr_tests.run.windows.x64.checked.mch | 50,530 | 5,662 | 23,029 | 21,839 | -1.68% | +0.54% | +0.0121% |
| libraries.crossgen2.windows.x64.checked.mch | 32,542 | 6,305 | 7,634 | 18,603 | -1.91% | +1.85% | +0.0066% |
| libraries.pmi.windows.x64.checked.mch | 47,037 | 11,118 | 12,760 | 23,159 | -1.74% | +2.09% | +0.0216% |
| libraries_tests.run.windows.x64.Release.mch | 64,836 | 15,474 | 10,501 | 38,861 | -1.34% | +1.51% | -0.0262% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 66,421 | 17,299 | 15,269 | 33,853 | -1.21% | +1.26% | -0.0067% |
| realworld.run.windows.x64.checked.mch | 9,577 | 2,293 | 2,749 | 4,535 | -1.37% | +1.45% | +0.0236% |
| smoke_tests.nativeaot.windows.x64.checked.mch | 5,544 | 1,217 | 1,614 | 2,713 | -2.76% | +3.48% | +0.0664% |

Context information

| Collection | Diffed contexts | MinOpts | FullOpts | Missed, base | Missed, diff |
| --- | ---: | ---: | ---: | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 27,811 | 4 | 27,807 | 2 (0.01%) | 2 (0.01%) |
| benchmarks.run_pgo.windows.x64.checked.mch | 94,917 | 48,136 | 46,781 | 14 (0.01%) | 14 (0.01%) |
| benchmarks.run_tiered.windows.x64.checked.mch | 53,445 | 37,214 | 16,231 | 9 (0.02%) | 9 (0.02%) |
| coreclr_tests.run.windows.x64.checked.mch | 583,206 | 347,435 | 235,771 | 50 (0.01%) | 50 (0.01%) |
| libraries.crossgen2.windows.x64.checked.mch | 277,538 | 15 | 277,523 | 31 (0.01%) | 31 (0.01%) |
| libraries.pmi.windows.x64.checked.mch | 313,896 | 6 | 313,890 | 156 (0.05%) | 156 (0.05%) |
| libraries_tests.run.windows.x64.Release.mch | 677,126 | 482,477 | 194,649 | 66 (0.01%) | 66 (0.01%) |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 320,901 | 21,949 | 298,952 | 31 (0.01%) | 31 (0.01%) |
| realworld.run.windows.x64.checked.mch | 34,064 | 3 | 34,061 | 0 (0.00%) | 0 (0.00%) |
| smoke_tests.nativeaot.windows.x64.checked.mch | 31,814 | 12 | 31,802 | 18 (0.06%) | 18 (0.06%) |
| Total | 2,414,718 | 937,251 | 1,477,467 | 377 (0.02%) | 377 (0.02%) |

@amanasifkhalid (Member Author) commented Apr 30, 2024:

Also, the latest SPMI run shows the ratio of size improvements to regressions decreasing, though we're missing some collections.

CI failures are #101721 and #101685.

@AndyAyersMS (Member) left a comment:

Overall I still really like the shape of this -- a bit concerned with the complexity of maintaining the EH invariants, but we can live with it.

Left you a few comments to consider.

{
RETURN_ON_ABORT(func(GetFalseTarget()));
}
else if (useProfile && (GetTrueEdge()->getLikelihood() < GetFalseEdge()->getLikelihood()))
Reviewer (Member):

Am I confused, or is this visiting the less likely successor first?

Suggested change:
- else if (useProfile && (GetTrueEdge()->getLikelihood() < GetFalseEdge()->getLikelihood()))
+ else if (useProfile && (GetTrueEdge()->getLikelihood() > GetFalseEdge()->getLikelihood()))

@amanasifkhalid (Member Author):

It looks unintuitive, but I think we need to flip the comparison so the DFS is in the order we want. Consider the following block list, pre-ordering:

BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014]  1                             1        [???..???)-> BB02(1)                 (always)                     i keep internal
BB02 [0000]  1       BB01                  1    100 [000..00E)-> BB04(0),BB03(1)         ( cond )                     i IBC
BB03 [0010]  1       BB02                  0.50  50 [00D..00E)-> BB05(1)                 (always)                     i IBC nullcheck
BB04 [0011]  1       BB02                  0      0 [00D..00E)-> BB05(1)                 (always)                     i IBC rare
BB05 [0012]  2       BB03,BB04             1        [00D..017)                           (return)                     i hascall gcsafe

When processing BB02, if we visit BB03 before BB04, then we end up with a post-order that looks something like [<BB03's successors>, BB03, BB04, BB02, ...] -- reversing it places BB04 before BB03, so after layout, we get this:

BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014]  1                             1        [???..???)-> BB02(1)                 (always)                     i keep internal
BB02 [0000]  1       BB01                  1    100 [000..00E)-> BB04(0),BB03(1)         ( cond )                     i IBC
BB04 [0011]  1       BB02                  0      0 [00D..00E)-> BB05(1)                 (always)                     i IBC rare
BB03 [0010]  1       BB02                  0.50  50 [00D..00E)-> BB05(1)                 (always)                     i IBC nullcheck
BB05 [0012]  2       BB03,BB04             1        [00D..017)                           (return)                     i hascall gcsafe

If we instead visit the less likely successor (BB04) first, we push the more likely successor BB03 up next to BB02 in the RPO, and get this layout:

BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014]  1                             1        [???..???)-> BB02(1)                 (always)                     i keep internal
BB02 [0000]  1       BB01                  1    100 [000..00E)-> BB04(0),BB03(1)         ( cond )                     i IBC
BB03 [0010]  1       BB02                  0.50  50 [00D..00E)-> BB05(1)                 (always)                     i IBC nullcheck
BB04 [0011]  1       BB02                  0      0 [00D..00E)-> BB05(1)                 (always)                     i IBC rare
BB05 [0012]  2       BB03,BB04             1        [00D..017)                           (return)                     i hascall gcsafe

Reviewer (Member):

Yeah, I think because we're going to form an RPO, visiting the less-likely successor first is correct. If we wanted to view the depth-first spanning tree as a pseudo maximum-weight tree, then we'd do it the other way around.

Can you add a comment here?

@amanasifkhalid (Member Author):

Sure thing.
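
For reference, the requested comment might read something like this (a paraphrased sketch; the wording actually merged may differ):

```cpp
else if (useProfile && (GetTrueEdge()->getLikelihood() < GetFalseEdge()->getLikelihood()))
{
    // When building an RPO-based layout, visit the LESS likely successor
    // first: it finishes earlier in the post-order, which is LATER in the
    // reversed order, so the likelier successor ends up adjacent to this
    // block. Here the true edge is the less likely one.
    RETURN_ON_ABORT(func(GetTrueTarget()));
    RETURN_ON_ABORT(func(GetFalseTarget()));
}
```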

Reviewer (Member):

This logic seems very specialized to block layout, and tied into how the DFS traversal works to produce the result it wants, which makes it a bit odd for it to live in this very general utility function. I'm ok with this for now, but if we end up with even more logic to handle other cases (like BBJ_SWITCH), then I'd suggest introducing a separate version of the visitor that lives next to the block layout code. That would save a bit of throughput as well, since right now everyone is paying for this useProfile check.

@amanasifkhalid (Member Author):

Agreed, I'll fix this in a follow-up PR. As for what the new abstraction should look like, would you prefer we move the useProfile check into AllSuccessorEnumerator, or even introduce a new enumerator like ProfileGuidedSuccessorEnumerator?

@jakobbotsch (Member) commented May 3, 2024:

Perhaps introduce two instance initializer methods on AllSuccessorEnumerator and then pass some factory method to fgRunDfs? E.g. the normal use would be:

fgRunDfs([](SuccessorEnumerator* enumerator, BasicBlock* block) { enumerator->InitializeAllSuccs(block); }, ...);

and the block layout use could be

fgRunDfs([](SuccessorEnumerator* enumerator, BasicBlock* block) { enumerator->InitializeAllSuccsForBlockLayout(block); }, ...);
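
A self-contained sketch of that shape, using toy stand-ins for the JIT types (the real fgRunDfs/SuccessorEnumerator signatures differ):

```cpp
#include <functional>
#include <unordered_set>
#include <vector>

struct Block
{
    int                 id;
    std::vector<Block*> succs;   // ordinary flow successors
    std::vector<Block*> ehSuccs; // successors reachable only via exceptional flow
};

using SuccSource = std::function<std::vector<Block*>(Block*)>;

// Post-order DFS parameterized by how each block's successors are produced,
// mirroring the factory-method suggestion above.
static void runDfs(Block* b, const SuccSource& getSuccs,
                   std::unordered_set<Block*>& seen, std::vector<Block*>& postOrder)
{
    seen.insert(b);
    for (Block* succ : getSuccs(b))
    {
        if (seen.count(succ) == 0)
        {
            runDfs(succ, getSuccs, seen, postOrder);
        }
    }
    postOrder.push_back(b);
}

// "Normal" use enumerates all successors; a layout-specific caller could
// instead pass a lambda that filters or reorders them by profile weight.
void runAllSuccsDfs(Block* entry, std::vector<Block*>& postOrder)
{
    std::unordered_set<Block*> seen;
    runDfs(entry,
           [](Block* b) {
               std::vector<Block*> all = b->succs;
               all.insert(all.end(), b->ehSuccs.begin(), b->ehSuccs.end());
               return all;
           },
           seen, postOrder);
}
```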

fgDoReversePostOrderLayout();
fgMoveColdBlocks();

#ifdef DEBUG
Reviewer (Member):

You can just remove this whole #ifdef DEBUG bit; we will check the BB list at the end of the phase, which is imminent.

}
#endif // DEBUG

return true;
Reviewer (Member):

I assume it's unlikely that we won't move blocks here? The other variant tries to keep track, but I'm not sure it is worth the extra bookkeeping. All we are doing by sometimes returning false is speeding up the checked jit a tiny bit.

@amanasifkhalid (Member Author):

That was my reasoning for hard-coding the return value.

}
#endif // DEBUG

// Compute DFS of main function body's blocks, using profile data to determine the order successors are visited in
Reviewer (Member):

This is now a DFS of everything, right? Not just the main function?

@amanasifkhalid (Member Author):

Yes, updated.

return;
}

// The RPO will break up call-finally pairs, so save them before re-ordering
Reviewer (Member):

Seems plausible that this map could be built up during the pass below, since the pair tail should follow the pair head in the RPO, so when we encounter the pair head neither block should have moved yet.

@amanasifkhalid (Member Author):

Thanks for pointing this out; this seems to work locally. Just to make sure I don't have to add an edge case for this, we would never expect the RPO traversal to begin with a call-finally pair, right?

@amanasifkhalid (Member Author) commented May 1, 2024:

This approach hit only one failure in a SPMI replay. I'm digging into it locally, but for now I've adjusted my approach to (hopefully) be less expensive. With the current approach, we need to default-initialize entries in the regions ArrayStack for each EH region, so if we're going to loop over every EH clause anyway, maybe this approach is cheaper in most cases?

Reviewer (Member):

Yeah, I like this new approach of just walking the EH table.

In principle the call-finally pair tail is dominated by the call-finally pair head. However, depending on how we do the DFS, we may see "infeasible cross flow": say we have a finally with two callfinallies; it should not be possible to call via one callfinally and return to the other's tail, but it may look feasible in the flow graph.
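
To illustrate the EH-table walk being discussed, a toy sketch (simplified stand-in types; in the JIT the pair's tail is the block that lexically follows the BBJ_CALLFINALLY head):

```cpp
#include <vector>

enum class Kind { CallFinally, Other };

struct Block
{
    Kind                kind = Kind::Other;
    Block*              next = nullptr; // lexical successor; a pair's tail follows its head
    std::vector<Block*> preds;
};

struct EHRegion
{
    bool   isFinally;
    Block* handlerEntry;
};

struct CallFinallyPair { Block* head; Block* tail; };

// Every call-finally head is a predecessor of some finally's entry block,
// so one walk over the EH table finds all pairs before any block moves.
std::vector<CallFinallyPair> collectCallFinallyPairs(const std::vector<EHRegion>& ehTable)
{
    std::vector<CallFinallyPair> pairs;
    for (const EHRegion& eh : ehTable)
    {
        if (!eh.isFinally)
        {
            continue;
        }
        for (Block* pred : eh.handlerEntry->preds)
        {
            if (pred->kind == Kind::CallFinally)
            {
                pairs.push_back({pred, pred->next});
            }
        }
    }
    return pairs;
}
```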

}

//-----------------------------------------------------------------------------
// fgMoveColdBlocks: Move rarely-run blocks to the end of their respective regions.
Reviewer (Member):

Am curious why this doesn't need to update the try region ends?

Since this is finding the last block in the region and then inserting before it, it seems like it might leave one hot block at the end?

@amanasifkhalid (Member Author):

Yeah, I have the same reservations about this approach; Phoenix does something similar, though the last block in a region seems to have a different semantic meaning there. I could just insert the cold blocks after the last block in the region to avoid potentially pushing a hot exit block to the end, though that might introduce back edges on the cold path -- is that ok?

Reviewer (Member):

Phoenix had various "end region" constructs so yes it was different.

@amanasifkhalid (Member Author):

Got it, I've adjusted my approach to insert cold blocks after the end of the region, and then move the insertion point to the end as well, but only if it is cold.

return false;
}

// Don't move any part of a call-finally pair -- these need to stay together
Reviewer (Member):

Seems like we could handle these by skipping the pair tail (like you do now) and then moving both if we want to move the pair head?

// Instead, set the end of the region to the BBJ_CALLFINALLY block in the pair.
// Also, don't consider blocks in handler regions.
// (If all of some try region X is enclosed in an exception handler,
// lastColdTryBlocks[X] will be null. We will handle this case later.)
Reviewer (Member):

Maybe be a bit more specific?

Suggested change:
- // lastColdTryBlocks[X] will be null. We will handle this case later.)
+ // lastColdTryBlocks[X] will be null. We will check for this case in tryMovingBlocks below.)

// (for example, by pushing throw blocks unreachable via normal flow to the end of the region).
// First, determine the new EH region ends.
//
BasicBlock** const tryRegionEnds = new (this, CMK_Generic) BasicBlock* [compHndBBtabCount] {};
Reviewer (Member):

Consider using ArrayStack<struct> for this data (where struct contains your 3 data items). It will usually be able to avoid heap allocation.
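
The suggested shape might look roughly like this; the field names beyond tryRegionEnd are guesses based on the surrounding discussion, and the constructor call assumes ArrayStack's allocator-taking form:

```cpp
struct EHLayoutInfo
{
    BasicBlock* tryRegionEnd       = nullptr;
    BasicBlock* hndRegionEnd       = nullptr; // assumed second item
    BasicBlock* coldInsertionPoint = nullptr; // assumed third item
};

// One entry per EH clause; per the comment above, ArrayStack can usually
// keep small arrays off the heap.
ArrayStack<EHLayoutInfo> regions(getAllocator(CMK_Generic), (int)compHndBBtabCount);
```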

@amanasifkhalid (Member Author):

@AndyAyersMS thank you for the review! Diffs are more conservative now that we're moving cold blocks, though still pretty big.

Since we have TP to spare, do you think we should renumber blocks post-layout in this PR? Or maybe in a follow-up?

@AndyAyersMS (Member):

> Since we have TP to spare, do you think we should renumber blocks post-layout in this PR?

If we're going to do it eventually, we might as well do it now, one less churn event to sort through.

@AndyAyersMS (Member) left a comment:

Overall this looks good to me.

It's a bit sad that we can't treat the "not in try" region symmetrically with the try regions, but I think having a bit of duplicated logic is ok and we can sort this out later if we think it matters.

I would give @jakobbotsch a chance to sign off too.

@@ -754,6 +754,9 @@ RELEASE_CONFIG_INTEGER(JitEnablePhysicalPromotion, W("JitEnablePhysicalPromotion
// Enable cross-block local assertion prop
RELEASE_CONFIG_INTEGER(JitEnableCrossBlockLocalAssertionProp, W("JitEnableCrossBlockLocalAssertionProp"), 1)

// Do greedy RPO-based layout in Compiler::fgReorderBlocks.
RELEASE_CONFIG_INTEGER(JitDoReversePostOrderLayout, W("JitDoReversePostOrderLayout"), 1);
Reviewer (Member):

Make sure to remember to undo this before merging.

@amanasifkhalid (Member Author):

Got it. I'm going to do one more CI run with block renumbering to compare the diffs. I'll mark the PR as no-merge for now to remind myself.


@amanasifkhalid added the NO-MERGE label May 2, 2024
@amanasifkhalid requested a review from jakobbotsch May 2, 2024 02:45
@amanasifkhalid (Member Author):

Here are the diffs with block renumbering. Size improvements overall are slightly bigger, and TP is slightly improved, too.

@amanasifkhalid removed the NO-MERGE label May 2, 2024

// Fix up call-finally pairs
//
for (BlockToBlockMap::Node* const iter : BlockToBlockMap::KeyValueIteration(&callFinallyPairs))
Reviewer (Member):

Doesn't seem like we actually use callFinallyPairs as a map, so it could presumably just be an ArrayStack of pairs.
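
A flat pair stack, per the suggestion (the CallFinallyPair name and fields are assumed here, not the merged code):

```cpp
struct CallFinallyPair
{
    BasicBlock* callFinally;    // the BBJ_CALLFINALLY head
    BasicBlock* callFinallyRet; // the continuation tail paired with it
};

ArrayStack<CallFinallyPair> callFinallyPairs(getAllocator(CMK_Generic));
// While walking the EH table:
//   callFinallyPairs.Push(CallFinallyPair{head, tail});
// After reordering, iterate the entries in insertion order to reunite each pair.
```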

{
if (block->hasTryIndex())
{
EHLayoutInfo& layoutInfo = regions.TopRef(block->getTryIndex());
Reviewer (Member):

Nit: I'd suggest indexing from the bottom instead, so that things end up stored in order instead of in reverse in this array.
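
Concretely (assuming ArrayStack exposes a bottom-indexed accessor such as BottomRef), the lookup would become:

```cpp
// Entry i now corresponds directly to try index i, in EH-table order.
EHLayoutInfo& layoutInfo = regions.BottomRef(block->getTryIndex());
```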

@jakobbotsch (Member) left a comment:

LGTM. Some minor feedback that you can feel free to address as you see fit.

@amanasifkhalid (Member Author):

@jakobbotsch @AndyAyersMS Thank you both for the reviews! No diffs now that the new layout is disabled by default.

@amanasifkhalid (Member Author):

Failures are #101721.

@amanasifkhalid merged commit 28c39bc into dotnet:main May 2, 2024
94 of 106 checks passed
@amanasifkhalid deleted the greedy-rpo branch May 2, 2024 21:04
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
@github-actions bot locked and limited conversation to collaborators Jun 4, 2024