JIT: Implement greedy RPO-based block layout #101473
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs are big, though. In my previous iterations of this PR, I wanted the block order to mirror the RPO as closely as possible, and repair try regions as necessary afterwards. I got this approach working when using funclets, but Windows x86's EH model complicated this by allowing handlers in the main method body too... I tried iterating on my initial implementation to handle the non-funclet EH case, but I was approaching two separate solutions, which didn't seem ideal -- plus all of these local fixups are something we're trying to get away from with the old layout approach. The diffs don't seem any more or less extreme when restricting reordering to within EH regions, and the code is quite a bit easier to follow.
Seems likely that some significant part of the size improvement is from not moving cold blocks out of line (not splitting, per se, just separating within the method or region). I think we may want this part to be implemented before we start running in the lab. It shouldn't be too hard to implement something like this: starting at the end of each EH region, move sufficiently cold blocks to the end, keeping them in relative RPO order. I'm ok taking that as a follow-up, if you think it makes sense. Though if it is indeed that simple, it would not complicate this PR too much, and this PR is delightfully simple compared to the code it will replace. You may recall in the example code I referred you to as a model, there are some exceptions to this rule, and perhaps some thresholds to consider -- for instance, sometimes we might not want to move very small cold blocks that have hot pred and succ blocks, if we're fairly confident in the data, assuming we'd end up with denser code by keeping the branch short even if it means having a (well-predicted) taken branch on the hot path.
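A minimal sketch of the "move sufficiently cold blocks to the end, keeping them in relative RPO order" idea (the `Block` struct, `sinkColdBlocks`, and the weight threshold here are hypothetical stand-ins, not the JIT's actual types): `std::stable_partition` preserves relative order within each partition, which is exactly the property we want.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model: each block has an id and a profile weight.
struct Block
{
    int    id;
    double weight;
};

// Sink sufficiently cold blocks to the end of the region, keeping them in
// their existing (RPO) relative order. std::stable_partition keeps the
// relative order of elements within both the hot and cold partitions.
void sinkColdBlocks(std::vector<Block>& region, double coldThreshold)
{
    std::stable_partition(region.begin(), region.end(),
                          [=](const Block& b) { return b.weight >= coldThreshold; });
}
```

The real pass would also need the exceptions discussed above (e.g. keeping small cold blocks with hot pred/succ in place), which this sketch omits.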
Can you explain a bit further why you want the regular successor DFS? It saddens me a bit to see it return after the work I did last year to move everything to the faithful DFS. When I did that work, I didn't find any justified uses of the old DFS. For this case, why can't we run the faithful DFS (say, with your modifications to change the visit order) and then partition the RPO into multiple ones by EH region? I don't quite understand why we need to give up on reordering of all EH regions (or regions in the main body that are dominated by an EH region).
My initial intent was to do a DFS of all successors without worrying about EH regions at all, and then evaluate how involved the needed EH fixups would be. Allowing the funclets to be broken up by the RPO complicated the post-ordering adjustments, as we'd have to take care not to make handler regions non-contiguous while fixing try regions, and vice versa. It seemed easier and not too consequential in terms of perf to just skip EH successors altogether, though the lack of funclets on Windows x86 made this problem unavoidable -- my attempts to appease win-x86 EH semantics were a distraction from the fact that I could just reorder within EH regions and not break anything up... I've pushed an implementation to try that out using a DFS of all successors.
I'll give that a try in my next push. I'll wait for the latest CI run to finish so we can see the updated diffs.
@AndyAyersMS I tried renumbering the blocks after finalizing the layout locally, and the diffs are nontrivial, though they don't lean overwhelmingly in either direction. Here they are on win-x64: Overall (-123,813 bytes)
FullOpts (-123,813 bytes)
Details: size improvements/regressions per collection
PerfScore improvements/regressions per collection
Context information
Also, the latest SPMI run shows the ratio of size improvements to regressions decreasing, though we're missing some collections.
Overall I still really like the shape of this -- a bit concerned with the complexity of maintaining the EH invariants, but we can live with it.
Left you a few comments to consider.
{
    RETURN_ON_ABORT(func(GetFalseTarget()));
}
else if (useProfile && (GetTrueEdge()->getLikelihood() < GetFalseEdge()->getLikelihood()))
Am I confused, or is this visiting the less likely successor first?
- else if (useProfile && (GetTrueEdge()->getLikelihood() < GetFalseEdge()->getLikelihood()))
+ else if (useProfile && (GetTrueEdge()->getLikelihood() > GetFalseEdge()->getLikelihood()))
It looks unintuitive, but I think we need to flip the comparison so the DFS is in the order we want. Consider the following block list, pre-ordering:
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014] 1 1 [???..???)-> BB02(1) (always) i keep internal
BB02 [0000] 1 BB01 1 100 [000..00E)-> BB04(0),BB03(1) ( cond ) i IBC
BB03 [0010] 1 BB02 0.50 50 [00D..00E)-> BB05(1) (always) i IBC nullcheck
BB04 [0011] 1 BB02 0 0 [00D..00E)-> BB05(1) (always) i IBC rare
BB05 [0012] 2 BB03,BB04 1 [00D..017) (return) i hascall gcsafe
When processing BB02, if we visit BB03 before BB04, then we end up with a postorder that looks something like [<BB03's successors>, BB03, BB04, BB02, ...] (i.e. BB04 precedes BB03 in the reverse postorder), so after layout, we get this:
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014] 1 1 [???..???)-> BB02(1) (always) i keep internal
BB02 [0000] 1 BB01 1 100 [000..00E)-> BB04(0),BB03(1) ( cond ) i IBC
BB04 [0011] 1 BB02 0 0 [00D..00E)-> BB05(1) (always) i IBC rare
BB03 [0010] 1 BB02 0.50 50 [00D..00E)-> BB05(1) (always) i IBC nullcheck
BB05 [0012] 2 BB03,BB04 1 [00D..017) (return) i hascall gcsafe
If we instead visit the less likely successor (BB04) first, we push the more likely successor BB03 up to BB02 in the RPO, and get this layout:
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0014] 1 1 [???..???)-> BB02(1) (always) i keep internal
BB02 [0000] 1 BB01 1 100 [000..00E)-> BB04(0),BB03(1) ( cond ) i IBC
BB03 [0010] 1 BB02 0.50 50 [00D..00E)-> BB05(1) (always) i IBC nullcheck
BB04 [0011] 1 BB02 0 0 [00D..00E)-> BB05(1) (always) i IBC rare
BB05 [0012] 2 BB03,BB04 1 [00D..017) (return) i hascall gcsafe
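The effect of the flipped comparison can be reproduced with a toy DFS/RPO over the graph above (the `dfs`/`rpoLayout` helpers and likelihood-sorted visit order below are a hypothetical sketch, not the JIT's implementation): visiting the less likely successor first makes the more likely successor land immediately after its predecessor in the reverse postorder.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Toy flowgraph: succs[b] lists (successor, edge likelihood) pairs for
// block b; block 0 is the entry. Hypothetical stand-in for the JIT's edges.
using Edge = std::pair<int, double>;

void dfs(int block, const std::vector<std::vector<Edge>>& succs,
         std::vector<bool>& visited, std::vector<int>& postOrder)
{
    visited[block] = true;
    // Visit successors in ascending likelihood, i.e. the LESS likely
    // successor first: the successor visited last gets the postorder number
    // just below ours, so it lands right after us in the reverse postorder.
    std::vector<Edge> ordered = succs[block];
    std::sort(ordered.begin(), ordered.end(),
              [](const Edge& a, const Edge& b) { return a.second < b.second; });
    for (const Edge& e : ordered)
    {
        if (!visited[e.first])
        {
            dfs(e.first, succs, visited, postOrder);
        }
    }
    postOrder.push_back(block);
}

std::vector<int> rpoLayout(const std::vector<std::vector<Edge>>& succs)
{
    std::vector<bool> visited(succs.size(), false);
    std::vector<int>  postOrder;
    dfs(0, succs, visited, postOrder);
    std::reverse(postOrder.begin(), postOrder.end());
    return postOrder;
}
```

With blocks 0..4 mirroring BB01..BB05 (block 1 branches to 3 with likelihood 0 and to 2 with likelihood 1), the RPO comes out as 0, 1, 2, 3, 4: the hot successor 2 directly follows its predecessor 1, and the rare block 3 is pushed after it.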
Yeah, I think because we're going to form an RPO then visiting the less-likely successor first is correct. If we wanted to view the depth-first spanning tree as a pseudo maximum weight tree then we'd do it the other way around.
Can you add a comment here?
Sure thing.
This logic seems very specialized to block layout and tied into how the DFS traversal works to end up with the result it wants. It makes it seem a bit odd for it to live in this very general utility function.
I'm ok with this for now, but if we end up with even more logic to handle other cases (like BBJ_SWITCH), then I'd suggest we introduce a separate version of the visitor that lives next to the block layout code. It would save a bit on throughput as well, since now everyone is paying for this useProfile check.
Agreed, I'll fix this in a follow-up PR. As for what the new abstraction should look like, would you prefer we move the useProfile check into AllSuccessorEnumerator, or even introduce a new enumerator like ProfileGuidedSuccessorEnumerator?
Perhaps introduce two instance initializer methods on AllSuccessorEnumerator and then pass some factory method to fgRunDfs? E.g. the normal use would be:
fgRunDfs([](SuccessorEnumerator* enumerator, BasicBlock* block) { enumerator->InitializeAllSuccs(block); }, ...);
and the block layout use could be
fgRunDfs([](SuccessorEnumerator* enumerator, BasicBlock* block) { enumerator->InitializeAllSuccsForBlockLayout(block); }, ...);
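The factory-callback shape suggested above can be sketched outside the JIT like this (the `SuccessorEnumerator` methods, graph representation, and `runDfs` driver here are hypothetical simplifications, not the actual JIT API): one DFS driver, with the per-block successor ordering injected by the caller.

```cpp
#include <cassert>
#include <vector>

// Hypothetical enumerator with two initializers: a plain one, and a
// layout-flavored one that changes the successor visit order (here we just
// reverse, as a stand-in for the likelihood-based ordering).
struct SuccessorEnumerator
{
    std::vector<int> succs;

    void InitializeAllSuccs(const std::vector<std::vector<int>>& graph, int block)
    {
        succs = graph[block];
    }

    void InitializeAllSuccsForBlockLayout(const std::vector<std::vector<int>>& graph, int block)
    {
        succs.assign(graph[block].rbegin(), graph[block].rend());
    }
};

template <typename TInit>
void dfsFrom(int block, const std::vector<std::vector<int>>& graph, TInit init,
             std::vector<bool>& visited, std::vector<int>& preorder)
{
    visited[block] = true;
    preorder.push_back(block);
    SuccessorEnumerator e;
    init(&e, block); // the factory callback decides the visit order
    for (int succ : e.succs)
    {
        if (!visited[succ])
        {
            dfsFrom(succ, graph, init, visited, preorder);
        }
    }
}

// One shared DFS driver; callers pass the initializer they want.
template <typename TInit>
std::vector<int> runDfs(const std::vector<std::vector<int>>& graph, TInit init)
{
    std::vector<bool> visited(graph.size(), false);
    std::vector<int>  preorder;
    dfsFrom(0, graph, init, visited, preorder);
    return preorder;
}
```

The appeal of this shape is that only the layout caller pays for the profile-aware ordering; the normal DFS path stays untouched.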
src/coreclr/jit/fgopt.cpp (Outdated)
fgDoReversePostOrderLayout();
fgMoveColdBlocks();

#ifdef DEBUG
You can just remove this whole #ifdef DEBUG bit; we will check the BB list at end of phase, which is imminent.
}
#endif // DEBUG

return true;
I assume it's unlikely that we won't move blocks here? The other variant tries to keep track, but I'm not sure it is worth the extra bookkeeping. All we are doing by sometimes returning false is speeding up the checked jit a tiny bit.
That was my reasoning for hard-coding the return value.
src/coreclr/jit/fgopt.cpp (Outdated)
}
#endif // DEBUG

// Compute DFS of main function body's blocks, using profile data to determine the order successors are visited in
This is now a DFS of everything, right? Not just the main function?
Yes, updated.
return;
}

// The RPO will break up call-finally pairs, so save them before re-ordering
Seems plausible that this map could be built up during the pass below, since the pair tail should follow the pair head in the RPO, so when we encounter the pair head neither block should have moved yet.
Thanks for pointing this out; this seems to work locally. Just to make sure I don't have to add an edge case for this, we would never expect the RPO traversal to begin with a call-finally pair, right?
This approach only hit one failure in an SPMI replay. I'm digging into this locally, but I adjusted my approach for now to (hopefully) be less expensive. Though with the current approach, we need to default-initialize entries in the regions ArrayStack for each EH region, so if we're going to loop over every EH clause anyway, then maybe this approach is cheaper in most cases anyway?
Yeah I like this new approach of just walking the EH table.
In principle the call finally pair tail is dominated by the call finally pair head, however depending on how we do the DFS we may see "infeasible cross flow" .. say we have a finally with two callfinallies, it should not be possible to call via one callfinally and return to the other's tail, but may look feasible in the flow graph.
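The "infeasible cross flow" point can be illustrated with a toy reachability check (hypothetical graph and helper, not JIT code): with one finally F called from two callfinally heads H1 and H2, the flow graph gives F edges to both pair tails, so H1 appears to reach H2's tail even though the runtime pairing makes that path impossible.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Plain BFS reachability over a successor-list flow graph.
bool reachable(const std::vector<std::vector<int>>& succs, int from, int to)
{
    std::vector<bool> seen(succs.size(), false);
    std::queue<int>   work;
    work.push(from);
    seen[from] = true;
    while (!work.empty())
    {
        int b = work.front();
        work.pop();
        if (b == to)
        {
            return true;
        }
        for (int s : succs[b])
        {
            if (!seen[s])
            {
                seen[s] = true;
                work.push(s);
            }
        }
    }
    return false;
}
```

With blocks 0 = H1, 1 = T1 (H1's tail), 2 = H2, 3 = T2, 4 = F, and edges H1 -> F, H2 -> F, F -> T1, F -> T2, the graph says H1 reaches T2, even though at runtime the finally called via H1 can only return to T1.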
}

//-----------------------------------------------------------------------------
// fgMoveColdBlocks: Move rarely-run blocks to the end of their respective regions.
Am curious why this doesn't need to update the try region ends?
Since this is finding the last block in region, and then doing insert before, it seems like it might leave one hot block at the end?
Yeah, I have the same reservations about this approach; Phoenix does something similar, though the last block in a region seems to have different semantic meaning there. I could just insert the cold blocks after the last block in the region to avoid potentially pushing a hot exit block to the end, though that might introduce back edges on the cold path -- is that ok?
Phoenix had various "end region" constructs so yes it was different.
Got it, I've adjusted my approach to insert cold blocks after the end of the region, and then move the insertion point to the end as well, but only if it is cold.
src/coreclr/jit/fgopt.cpp (Outdated)
return false;
}

// Don't move any part of a call-finally pair -- these need to stay together
Seems like we could handle these, by skipping the pair tail (like you do now) and then moving both if we want to move the pair head?
src/coreclr/jit/fgopt.cpp (Outdated)
// Instead, set the end of the region to the BBJ_CALLFINALLY block in the pair.
// Also, don't consider blocks in handler regions.
// (If all of some try region X is enclosed in an exception handler,
// lastColdTryBlocks[X] will be null. We will handle this case later.)
Maybe be a bit more specific?
- // lastColdTryBlocks[X] will be null. We will handle this case later.)
+ // lastColdTryBlocks[X] will be null. We will check for this case in tryMovingBlocks below.)
src/coreclr/jit/fgopt.cpp (Outdated)
// (for example, by pushing throw blocks unreachable via normal flow to the end of the region).
// First, determine the new EH region ends.
//
BasicBlock** const tryRegionEnds = new (this, CMK_Generic) BasicBlock* [compHndBBtabCount] {};
Consider using ArrayStack<struct> for this data (where struct contains your 3 data items). It will usually be able to avoid heap allocation.
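A simplified sketch of why that helps (this SmallStack is a hypothetical stand-in for the JIT's ArrayStack<T>, not its real implementation): the container keeps a small inline buffer and only heap-allocates once the element count outgrows it, so typical methods with few EH regions never touch the heap.

```cpp
#include <cassert>
#include <cstddef>

// Stack with a small inline buffer; spills to the heap only when the
// element count exceeds InlineCapacity. Copying is not supported.
template <typename T, size_t InlineCapacity = 8>
class SmallStack
{
    T      m_inline[InlineCapacity];
    T*     m_data     = m_inline;
    size_t m_count    = 0;
    size_t m_capacity = InlineCapacity;

public:
    ~SmallStack()
    {
        if (m_data != m_inline)
        {
            delete[] m_data;
        }
    }

    void Push(const T& value)
    {
        if (m_count == m_capacity)
        {
            // Inline buffer (or previous heap buffer) is full: grow on the heap.
            T* bigger = new T[m_capacity * 2];
            for (size_t i = 0; i < m_count; i++)
            {
                bigger[i] = m_data[i];
            }
            if (m_data != m_inline)
            {
                delete[] m_data;
            }
            m_data = bigger;
            m_capacity *= 2;
        }
        m_data[m_count++] = value;
    }

    size_t Height() const { return m_count; }
    T&     BottomRef(size_t i) { return m_data[i]; }              // index from the bottom
    T&     TopRef(size_t i) { return m_data[m_count - 1 - i]; }   // index from the top
    bool   UsesHeap() const { return m_data != m_inline; }
};
```

Grouping the three per-region items into one struct per entry also means one container instead of three parallel arrays, and the BottomRef/TopRef pair shows the two indexing directions discussed later in this review.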
@AndyAyersMS thank you for the review! Diffs are more conservative now that we're moving cold blocks, though still pretty big. Since we have TP to spare, do you think we should renumber blocks post-layout in this PR? Or maybe in a follow-up?
If we're going to do it eventually, we might as well do it now; one less churn event to sort through.
Overall this looks good to me.
It's a bit sad that we can't treat the "not in try" region symmetrically with the try regions, but I think having a bit of duplicated logic is ok and we can sort this out later if we think it matters.
I would give @jakobbotsch a chance to sign off too.
src/coreclr/jit/jitconfigvalues.h (Outdated)
@@ -754,6 +754,9 @@ RELEASE_CONFIG_INTEGER(JitEnablePhysicalPromotion, W("JitEnablePhysicalPromotion
// Enable cross-block local assertion prop
RELEASE_CONFIG_INTEGER(JitEnableCrossBlockLocalAssertionProp, W("JitEnableCrossBlockLocalAssertionProp"), 1)

// Do greedy RPO-based layout in Compiler::fgReorderBlocks.
RELEASE_CONFIG_INTEGER(JitDoReversePostOrderLayout, W("JitDoReversePostOrderLayout"), 1);
Make sure to remember to undo this before merging.
Got it. I'm gonna do one more CI run with block renumbering to compare the diffs. I'll mark as no-merge for now to remind myself.
Here are the diffs with block renumbering. Size improvements overall are slightly bigger, and TP is slightly improved, too.
src/coreclr/jit/fgopt.cpp (Outdated)
// Fix up call-finally pairs
//
for (BlockToBlockMap::Node* const iter : BlockToBlockMap::KeyValueIteration(&callFinallyPairs))
Doesn't seem like we actually use callFinallyPairs as a map, so it could presumably just be an ArrayStack of pairs.
src/coreclr/jit/fgopt.cpp (Outdated)
{
    if (block->hasTryIndex())
    {
        EHLayoutInfo& layoutInfo = regions.TopRef(block->getTryIndex());
Nit: I'd suggest indexing from the bottom instead, so that things end up stored in order instead of in reverse in this array.
LGTM. Some minor feedback that you can feel free to address as you see fit.
@jakobbotsch @AndyAyersMS Thank you both for the reviews! No diffs now that the new layout is disabled by default.
Failures are #101721.
Part of #93020.

Compiler::fgDoReversePostOrderLayout reorders blocks based on an RPO of the flowgraph's successor edges. When reordering based on the RPO, we only reorder blocks within the same EH region to avoid breaking up their contiguousness. After establishing an RPO-based layout, we do another pass to move cold blocks to the ends of their regions in fgMoveColdBlocks.

The "greedy" part of this layout isn't all that greedy just yet. For now, we use edge likelihoods to make placement decisions only for BBJ_COND blocks' successors. I plan to extend this greediness to other multi-successor block kinds (BBJ_SWITCH, etc.) in a follow-up so we can independently evaluate the value in doing so.

This new layout is disabled by default for now.