JIT-generated code run time varies wildly #106117
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Do you know what hardware this runs on? Certain versions of certain processors are very sensitive to the alignment of branches in memory, and perhaps that is what is leading to the variation you see. cc @dotnet/jit-contrib
@AndyAyersMS we're running it on a 1U Ultra Dual AMD EPYC 7002 server with two "AMD EPYC 7343 16C/32T 3.2/3.9GHz 128MB 190W" CPUs
We need more info and to work with the author to understand what we should do. Putting it in Future because we don't have time for .NET 9.
What other info do you need, @JulieLeeMSFT?
Would it be possible for you to run this on Intel HW or maybe an AMD Zen4? Wondering if we're seeing something specific to the Zen3 microarchitecture here.
So it looks like there is still some variability, but not nearly as bad as on the AMD64 Zen3 HW? The disassembly above is indeed identical modulo layout. Given that this method does a fair amount of calling and does not contain a loop, it is hard to see why the layout would matter all that much. @tannergooding do we know of any microarchitectural oddities around Zen3?

One thing that surprises me a little is that the profile data is pretty "thin" -- we are only seeing ~30 calls, which is the minimum we'd ever see. It's also a little unusual to see a class init call in a Tier1 method, since usually classes get initialized by the Tier0 versions. Can you share out (privately if necessary) the full

As an experiment, you might try increasing the tiering threshold some, say
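The tiering-threshold experiment can be sketched with the runtime's call-count knob; the knob name is a real .NET runtime setting, but the specific value below is purely illustrative, since the number actually suggested in the comment is not visible in this thread.

```shell
# DOTNET_TC_CallCountThreshold controls how many times a method runs at
# Tier0 before it is promoted to Tier1 (the default is 30 calls).
# Note that DOTNET_* numeric settings are interpreted as hexadecimal,
# so 7D0 here means 2000 decimal. The value is illustrative only.
export DOTNET_TC_CallCountThreshold=7D0
./YourApp   # placeholder for the actual self-contained executable
```

A higher threshold lets methods accumulate more call counts and richer PGO data before rejit, which can change both the Tier1 codegen and its layout.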
Like you said, the code between the fast and slow versions isn't really that much different; it's primarily that

The most notable potential quirk is that Zen3 has a perf hit if there are more than two jump instructions in a 16-byte aligned window, but it's not a significant hit and there isn't a loop here that would massively impact things (it's more than 3 in the same window for Zen 4). My guess is the long jumps are impacting the Zen3 decode/prefetch pipeline and causing a throughput hit.

It's also possible that this is a general code alignment issue. The latest CPUs fetch in 32-64 byte aligned blocks (typically a fetch window can be done every 1-2 cycles) and then typically decode in 16-32 byte windows (physical cores typically have 2 decode units, one for each logical core in a hyperthreaded system). So even if the method is 32-byte aligned, if the jump target differences change whether a given target is 16-byte aligned, that can negatively impact throughput.

That's just a guess though; I don't see anything "obvious" here and would recommend profiling with something like AMD uProf (https://www.amd.com/en/developer/uprof.html) to see if it highlights any particular part of the assembly as causing the bottleneck.
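A uProf capture along those lines might look like the following sketch; the binary name and output directory are placeholders, and the exact CLI flags should be double-checked against the AMD uProf version installed, as they vary between releases.

```shell
# Time-based sampling (tbp) of one slow instance using AMD uProf's CLI.
# Flags here are a best-effort sketch, not verified against a specific
# AMDuProfCLI version; /tmp/uprof-slow and ./YourApp are placeholders.
AMDuProfCLI collect --config tbp -o /tmp/uprof-slow ./YourApp

# Produce a report highlighting the hottest functions and instructions:
AMDuProfCLI report -i /tmp/uprof-slow
```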
What's puzzling is that there is no fast path through this method; every call does a fair amount of work, including making numerous other calls. Even if the method is called a lot (which seems likely), the impact of fetch windows or branch density delays should not be anywhere near this bad.

I wonder if dotnet-trace is giving a misleading picture. For instance, it shows one of the dominant callees is

Since this is on Linux, perhaps we can use
Did a little bit more testing with the suggested

So, on Zen3 the differences are still there, but the generated assembly code now looks pretty much identical. See JIT summaries:

I also ran it on another Zen3 machine at our disposal to rule out a machine-specific hardware fault - the results show the same behavior.
@oleg-loop54 as I was saying above, I wonder if the performance issue is somewhere else. You might consider, if possible, capturing slow and fast run profiles with perfcollect and sharing those (we can find ways to keep the files private if that's a concern). Likely 15 seconds or so should be enough. |
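A minimal perfcollect capture along those lines could look like this; the script URL is the official dotnet shortlink, the trace name is arbitrary, and the flags should be verified against the current perfcollect documentation.

```shell
# One-time setup: fetch the official dotnet perfcollect script and
# install the perf/LTTng prerequisites it depends on.
curl -OL https://aka.ms/perfcollect
chmod +x perfcollect
sudo ./perfcollect install

# In the shell that launches the app, let the JIT emit perf map files
# so managed frames resolve to method names in the trace:
export DOTNET_PerfMapEnabled=1

# ...start the app and route traffic to it, then capture ~15 seconds:
sudo ./perfcollect collect slow-instance -collectsec 15
```

Capturing one trace from a fast instance and one from a slow instance under identical traffic would let the two profiles be diffed directly.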
Here are some other things you might try, if you have the time to experiment. These may reduce overall performance but may also reduce variability:
If one or more of these is stable then that might help us focus subsequent investigations. |
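The list of things to try is not visible in this thread, but settings of this kind are typically runtime knobs that trade peak throughput for more deterministic codegen. The variables below are real DOTNET_ settings; whether these specific ones were suggested is an assumption.

```shell
# Each of these makes codegen more deterministic at some throughput cost.
# Try them one at a time, not all together:
export DOTNET_TieredPGO=0          # keep tiering, but disable dynamic PGO
export DOTNET_TieredCompilation=0  # compile everything once, at full opt
export DOTNET_ReadyToRun=0         # ignore precompiled R2R bodies, always JIT
```

If variability disappears under one of these, that narrows the cause to the corresponding stage of the compilation pipeline.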
@oleg-loop54 it's been a few months, any news to report? |
Sorry, been occupied with other stuff. Will try to get back to you with perfcollect profiles. |
Description
We have an application that processes some network traffic. When several instances of that application are run simultaneously on an on-premises server, processing the very same requests takes different amounts of time.
There is plenty of RAM and CPU available on the server while these application instances run. The code is the same, built on the same framework version and published as self-contained; the same executable is started several times under the same user, and each instance is fed the same traffic.
Configuration
Built and published using .NET SDK 8.0.303
OS:
Linux <..> 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux, Ubuntu 24.04 LTS

Main executable is built with these:
Data
Average response times for 4 instances, running simultaneously on the same server. Started roughly at the same time, traffic switched on at exactly the same time:
Did some profiling with dotnet-trace (using sampling mode); worst and best are shown below:
So, the MatchQueryWord function took ~13s in one case and ~19s in another, processing the same ~1500 requests.
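For reference, a sampling-mode capture like the one described can be taken with the dotnet-trace global tool; the process id below is a placeholder for one of the running instances.

```shell
# Attach to a running instance and sample CPU stacks; stop with Ctrl+C.
# 12345 is a placeholder PID for one of the app instances.
dotnet-trace collect --profile cpu-sampling --process-id 12345
```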
Also did
export DOTNET_JitDisasm=MatchQueryWord DOTNET_JitDisasmDiffable=1 DOTNET_JitDisasmSummary=1 DOTNET_JitStdOutFile=/tmp/engineX_jit.txt
The final code for the most performant instance is:
for the worst performant instance:
The biggest difference I see here is the use of je vs jne just before label G_M000_IG03, and the use of a short jump. I find it hard to believe that this causes such a huge impact (13s vs 19s).

Another thing that is strange is that IsInstanceOfClass has different run times although the generated code size (and I assume the code itself) is the same for both the best and worst instances:

<..> JIT compiled System.Runtime.CompilerServices.CastHelpers:IsInstanceOfClass(ulong,System.Object) [Tier1 with Dynamic PGO, IL size=97, code size=88]
It's as if one instance just "got into a wrong place" (??) and consistently runs slower than another.
If I set up a separate dedicated cloud server per instance, then repeating the same scenario yields identical processing times.
I'm stumped as to where to look next, as this behavior hinders performance testing of our solution, so any suggestions would be most welcome! Thanks in advance!