Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT-generated code run time varies wildly #106117

Open
oleg-loop54 opened this issue Aug 8, 2024 · 15 comments
Open

JIT-generated code run time varies wildly #106117

oleg-loop54 opened this issue Aug 8, 2024 · 15 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@oleg-loop54
Copy link

Description

We have an application that processes some network traffic. When several instances of that application is being run simultaneously on an on-premises server, processing the very same requests takes different times.

There are plenty of RAM and CPU available on the server while running these application instances. The code is the same, built on the same framework version, published as self-contained, the same executable is started several times under the same user and each instance is fed the same traffic.

Configuration

Built and published using .NET SDK 8.0.303
OS: Linux <..> 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux, Ubuntu 24.04 LTS

Main executable is built with these:

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ServerGarbageCollection>true</ServerGarbageCollection>
    <GarbageCollectionAdaptationMode>1</GarbageCollectionAdaptationMode>
  </PropertyGroup>

Data

Average response times for 4 instances, running simultaneously on the same server. Started roughly at the same time, traffic switched on exactly at the same time:
perf_instance_avg

Did some profiling with dotnetTrace (using sampling mode), worst and best shown below:
perf_benchmark

So, MatchQueryWord function took ~13s in one case and ~19s in another, processing same ~1500 requests.

Also did
export DOTNET_JitDisasm=MatchQueryWord DOTNET_JitDisasmDiffable=1 DOTNET_JitDisasmSummary=1 DOTNET_JitStdOutFile=/tmp/engineX_jit.txt

The final code for most performant instance is:

; Assembly listing for method Loop54.Engine.Modules.LanguageModel.MasterNodeSearch:MatchQueryWord(Loop54.Engine.Modules.LanguageModel.DataContainer,ushort[]):System.Collections.Generic.HashSet`1[Loop54.Engine.Spellmasters.LanguageModelTypes.InternalWordType]:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Unix
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rbp based frame
; partially interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 30
; 6 inlinees with PGO data; 23 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:
       push     rbp
       push     r15
       push     r14
       push     r13
       push     r12
       push     rbx
       push     rax
       lea      rbp, [rsp+0x30]
       mov      rbx, rsi
       mov      r15, rdx
 
G_M000_IG02:
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r14, rax
       mov      rdi, r14
       xor      rsi, rsi
       call     [System.Collections.Generic.HashSet`1[System.__Canon]:.ctor(System.Collections.Generic.IEqualityComparer`1[System.__Canon]):this]
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       mov      eax, dword ptr [r15+0x08]
       add      eax, 1
       jo       G_M000_IG14
       test     eax, eax
       jl       G_M000_IG15
       test     eax, eax
       jne      SHORT G_M000_IG04
 
G_M000_IG03:
       mov      rdi, 0xD1FFAB1E
       mov      esi, 0x485
       call     CORINFO_HELP_CLASSINIT_SHARED_DYNAMICCLASS
       mov      rdi, 0xD1FFAB1E
       mov      rsi, gword ptr [rdi]
       lea      rdi, bword ptr [r13+0x08]
       call     CORINFO_HELP_ASSIGN_REF
       jmp      SHORT G_M000_IG05
 
G_M000_IG04:
       movsxd   rsi, eax
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWARR_1_OBJ
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, rax
       call     CORINFO_HELP_ASSIGN_REF
 
G_M000_IG05:
       cmp      byte  ptr [rbx+0x7A], 0
       je       SHORT G_M000_IG08
 
G_M000_IG06:
       mov      edi, dword ptr [r15+0x08]
       cmp      edi, dword ptr [rbx+0x60]
       jl       SHORT G_M000_IG08
 
G_M000_IG07:
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, r14
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, gword ptr [rbx+0x10]
       mov      rsi, r15
       cmp      dword ptr [rdi], edi
       call     [Loop54.Engine.Spellmasters.LanguageModelTypes.TSTRootNode`1[System.__Canon]:GetReverseCandidates(ushort[]):System.Collections.Generic.IEnumerable`1[System.__Canon]:this]
       mov      rdx, rax
       mov      rsi, r12
       mov      rdi, 0xD1FFAB1E
       call     [Microsoft.FSharp.Collections.SeqModule:Iterate[System.__Canon](Microsoft.FSharp.Core.FSharpFunc`2[System.__Canon,System.__Canon],System.Collections.Generic.IEnumerable`1[System.__Canon])]
 
G_M000_IG08:
       mov      r9, 0xD1FFAB1E
       mov      r9, gword ptr [r9]
       mov      rdi, r15
       mov      rsi, rbx
       mov      rdx, r14
       mov      rcx, r13
       xor      r8d, r8d
       call     [<StartupCode$FSharpModules>.$MasterNodeSearch:traverse@353-2(ushort[],Loop54.Engine.Modules.LanguageModel.DataContainer,System.Collections.Generic.HashSet`1[Loop54.Engine.Spellmasters.LanguageModelTypes.InternalWordType],System.Collections.Generic.List`1[System.Collections.Generic.HashSet`1[System.Tuple`3[Loop54.Engine.Spellmasters.LanguageModelTypes.TSTWordNode,int,double]]],int,Microsoft.FSharp.Collections.FSharpList`1[Loop54.Engine.Spellmasters.LanguageModelTypes.InternalWordType])]
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, r14
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, r13
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, r12
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       mov      qword ptr [r13+0x18], rdi
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       mov      rdi, r13
       mov      rsi, 0xD1FFAB1E
       call     [System.ArgumentNullException:ThrowIfNull(System.Object,System.String)]
       lea      rdi, bword ptr [r12+0x10]
       mov      rsi, r13
       call     CORINFO_HELP_ASSIGN_REF
       mov      edi, 2
       xor      esi, esi
       call     [System.LazyHelper:Create(int,bool):System.LazyHelper]
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, rax
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, r12
       cmp      byte  ptr [rbx+0x79], 0
       je       SHORT G_M000_IG11
 
G_M000_IG09:
       cmp      gword ptr [rdi+0x08], 0
       jne      G_M000_IG16
       movzx    r13, byte  ptr [rdi+0x18]
 
G_M000_IG10:
       test     r13d, r13d
       je       SHORT G_M000_IG13
 
G_M000_IG11:
       mov      rax, r14
 
G_M000_IG12:
       add      rsp, 8
       pop      rbx
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       pop      rbp
       ret      
 
G_M000_IG13:
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, r14
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, rbx
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, gword ptr [rbx+0x10]
       mov      rsi, r15
       cmp      dword ptr [rdi], edi
       call     [Loop54.Engine.Spellmasters.LanguageModelTypes.TSTRootNode`1[System.__Canon]:FindNode(ushort[]):Microsoft.FSharp.Core.FSharpOption`1[Loop54.Engine.Spellmasters.LanguageModelTypes.TSTChildNode]:this]
       test     rax, rax
       je       SHORT G_M000_IG11
       mov      rbx, gword ptr [rax+0x08]
       mov      rdi, r12
       mov      rsi, rbx
       call     [<StartupCode$FSharpModules>.$MasterNodeSearch+MatchQueryWord@405-3:Invoke(Loop54.Engine.Spellmasters.LanguageModelTypes.TSTChildNode):bool:this]
       test     eax, eax
       je       SHORT G_M000_IG11
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r15, rax
       lea      rdi, bword ptr [r15+0x08]
       mov      rsi, rbx
       call     CORINFO_HELP_ASSIGN_REF
       mov      rsi, gword ptr [r15+0x08]
       mov      rdi, r13
       call     [<StartupCode$FSharpModules>.$MasterNodeSearch+MatchQueryWord@406-1:Invoke(Loop54.Engine.Spellmasters.LanguageModelTypes.TSTChildNode):Microsoft.FSharp.Core.Unit:this]
       jmp      G_M000_IG11
 
G_M000_IG14:
       call     CORINFO_HELP_OVERFLOW
 
G_M000_IG15:
       mov      edi, 22
       mov      esi, 13
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException(int,int)]
       int3     
 
G_M000_IG16:
       call     [System.Lazy`1[bool]:CreateValue():bool:this]
       mov      r13d, eax
       jmp      G_M000_IG10
 
; Total bytes of code 699

for worst performant instance:

; Assembly listing for method Loop54.Engine.Modules.LanguageModel.MasterNodeSearch:MatchQueryWord(Loop54.Engine.Modules.LanguageModel.DataContainer,ushort[]):System.Collections.Generic.HashSet`1[Loop54.Engine.Spellmasters.LanguageModelTypes.InternalWordType]:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Unix
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rbp based frame
; partially interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 29
; 6 inlinees with PGO data; 23 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:
       push     rbp
       push     r15
       push     r14
       push     r13
       push     r12
       push     rbx
       push     rax
       lea      rbp, [rsp+0x30]
       mov      rbx, rsi
       mov      r15, rdx
 
G_M000_IG02:
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r14, rax
       mov      rdi, r14
       xor      rsi, rsi
       call     [System.Collections.Generic.HashSet`1[System.__Canon]:.ctor(System.Collections.Generic.IEqualityComparer`1[System.__Canon]):this]
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       mov      eax, dword ptr [r15+0x08]
       add      eax, 1
       jo       G_M000_IG13
       test     eax, eax
       jl       G_M000_IG14
       test     eax, eax
       je       G_M000_IG11
 
G_M000_IG03:
       movsxd   rsi, eax
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWARR_1_OBJ
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, rax
       call     CORINFO_HELP_ASSIGN_REF
 
G_M000_IG04:
       cmp      byte  ptr [rbx+0x7A], 0
       je       SHORT G_M000_IG06
 
G_M000_IG05:
       mov      edi, dword ptr [r15+0x08]
       cmp      edi, dword ptr [rbx+0x60]
       jl       SHORT G_M000_IG06
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, r14
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, gword ptr [rbx+0x10]
       mov      rsi, r15
       cmp      dword ptr [rdi], edi
       call     [Loop54.Engine.Spellmasters.LanguageModelTypes.TSTRootNode`1[System.__Canon]:GetReverseCandidates(ushort[]):System.Collections.Generic.IEnumerable`1[System.__Canon]:this]
       mov      rdx, rax
       mov      rsi, r12
       mov      rdi, 0xD1FFAB1E
       call     [Microsoft.FSharp.Collections.SeqModule:Iterate[System.__Canon](Microsoft.FSharp.Core.FSharpFunc`2[System.__Canon,System.__Canon],System.Collections.Generic.IEnumerable`1[System.__Canon])]
 
G_M000_IG06:
       mov      r9, 0xD1FFAB1E
       mov      r9, gword ptr [r9]
       mov      rdi, r15
       mov      rsi, rbx
       mov      rdx, r14
       mov      rcx, r13
       xor      r8d, r8d
       call     [<StartupCode$FSharpModules>.$MasterNodeSearch:traverse@353-2(ushort[],Loop54.Engine.Modules.LanguageModel.DataContainer,System.Collections.Generic.HashSet`1[Loop54.Engine.Spellmasters.LanguageModelTypes.InternalWordType],System.Collections.Generic.List`1[System.Collections.Generic.HashSet`1[System.Tuple`3[Loop54.Engine.Spellmasters.LanguageModelTypes.TSTWordNode,int,double]]],int,Microsoft.FSharp.Collections.FSharpList`1[Loop54.Engine.Spellmasters.LanguageModelTypes.InternalWordType])]
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, r14
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, r13
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, r12
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       mov      qword ptr [r13+0x18], rdi
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       mov      rdi, r13
       mov      rsi, 0xD1FFAB1E
       call     [System.ArgumentNullException:ThrowIfNull(System.Object,System.String)]
       lea      rdi, bword ptr [r12+0x10]
       mov      rsi, r13
       call     CORINFO_HELP_ASSIGN_REF
       mov      edi, 2
       xor      esi, esi
       call     [System.LazyHelper:Create(int,bool):System.LazyHelper]
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, rax
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, r12
       cmp      byte  ptr [rbx+0x79], 0
       je       SHORT G_M000_IG09
 
G_M000_IG07:
       cmp      gword ptr [rdi+0x08], 0
       jne      G_M000_IG15
       movzx    r13, byte  ptr [rdi+0x18]
 
G_M000_IG08:
       test     r13d, r13d
       je       SHORT G_M000_IG12
 
G_M000_IG09:
       mov      rax, r14
 
G_M000_IG10:
       add      rsp, 8
       pop      rbx
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       pop      rbp
       ret      
 
G_M000_IG11:
       mov      rdi, 0xD1FFAB1E
       mov      esi, 0x47B
       call     CORINFO_HELP_CLASSINIT_SHARED_DYNAMICCLASS
       mov      rdi, 0xD1FFAB1E
       mov      rsi, gword ptr [rdi]
       lea      rdi, bword ptr [r13+0x08]
       call     CORINFO_HELP_ASSIGN_REF
       jmp      G_M000_IG04
 
G_M000_IG12:
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r13, rax
       lea      rdi, bword ptr [r13+0x08]
       mov      rsi, r14
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r12, rax
       lea      rdi, bword ptr [r12+0x08]
       mov      rsi, rbx
       call     CORINFO_HELP_ASSIGN_REF
       mov      rdi, gword ptr [rbx+0x10]
       mov      rsi, r15
       cmp      dword ptr [rdi], edi
       call     [Loop54.Engine.Spellmasters.LanguageModelTypes.TSTRootNode`1[System.__Canon]:FindNode(ushort[]):Microsoft.FSharp.Core.FSharpOption`1[Loop54.Engine.Spellmasters.LanguageModelTypes.TSTChildNode]:this]
       test     rax, rax
       je       G_M000_IG09
       mov      rbx, gword ptr [rax+0x08]
       mov      rdi, r12
       mov      rsi, rbx
       call     [<StartupCode$FSharpModules>.$MasterNodeSearch+MatchQueryWord@405-3:Invoke(Loop54.Engine.Spellmasters.LanguageModelTypes.TSTChildNode):bool:this]
       test     eax, eax
       je       G_M000_IG09
       mov      rdi, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      r15, rax
       lea      rdi, bword ptr [r15+0x08]
       mov      rsi, rbx
       call     CORINFO_HELP_ASSIGN_REF
       mov      rsi, gword ptr [r15+0x08]
       mov      rdi, r13
       call     [<StartupCode$FSharpModules>.$MasterNodeSearch+MatchQueryWord@406-1:Invoke(Loop54.Engine.Spellmasters.LanguageModelTypes.TSTChildNode):Microsoft.FSharp.Core.Unit:this]
       jmp      G_M000_IG09
 
G_M000_IG13:
       call     CORINFO_HELP_OVERFLOW
 
G_M000_IG14:
       mov      edi, 22
       mov      esi, 13
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException(int,int)]
       int3     
 
G_M000_IG15:
       call     [System.Lazy`1[bool]:CreateValue():bool:this]
       mov      r13d, eax
       jmp      G_M000_IG08
 
; Total bytes of code 714

the biggest difference I see here is usage of je vs jne just before label G_M000_IG03 and using short jump. I find it hard to believe that this causes such huge impact (13s vs 19s).

Another thing that is strange is that IsInstanceOfClass has different run times although generated code size (and I assume the code itself) is the same for both best and worst instances.

<..> JIT compiled System.Runtime.CompilerServices.CastHelpers:IsInstanceOfClass(ulong,System.Object) [Tier1 with Dynamic PGO, IL size=97, code size=88]

It's as if one instance just "got into a wrong place" (??) and consistently runs slower than another.

If I set up a separate dedicated cloud server per instance, then repeating the same scenario yields identical processing times.

I'm stumbled where to look next, as this behavior hinders performance testing of our solution, so any suggestions would be most welcome! Thanks in advance!

@oleg-loop54 oleg-loop54 added the tenet-performance Performance related issue label Aug 8, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 8, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 8, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Copy link
Member

Do you know what hardware this runs on? Certain versions of certain processors are very sensitive to the alignment of branches in memory, and perhaps that is what is leading to the variation you see.

cc @dotnet/jit-contrib

@oleg-loop54
Copy link
Author

@AndyAyersMS we're running it on 1U Ultra Dual AMD Epyc 7002 server with two "AMD EPYC 7343 16C/32T 3.2/3.9GHz 128MB 190W" CPUs

@JulieLeeMSFT JulieLeeMSFT modified the milestones: 10.0.0, Future Aug 12, 2024
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Aug 12, 2024
@JulieLeeMSFT
Copy link
Member

We need more info and work with the author to understand what we should do. Putting it to Future because we don't have time for .NET 9.

@loop-evgeny
Copy link

What other info do you need @JulieLeeMSFT ?

@AndyAyersMS
Copy link
Member

Would it be possible for you to run this on Intel HW or maybe an AMD Zen4? Wondering if we're seeing something specific to the Zen3 microarchitecture here.

@oleg-loop54
Copy link
Author

oleg-loop54 commented Aug 20, 2024

Ran the test on Intel Xeon 6210U (20 cores in total):

first run, avg service times per instance ranging from 146ms to 163 ms
test_xeon6210_1

second run, avg service times: from 171ms to 180ms
test_xeon6210_2

Also ran it on 2x EPYC 9274F (24 cores each, so 48 cores in total):

first run, avg service times: from 88ms to 115ms
test_amdzen4_1

second run, avg service times: from 90ms to 105ms
test_amdzen4_2

@AndyAyersMS
Copy link
Member

So it looks like there is still some variability, but not nearly as bad as on the AMD64 Zen3 HW?

The disassembly above is indeed identical modulo layout. Given that this method does a fair amount of calling and does not contain a loop it is hard to see why the layout would matter all that much.

@tannergooding do we know of any microarchitectural oddities around Zen3?

One thing that surprises me a little is that the profile data is pretty "thin" -- we are only seeing ~30 calls, which is the minimum we'd ever see. It's also a little unusual to see a class init call in a Tier1 method, since usually classes get initialized by the Tier0 versions.

Can you share out (privately if necessary) the full DOTNET_JitDisasmSummary?

As an experiment, you might try increasing the tiering threshold some, say DOTNET_TC_CallCountThreshold=200 (value is parsed as hex, so this should mean 512 calls), to see if that leads to more predictable results.

@tannergooding
Copy link
Member

tannergooding commented Aug 21, 2024

do we know of any microarchitectural oddities around Zen3?

Like you said, the code between the fast and slow version isn't really that much different even, it's primarily that G_M000_IG03 is instead placed at G_M000_IG11 and so 3 jumps become long jumps rather than the short jumps they are in the fast version.

The most notably potential quirk is Zen3 has a perf hit if there are more than two jump instructions in a 16-byte aligned window, but its not a significant hit and there isn't a loop here that would massively impact things (its more than 3 in the same window for Zen 4).

My guess is the long jumps are impacting the Zen3 decode/prefetch pipeline and its causing a throughput hit. It's also possible that this is a general code alignment issue. The latest CPUs fetch in 32-64 byte aligned blocks (typically a fetch window can be done every 1-2 cycles) and then typically decode in 16-32 byte windows (physical cores typically have 2 decode units, one for each logical core in a hyperthreaded system). So even if the method is 32-byte aligned, then if the jump target differences cause a difference in whether a given target is 16-byte aligned that can negatively impact throughput.

-- That's just a guess though, I don't see anything "obvious" here and would recommend profiling with something like AMD uProf (https://www.amd.com/en/developer/uprof.html) to see if it highlights any particular part of the assembly as causing the bottleneck.

@AndyAyersMS
Copy link
Member

What's puzzling is that there is no fast path through this method, every call does a fair amount of work, including making numerous other calls. Even if the method is called a lot (which seems likely), the impact of fetch windows or branch density delays would not be anywhere near this bad.

I wonder if dotnetTrace is giving a misleading picture. For instance, it shows one of the dominant callees is IsInstanceOfClass, but from the disasm it's not an immediate callee. Maybe one of the immediate callees tail calls into this helper, but that seems a bit unlikely. On the other hand some bits here are clearly from F# so perhaps there are tail calls.

Since this is on Linux perhaps we can use perf to drill in...?

@oleg-loop54
Copy link
Author

oleg-loop54 commented Aug 26, 2024

Did a little bit more testing with the suggested DOTNET_TC_CallCountThreshold=200:
test_amdzen3_callcount512

So, on Zen3 differences are still there, but the generated assembly code now looks pretty much identical. See JIT summaries:
zen3_callcount512.zip

I also ran it on another Zen3 machine at our disposal to exclude if this is a machine-specific hardware fault - the results show same behavior.

@AndyAyersMS
Copy link
Member

@oleg-loop54 as I was saying above, I wonder if the performance issue is somewhere else.

You might consider, if possible, capturing slow and fast run profiles with perfcollect and sharing those (we can find ways to keep the files private if that's a concern).

Likely 15 seconds or so should be enough.

@AndyAyersMS
Copy link
Member

Here are some other things you might try, if you have the time to experiment. These may reduce overall performance but may also reduce variability:

  1. Disable PGO: DOTNET_TieredPGO=0
  2. Disable OSR: DOTNET_TC_OnStackReplacement=0
  3. Disable both PGO and OSR
  4. Disable Tiering: DOTNET_TieredCompilation=0 (this implicitly disables both PGO and OSR, since they rely on tiering)

If one or more of these is stable then that might help us focus subsequent investigations.

@AndyAyersMS
Copy link
Member

@oleg-loop54 it's been a few months, any news to report?

@oleg-loop54
Copy link
Author

Sorry, been occupied with other stuff.
One thing is that I've managed to run the app with disabled PGO (just point 1 in your list) and compiled in ReadyToRun format. I noticed that variance in run times was lower, but still quite high (I don't have the numbers, unfortunately).

Will try to get back to you with perfcollect profiles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

5 participants