
Skynet failures on larger machines #16652

Closed
0xdaryl opened this issue Feb 2, 2023 · 10 comments · Fixed by #16713

Comments


0xdaryl commented Feb 2, 2023

The Skynet test fails readily (maybe 1 in 5 runs) in a variety of different ways when run on larger machines. For example, an Intel Cascade Lake with 112 hardware threads.

When a failure occurs, it occurs almost immediately.

As the hardware thread count is lowered (by switching machines, or even via numactl), the failures disappear or become much more intermittent.

java -version

openjdk version "19.0.2-internal" 2023-01-17
OpenJDK Runtime Environment (build 19.0.2-internal-adhoc.jenkins.openj9-openjdk-jdk19)
Eclipse OpenJ9 VM (build master-b67e11a33, JRE 19 Linux amd64-64-Bit Compressed References 20230202_000000 (JIT enabled, AOT enabled)
OpenJ9   - 730f41024
OMR      - e5f74776c
JCL      - be0b953d126 based on jdk-19.0.2+7)
java -Xshareclasses:none --enable-preview -Xnocompressedrefs -Xmx512M Skynet

Example failures:

1. GC crashes/asserts

Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FFB68C674E0 Handler2=00007FFB6A11CB40 InaccessibleAddress=0000000000000000
RDI=0000000000000001 RSI=0000000000000400 RAX=0000000000000000 RBX=00007FF895F7B920
RCX=00007FFB5C6A00F8 RDX=00007FFB647D5C00 R8=00007FFB64874740 R9=0000000000000113
R10=00007FF95C010C00 R11=0000000000000014 R12=0000000000000112 R13=00007FFB5CA51C00
R14=00007FFB6485AC95 R15=00007FFB3D142468
RIP=00007FFB68CF23FF GS=0000 FS=0000 RSP=00007FF895F7B5C0
EFlags=0000000000010246 CS=0033 RBP=00007FF87404A340 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000000
xmm0 00007ffb6485a9f5 (f: 1686481408.000000, d: 6.952378e-310)
xmm1 00007ff87404a2c8 (f: 1946460928.000000, d: 6.951754e-310)
xmm2 00007ffb6434511a (f: 1681150208.000000, d: 6.952378e-310)
xmm3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm4 00007ff87404a360 (f: 1946461056.000000, d: 6.951754e-310)
xmm5 00007ff89c010b08 (f: 2617314048.000000, d: 6.951788e-310)
xmm6 00007ffb6485a9f5 (f: 1686481408.000000, d: 6.952378e-310)
xmm7 00007ffb6433fd00 (f: 1681128704.000000, d: 6.952378e-310)
xmm8 00007ff95c010821 (f: 1543571456.000000, d: 6.951947e-310)
xmm9 00007ffb643c43d0 (f: 1681671168.000000, d: 6.952378e-310)
xmm10 00007ff89c010b08 (f: 2617314048.000000, d: 6.951788e-310)
xmm11 00007ff900000000 (f: 0.000000, d: 6.951870e-310)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000063dbcfcc (f: 1675350016.000000, d: 8.277329e-315)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 00007ffb68e3aaf0 (f: 1759750912.000000, d: 6.952382e-310)
Module=/home/daryl/20230202/jdk/lib/default/libj9vm29.so
Module_base_address=00007FFB68C29000
Target=2_90_20230202_000000 (Linux 3.10.0-1160.83.1.el7.x86_64)
CPU=amd64 (96 logical CPUs) (0x2e7c242000 RAM)
----------- Stack Backtrace -----------
_ZN26VM_BytecodeInterpreterFull3runEP10J9VMThread+0xb17f (0x00007FFB68CF23FF [libj9vm29.so+0xc93ff])
bytecodeLoopFull+0xad (0x00007FFB68CE726D [libj9vm29.so+0xbe26d])
 (0x00007FFB68D79602 [libj9vm29.so+0x150602])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2023/02/02 06:59:24 - please wait.
java: /home/jenkins/daryl/latest/openj9-openjdk-jdk19/openj9/runtime/gc_glue_java/MarkingDelegate.hpp:129: GC_ObjectScanner* MM_MarkingDelegate::getObjectScanner(MM_EnvironmentBase*, omrobjectptr_t, void*, MM_MarkingSchemeScanReason, uintptr_t*): Assertion `false && ((uintptr_t)0x99669966 == clazz->eyecatcher)' failed.
JVMDUMP039I Processing dump event "abort", detail "" at 2023/02/02 06:59:24 - please wait.
2. java.lang.IncompatibleClassChangeError
Exception in thread "" java.lang.IncompatibleClassChangeError
	at java.base/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3745)
	at java.base/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3690)
	at java.base/java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:397)
	at java.base/java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:886)
	at Skynet$Channel.receive(Skynet.java:110)
	at Skynet.skynet(Skynet.java:78)
	at Skynet.lambda$skynet$1(Skynet.java:74)
	at java.base/java.lang.VirtualThread.run(VirtualThread.java:292)
	at java.base/java.lang.VirtualThread$VThreadContinuation.lambda$new$0(VirtualThread.java:179)
	at java.base/jdk.internal.vm.Continuation.execute(Continuation.java:170)
3. Stack walker crash
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FE57B908660 Handler2=00007FE580E95B40 InaccessibleAddress=0000000000000000
RDI=00007FE2D8003400 RSI=00007FE53A32B738 RAX=0000000000000000 RBX=00007FE2D8003400
RCX=0000000000000003 RDX=00007FE57C7D8000 R8=00007FE573CAE2E0 R9=00007FE5731A3500
R10=00007FE5546C96FC R11=0000000000000000 R12=0000000000000020 R13=00007FE57C7D8200
R14=0000000000000008 R15=00007FE57C3C7E00
RIP=00007FE574B881F0 GS=0000 FS=0000 RSP=00007FE3A03E49A0
EFlags=0000000000010206 CS=0033 RBP=00007FE5731A3500 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000000
xmm0 00007fe368089fa0 (f: 1745395584.000000, d: 6.947288e-310)
xmm1 00007fe36808a1e0 (f: 1745396224.000000, d: 6.947288e-310)
xmm2 00007fe57baa6203 (f: 2074763776.000000, d: 6.947729e-310)
xmm3 00007fe554bb9230 (f: 1421578752.000000, d: 6.947697e-310)
xmm4 00007fe300000000 (f: 0.000000, d: 6.947202e-310)
xmm5 0000000000000001 (f: 1.000000, d: 4.940656e-324)
xmm6 00007fe57c825700 (f: 2088916736.000000, d: 6.947730e-310)
xmm7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm8 00007fe2d80092d1 (f: 3623916288.000000, d: 6.947169e-310)
xmm9 00007fe57c85a840 (f: 2089134080.000000, d: 6.947730e-310)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 00007fe200000000 (f: 0.000000, d: 6.946990e-310)
xmm12 00007fe2d8009198 (f: 3623916032.000000, d: 6.947169e-310)
xmm13 00007fe2d8009238 (f: 3623916032.000000, d: 6.947169e-310)
xmm14 00007fe5546a8a30 (f: 1416268288.000000, d: 6.947696e-310)
xmm15 00007fe5543be248 (f: 1413210752.000000, d: 6.947696e-310)
Module=/home/daryl/20230127/jdk/lib/default/libj9jit29.so
Module_base_address=00007FE5741AE000
Target=2_90_20230123_000000 (Linux 3.10.0-1160.80.1.el7.x86_64)
CPU=amd64 (96 logical CPUs) (0x2e7c242000 RAM)
----------- Stack Backtrace -----------
old_fast_jitLookupInterfaceMethod+0x50 (0x00007FE574B881F0 [libj9jit29.so+0x9da1f0])
 (0x00007FE574B95917 [libj9jit29.so+0x9e7917])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2023/01/27 19:04:00 - please wait.
Unhandled exception
Type=Segmentation error vmState=0x00020002
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FE57B9086A0 Handler2=00007FE580E95B40 InaccessibleAddress=0000000000000050
RDI=0000000000000000 RSI=00007FE573CAE2E0 RAX=00007FE5546C9600 RBX=00007FE512C997D0
RCX=00007FE5804EC829 RDX=00007FE53000D040 R8=0000000000000000 R9=00007FE58042D2CD
R10=00007FE368089FA0 R11=0000000000000206 R12=0000000000000000 R13=0000000004600002
R14=0000000000700000 R15=0000000000000000
RIP=00007FE574B93E04 GS=0000 FS=0000 RSP=00007FE512C996F0
EFlags=0000000000010246 CS=0033 RBP=0000000000000001 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000050
xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm1 2525252525252525 (f: 623191360.000000, d: 9.532824e-130)
xmm2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm3 0000000000ff0000 (f: 16711680.000000, d: 8.256667e-317)
xmm4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm5 0000003000000020 (f: 32.000000, d: 1.018558e-312)
xmm6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 0000ff0000000000 (f: 0.000000, d: 1.385239e-309)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/home/daryl/20230127/jdk/lib/default/libj9jit29.so
Module_base_address=00007FE5741AE000
Target=2_90_20230123_000000 (Linux 3.10.0-1160.80.1.el7.x86_64)
CPU=amd64 (96 logical CPUs) (0x2e7c242000 RAM)
----------- Stack Backtrace -----------
jitWalkStackFrames+0x644 (0x00007FE574B93E04 [libj9jit29.so+0x9e5e04])
walkStackFrames+0xa6 (0x00007FE57B94AA26 [libj9vm29.so+0x80a26])
_ZN28GC_VMThreadStackSlotIterator9scanSlotsEP10J9VMThreadS1_PvPFvP8J9JavaVMPP8J9ObjectS2_P16J9StackWalkStatePKvEbb+0x3c (0x00007FE573F2EDBC [libj9gc_full29.so+0x43dbc])
_ZN14MM_RootScanner13scanOneThreadEP18MM_EnvironmentBaseP10J9VMThreadPv+0x10d (0x00007FE573F2612D [libj9gc_full29.so+0x3b12d])
_ZN14MM_RootScanner11scanThreadsEP18MM_EnvironmentBase+0xbf (0x00007FE573F2532F [libj9gc_full29.so+0x3a32f])
_ZN14MM_RootScanner9scanRootsEP18MM_EnvironmentBase+0x43 (0x00007FE573F280F3 [libj9gc_full29.so+0x3d0f3])
_ZN18MM_MarkingDelegate9scanRootsEP18MM_EnvironmentBaseb+0x125 (0x00007FE574086385 [libj9gc_full29.so+0x19b385])
_ZN19MM_ParallelMarkTask3runEP18MM_EnvironmentBase+0x79 (0x00007FE5740ACD39 [libj9gc_full29.so+0x1c1d39])
_ZN21MM_ParallelDispatcher16workerEntryPointEP18MM_EnvironmentBase+0x228 (0x00007FE574018A38 [libj9gc_full29.so+0x12da38])
_Z23dispatcher_thread_proc2P14OMRPortLibraryPv+0x109 (0x00007FE574018249 [libj9gc_full29.so+0x12d249])
omrsig_protect+0x2b1 (0x00007FE580E96871 [libj9prt29.so+0x2a871])
dispatcher_thread_proc+0x3f (0x00007FE57401737F [libj9gc_full29.so+0x12c37f])
thread_wrapper+0x163 (0x00007FE580E5E333 [libj9thr29.so+0xb333])
start_thread+0xc5 (0x00007FE5809B9EA5 [libpthread.so.0+0x7ea5])
clone+0x6d (0x00007FE5804DEB0D [libc.so.6+0xfeb0d])
---------------------------------------

0xdaryl commented Feb 2, 2023

FYI @fengxue-IS @tajila @pshipton


tajila commented Feb 2, 2023

@fengxue-IS Can you see if you can increase reproducibility on smaller machines by setting -Djdk.virtualThreadScheduler.parallelism/maxPoolSize/minRunnable to high values?
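For example, combining the repro invocation from the issue description with the suggested scheduler properties might look like this (the value 112 mirrors the Cascade Lake thread count; exact values are illustrative, not prescriptive):

```shell
# Hypothetical repro attempt on a small machine: oversubscribe the
# virtual-thread scheduler so it behaves like a high-thread-count host.
java -Xshareclasses:none --enable-preview -Xnocompressedrefs -Xmx512M \
  -Djdk.virtualThreadScheduler.parallelism=112 \
  -Djdk.virtualThreadScheduler.maxPoolSize=112 \
  -Djdk.virtualThreadScheduler.minRunnable=112 \
  Skynet
```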

@fengxue-IS fengxue-IS self-assigned this Feb 2, 2023
@fengxue-IS
Contributor

I was able to reproduce this locally on an 8-core xlinux machine using -Djdk.virtualThreadScheduler.parallelism=112

----------- Stack Backtrace -----------
_ZN26MM_MarkingSchemeRootMarker11doStackSlotEPP8J9ObjectPvPKv+0x39a (0x00007FFAFAC5029A [libj9gc29.so+0x1a729a])
walkJITFrameSlots.constprop.0+0xb7 (0x00007FFAFB643C67 [libj9jit29.so+0x8d1c67])
jitWalkFrame+0x2fd (0x00007FFAFB6440ED [libj9jit29.so+0x8d20ed])
jitWalkStackFrames+0x10b4 (0x00007FFAFB645644 [libj9jit29.so+0x8d3644])
walkStackFrames+0xaa (0x00007FFAFBE6258A [libj9vm29.so+0x8158a])
_ZN28GC_VMThreadStackSlotIterator9scanSlotsEP10J9VMThreadS1_PvPFvP8J9JavaVMPP8J9ObjectS2_P16J9StackWalkStatePKvEbb+0x40 (0x00007FFAFAAEE560 [libj9gc29.so+0x45560])
_ZN14MM_RootScanner13scanOneThreadEP18MM_EnvironmentBaseP10J9VMThreadPv+0x111 (0x00007FFAFAAE4EF1 [libj9gc29.so+0x3bef1])
_ZN14MM_RootScanner11scanThreadsEP18MM_EnvironmentBase+0xbf (0x00007FFAFAAE43FF [libj9gc29.so+0x3b3ff])
_ZN14MM_RootScanner9scanRootsEP18MM_EnvironmentBase+0x47 (0x00007FFAFAAE72A7 [libj9gc29.so+0x3e2a7])
_ZN18MM_MarkingDelegate9scanRootsEP18MM_EnvironmentBaseb+0x129 (0x00007FFAFAC4AD09 [libj9gc29.so+0x1a1d09])
_ZN19MM_ParallelMarkTask3runEP18MM_EnvironmentBase+0x79 (0x00007FFAFAC722B9 [libj9gc29.so+0x1c92b9])
_ZN21MM_ParallelDispatcher3runEP18MM_EnvironmentBaseP7MM_Taskm+0xa1 (0x00007FFAFABDB741 [libj9gc29.so+0x132741])
_ZN19MM_ParallelGlobalGC7markAllEP18MM_EnvironmentBaseb+0x12d (0x00007FFAFABF76BD [libj9gc29.so+0x14e6bd])
_ZN19MM_ParallelGlobalGC24mainThreadGarbageCollectEP18MM_EnvironmentBaseP22MM_AllocateDescriptionbb+0x15d (0x00007FFAFABF805D [libj9gc29.so+0x14f05d])
_ZN15MM_ConcurrentGC22internalGarbageCollectEP18MM_EnvironmentBaseP17MM_MemorySubSpaceP22MM_AllocateDescription+0x2a (0x00007FFAFAC7C4FA [libj9gc29.so+0x1d34fa])
_ZN12MM_Collector14garbageCollectEP18MM_EnvironmentBaseP17MM_MemorySubSpaceP22MM_AllocateDescriptionjP28MM_ObjectAllocationInterfaceS3_P20MM_AllocationContext+0xba (0x00007FFAFABB6F6A [libj9gc29.so+0x10df6a])
_ZN15MM_ConcurrentGC25concurrentFinalCollectionEP18MM_EnvironmentBaseP17MM_MemorySubSpace+0x119 (0x00007FFAFAC7FBB9 [libj9gc29.so+0x1d6bb9])
_ZN15MM_ConcurrentGC14concurrentMarkEP18MM_EnvironmentBaseP17MM_MemorySubSpaceP22MM_AllocateDescription+0x470 (0x00007FFAFAC803D0 [libj9gc29.so+0x1d73d0])
_ZN15MM_ConcurrentGC16payAllocationTaxEP18MM_EnvironmentBaseP17MM_MemorySubSpaceS3_P22MM_AllocateDescription+0x6f (0x00007FFAFAC809CF [libj9gc29.so+0x1d79cf])
_Z21OMR_GC_AllocateObjectP12OMR_VMThreadP25MM_AllocateInitialization+0x1d5 (0x00007FFAFABE88D5 [libj9gc29.so+0x13f8d5])
J9AllocateObject+0x46e (0x00007FFAFAAF4ACE [libj9gc29.so+0x4bace])
old_slow_jitNewObject+0x19c (0x00007FFAFB632C6C [libj9jit29.so+0x8c0c6c])
 (0x00007FFAFB647851 [libj9jit29.so+0x8d5851])
---------------------------------------

The core file was not generated correctly due to a machine issue; I will reproduce again to look into the core file data more closely.


amicic commented Feb 7, 2023

Looking forward to more stack traces, to see whether this is specific to concurrent global GC...

But if this is scanning a carrier thread with a mounted virtual thread, one of the first generic questions from a GC perspective is whether the JIT ensured that both the carrier and virtual threads are at a safe point for walking their stacks (i.e., calling J9AllocateObject rather than J9AllocateObjectNoGC)?

@LinHu2016
Contributor

I have figured out what caused this issue.
It is a synchronization issue between GC continuation scanning and continuation mounting. It can only happen in gencon (with concurrent marking enabled, which is the default, or with both concurrent marking and concurrent scavenger enabled). It is not directly related to machines with more CPUs, but more native threads give more chances to expose the issue.

A temporary workaround is to disable concurrent marking via -Xgc:noConcurrentMarkKO.
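Applied to the repro command from the issue description, the workaround invocation would be (a sketch; the flag spelling is as given above):

```shell
# Temporary workaround: same repro invocation as in the issue description,
# with concurrent marking disabled.
java -Xshareclasses:none --enable-preview -Xnocompressedrefs -Xmx512M \
  -Xgc:noConcurrentMarkKO Skynet
```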

We use "one-way synchronization" between concurrent scanning and mounting instead of a full mutex: a concurrent-marking continuation scan blocks mounting of the same continuation object, and the mounting thread releases VM access while it waits. This opens a second race, for VM access, between a scavenge (STW) and the mounting thread reacquiring VM access once the concurrent-mark continuation scan completes. If the scavenger wins exclusive VM access, the mounting thread is still blocked (it has not yet swapped the Java stack), but j9vmcontinuation.state has already been marked "mounted". This causes the GC (scavenger) to skip the continuation scan, so the related references on the Java stack are not updated/copy-forwarded. With concurrent scavenger enabled, there are more chances to expose the problem.
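A toy model of the interleaving described above (this is not OpenJ9 code; the class, state names, and the stackSwapped flag are invented for illustration, and latches force the bad ordering deterministically rather than relying on timing):

```java
import java.util.concurrent.CountDownLatch;

public class MountScanRace {
    // Hypothetical two-state stand-in for j9vmcontinuation.state
    enum State { UNMOUNTED, MOUNTED }

    static volatile State state = State.UNMOUNTED;
    static volatile boolean stackSwapped = false;

    public static void main(String[] args) throws Exception {
        CountDownLatch markedMounted = new CountDownLatch(1);
        CountDownLatch gcDone = new CountDownLatch(1);

        Thread mounting = new Thread(() -> {
            // Step 1: the mount path publishes state = MOUNTED first...
            state = State.MOUNTED;
            markedMounted.countDown();
            try { gcDone.await(); } catch (InterruptedException e) { return; }
            // Step 3: ...but the Java stack is only swapped later,
            // after the scavenger has already run.
            stackSwapped = true;
        });

        Thread scavenger = new Thread(() -> {
            try { markedMounted.await(); } catch (InterruptedException e) { return; }
            // Step 2: GC sees MOUNTED and skips the continuation scan,
            // even though the stack swap has not happened yet.
            if (state == State.MOUNTED && !stackSwapped) {
                System.out.println("scan skipped while stack not yet swapped");
            }
            gcDone.countDown();
        });

        mounting.start(); scavenger.start();
        mounting.join(); scavenger.join();
    }
}
```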

I will think about a solution for both the default gencon case and the concurrent-scavenger-enabled case. Please assign the issue to me.


babsingh commented Mar 13, 2023

@fengxue-IS Can you confirm via a grinder whether this is just a perf issue now? It will be a perf issue if the failure goes away by applying the following workaround: -Xnocompressedrefs -Xjit:disableCodeCacheReclamation -Xgc:noConcurrentMarkKO. We can also increase the timeout of the benchmark to avoid timeouts. This will allow us to reuse the grinder to substantiate #16728 as a perf issue.

Note: Perf issues should not block the JDK19 release since we have decided to address perf issues closer to the JEP's GA milestone.

@fengxue-IS
Contributor

The grinder is seeing an assertion failure in the GC with the following options:
-Xnocompressedrefs -Xjit:disableCodeCacheReclamation -Xgc:noConcurrentMarkKO -Djdk.virtualThreadScheduler.parallelism=112

20:35:05.709 0x7f4cac003a00    j9mm.107    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK19_x86-64_linux_Nightly/openj9/runtime/gc_glue_java/MarkingDelegate.hpp:129: ((false && ((uintptr_t)0x99669966 == clazz->eyecatcher)))
20:35:05.709 0x7f4c7c03cc00    j9mm.107    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK19_x86-64_linux_Nightly/openj9/runtime/gc_glue_java/MarkingDelegate.hpp:129: ((false && ((uintptr_t)0x99669966 == clazz->eyecatcher)))

@LinHu2016 can you take a look at the output

@dmitripivkine
Contributor

In general this assertion means that global mark discovered an object pointer (in the range of the heap) which no longer points to an object (stale?). These objects are taken from a work packet, so it is relatively easy to figure out which one. Searching the core for the bad pointer might help to understand what was being scanned when it was discovered and put into the work packet. If you need to investigate such details and have a system core, please let me know.

@LinHu2016
Contributor

The failure @fengxue-IS noticed occurred in a run with -Xgc:noConcurrentMarkKO, so it does not look related to the "pending to be mounted" case that caused the originally reported issue. I will try to reproduce/debug the issue locally.

@pshipton
Member

Keep this open until we have a fix in the 0.37 stream.
