Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configure the wait time for checkpoint safety #18489

Merged
merged 5 commits into from
Dec 7, 2023

Conversation

singh264
Copy link
Contributor

@singh264 singh264 commented Nov 20, 2023

Configure the checkpoint safety wait time with
-XX:maxRetryForNotCheckpointSafe= and
-XX:sleepMillisecondsForNotCheckpointSafe=, and increase
the default checkpoint safety wait time from 1 ms to 1
s.

Configure the checkpoint safety wait time for the
re-enabled MethodTypeDeadlockTest to be 2 s to allow the
threads to initialize a class, and fix the failure in
the MethodTypeDeadlockTest that occurs due to an
insufficient checkpoint safety wait time for the threads
initializing a class.

Fix the race condition in MethodTypeDeadlockTest for the
shared resource testResult.lockStatus by modifying its
type from volatile int to AtomicInteger, and avoid the
test from timing out when the main thread is blocked
while waiting for testResult.lockStatus to reflect the
count of the worker threads that were started by the
main thread.

Issue: #15806
Signed-off-by: Amarpreet Singh [email protected]

@@ -844,7 +851,12 @@ Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl(JNIEnv *env,
for (UDATA i = 0; (0 != notSafeToCheckpoint) && (i <= maxRetries); i++) {
releaseSafeOrExcusiveVMAccess(currentThread, vmFuncs, safePoint);
vmFuncs->internalExitVMToJNI(currentThread);
omrthread_nanosleep(10000);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I observe that the thread is unable to sleep for less than 1000000 ns (1 ms) on a Linux x86_64 machine.

@singh264 singh264 marked this pull request as ready for review November 20, 2023 15:23
@singh264
Copy link
Contributor Author

@tajila requesting your review.

@singh264
Copy link
Contributor Author

@tajila requesting your review.

@tajila tajila self-requested a review November 22, 2023 15:21
@@ -690,6 +691,12 @@ Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl(JNIEnv *env,
UDATA success = 0;
bool safePoint = J9_ARE_ANY_BITS_SET(vm->extendedRuntimeFlags, J9_EXTENDED_RUNTIME_OSR_SAFE_POINT);
UDATA maxRetries = vm->checkpointState.maxRetryForNotCheckpointSafe;
U_64 sleepNanoseconds = vm->checkpointState.sleepNanosecondsForNotCheckpointSafe;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think we need this complexity. By toggling max retry and nanosleep we can express the flexibility needed for this.

@@ -844,7 +851,12 @@ Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl(JNIEnv *env,
for (UDATA i = 0; (0 != notSafeToCheckpoint) && (i <= maxRetries); i++) {
releaseSafeOrExcusiveVMAccess(currentThread, vmFuncs, safePoint);
vmFuncs->internalExitVMToJNI(currentThread);
omrthread_nanosleep(10000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use omrthread_sleep and set it to a parameter for which the default is 10ms

@@ -3926,6 +3926,9 @@ processVMArgsFromFirstToLast(J9JavaVM * vm)
vm->checkpointState.lastRestoreTimeMillis = -1;
/* Its unclear if we need an option for this, so we can keep the init here for the time being */
vm->checkpointState.maxRetryForNotCheckpointSafe = 100;
vm->checkpointState.sleepNanosecondsForNotCheckpointSafe = 10000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you only need sleepNanosecondsForNotCheckpointSafe. But this can be milliseconds instead.

@singh264
Copy link
Contributor Author

i dont think we need this complexity. By toggling max retry and nanosleep we can express the flexibility needed for this.

I have simplified the code that increases the wait time for checkpoint safety from 1 ms to 2 s by sleeping for a fixed time per checkpoint safety check.

@singh264 singh264 requested a review from tajila November 30, 2023 22:17
@tajila
Copy link
Contributor

tajila commented Nov 30, 2023

You can add two options

-XX:maxRetryForNotCheckpointSafe=[intVal]
-XX:sleepMillisecondsForNotCheckpointSafe=[intVal]

And then we can increase those values (higher than default) for this test

@singh264
Copy link
Contributor Author

singh264 commented Dec 2, 2023

What would be the default, lower bound, and upper bound of the maxRetryForNotCheckpointSafe and sleepMillisecondsForNotCheckpointSafe user supplied values?

Would throwing an error be appropriate if the maxRetryForNotCheckpointSafe and sleepMillisecondsForNotCheckpointSafe user supplied values are outside the lower bound and upper bound?

@tajila
Copy link
Contributor

tajila commented Dec 2, 2023

What would be the default, lower bound, and upper bound of the maxRetryForNotCheckpointSafe and sleepMillisecondsForNotCheckpointSafe user supplied values?

0 <= maxRetryForNotCheckpointSafe < ∞

1 <= sleepMillisecondsForNotCheckpointSafe < ∞

Would throwing an error be appropriate if the maxRetryForNotCheckpointSafe and sleepMillisecondsForNotCheckpointSafe user supplied values are outside the lower bound and upper bound?

Ya we can throw the standard errors. Note, we won't be documenting these options. Its not something we intend users will need. The default should be enough.

@singh264
Copy link
Contributor Author

singh264 commented Dec 4, 2023

A stacktrace of MethodTypeDeadlockTest timing out due to the main thread getting blocked by a race condition for testResult.lockStatus:

Testing: Create and Restore Criu Checkpoint Image once - MethodTypeDeadlockTest
Test start time: 2023/12/01 11:16:46 Pacific Standard Time
Running command: bash /root/openj9_issues_15806/openj9-openjdk-jdk11/openj9/test/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /root/openj9_issues_15806/openj9-openjdk-jdk11/openj9/test/TKG/../../jvmtest/functional/cmdLineTests/criu /root/openj9_issues_15806/openj9-openjdk-jdk11/build/linux-x86_64-normal-server-release/images/jdk/bin/java " -Xjit:count=0 -XX:+CRIURestoreNonPortableMode -Xdump:system+heap+java+snap:events=abort -XX:+ThrowOnDelayedCheckpointOperation -Xtrace:print=j9criu.17 --add-opens java.base/jdk.internal.misc=ALL-UNNAMED  --add-exports java.base/openj9.internal.criu=ALL-UNNAMED" org.openj9.criu.DeadlockTest MethodTypeDeadlockTest 1
Time spent starting: 30 milliseconds
***[TEST INFO 2023/12/01 11:21:46] ProcessKiller detected a timeout after 300000 milliseconds!***
***[TEST INFO 2023/12/01 11:21:46] executing /usr/bin/gdb -batch -x /tmp/debugger1667586218156117718.txt bash 3826904***
GDB OUT 0x00007fce14e37c3a in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
GDB OUT From                To                  Syms Read   Shared Object Library
GDB OUT 0x00007fce14f5b720  0x00007fce14f692ec  Yes (*)     /lib/x86_64-linux-gnu/libtinfo.so.6
GDB OUT 0x00007fce14f48220  0x00007fce14f49179  Yes (*)     /lib/x86_64-linux-gnu/libdl.so.2
GDB OUT 0x00007fce14d77630  0x00007fce14eec29d  Yes (*)     /lib/x86_64-linux-gnu/libc.so.6
GDB OUT 0x00007fce14f8a100  0x00007fce14fac684  Yes (*)     /lib64/ld-linux-x86-64.so.2
GDB OUT (*): Shared library is missing debugging information.
GDB OUT rax            0xfffffffffffffe00  -512
GDB OUT rbx            0x0                 0
GDB OUT rcx            0x7fce14e37c3a      140523090443322
GDB OUT rdx            0x0                 0
GDB OUT rsi            0x7fff0cd2cbd0      140733408529360
GDB OUT rdi            0xffffffff          4294967295
GDB OUT rbp            0x0                 0x0
GDB OUT rsp            0x7fff0cd2cbb8      0x7fff0cd2cbb8
GDB OUT r8             0x0                 0
GDB OUT r9             0x7                 7
GDB OUT r10            0x0                 0
GDB OUT r11            0x246               582
GDB OUT r12            0x1                 1
GDB OUT r13            0x7fff0cd2ccb0      140733408529584
GDB OUT r14            0x0                 0
GDB OUT r15            0x0                 0
GDB OUT rip            0x7fce14e37c3a      0x7fce14e37c3a <wait4+26>
GDB OUT eflags         0x246               [ PF ZF IF ]
GDB OUT cs             0x33                51
GDB OUT ss             0x2b                43
GDB OUT ds             0x0                 0
GDB OUT es             0x0                 0
GDB OUT fs             0x0                 0
GDB OUT gs             0x0                 0
GDB OUT k0             0x0                 0
GDB OUT k1             0x0                 0
GDB OUT k2             0x0                 0
GDB OUT k3             0x0                 0
GDB OUT k4             0x0                 0
GDB OUT k5             0x0                 0
GDB OUT k6             0x0                 0
GDB OUT k7             0x0                 0
GDB OUT   Id   Target Id              Frame 
GDB OUT * 1    process 3826904 "bash" 0x00007fce14e37c3a in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
GDB OUT 
GDB OUT Thread 1 (process 3826904 "bash"):
GDB OUT #0  0x00007fce14e37c3a in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
GDB OUT No symbol table info available.
GDB OUT #1  0x00005628a0c32f03 in ?? ()
GDB OUT No symbol table info available.
GDB OUT #2  0x00005628a0c34373 in wait_for ()
GDB OUT No symbol table info available.
GDB OUT #3  0x00005628a0c21216 in execute_command_internal ()
GDB OUT No symbol table info available.
GDB OUT #4  0x00005628a0c219da in execute_command ()
GDB OUT No symbol table info available.
GDB OUT #5  0x00005628a0c09665 in reader_loop ()
GDB OUT No symbol table info available.
GDB OUT #6  0x00005628a0c07ef9 in main ()
GDB OUT No symbol table info available.
GDB OUT Saved corefile core.3826904.1
GDB OUT [Inferior 1 (process 3826904) detached]
INFO: Running '/usr/bin/gdb' failed with rc = 1
GDB ERR warning: target file /proc/3826904/cmdline contained unexpected null characters
GDB ERR warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
GDB ERR /tmp/debugger1667586218156117718.txt:6: Error in sourced command file:
GDB ERR Requires argument (inferior id(s) to detach)
INFO: Sleep for 60000 millis before next capture.
***[TEST INFO 2023/12/01 11:22:47] executing /usr/bin/gdb -batch -x /tmp/debugger1667586218156117718.txt bash 3826904***
GDB OUT 0x00007fce14e37c3a in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
GDB OUT From                To                  Syms Read   Shared Object Library
GDB OUT 0x00007fce14f5b720  0x00007fce14f692ec  Yes (*)     /lib/x86_64-linux-gnu/libtinfo.so.6
GDB OUT 0x00007fce14f48220  0x00007fce14f49179  Yes (*)     /lib/x86_64-linux-gnu/libdl.so.2
GDB OUT 0x00007fce14d77630  0x00007fce14eec29d  Yes (*)     /lib/x86_64-linux-gnu/libc.so.6
GDB OUT 0x00007fce14f8a100  0x00007fce14fac684  Yes (*)     /lib64/ld-linux-x86-64.so.2
GDB OUT (*): Shared library is missing debugging information.
GDB OUT rax            0xfffffffffffffe00  -512
GDB OUT rbx            0x0                 0
GDB OUT rcx            0x7fce14e37c3a      140523090443322
GDB OUT rdx            0x0                 0
GDB OUT rsi            0x7fff0cd2cbd0      140733408529360
GDB OUT rdi            0xffffffff          4294967295
GDB OUT rbp            0x0                 0x0
GDB OUT rsp            0x7fff0cd2cbb8      0x7fff0cd2cbb8
GDB OUT r8             0x0                 0
GDB OUT r9             0x7                 7
GDB OUT r10            0x0                 0
GDB OUT r11            0x246               582
GDB OUT r12            0x1                 1
GDB OUT r13            0x7fff0cd2ccb0      140733408529584
GDB OUT r14            0x0                 0
GDB OUT r15            0x0                 0
GDB OUT rip            0x7fce14e37c3a      0x7fce14e37c3a <wait4+26>
GDB OUT eflags         0x246               [ PF ZF IF ]
GDB OUT cs             0x33                51
GDB OUT ss             0x2b                43
GDB OUT ds             0x0                 0
GDB OUT es             0x0                 0
GDB OUT fs             0x0                 0
GDB OUT gs             0x0                 0
GDB OUT k0             0x0                 0
GDB OUT k1             0x0                 0
GDB OUT k2             0x0                 0
GDB OUT k3             0x0                 0
GDB OUT k4             0x0                 0
GDB OUT k5             0x0                 0
GDB OUT k6             0x0                 0
GDB OUT k7             0x0                 0
GDB OUT   Id   Target Id              Frame 
GDB OUT * 1    process 3826904 "bash" 0x00007fce14e37c3a in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
GDB OUT 
GDB OUT Thread 1 (process 3826904 "bash"):
GDB OUT #0  0x00007fce14e37c3a in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
GDB OUT No symbol table info available.
GDB OUT #1  0x00005628a0c32f03 in ?? ()
GDB OUT No symbol table info available.
GDB OUT #2  0x00005628a0c34373 in wait_for ()
GDB OUT No symbol table info available.
GDB OUT #3  0x00005628a0c21216 in execute_command_internal ()
GDB OUT No symbol table info available.
GDB OUT #4  0x00005628a0c219da in execute_command ()
GDB OUT No symbol table info available.
GDB OUT #5  0x00005628a0c09665 in reader_loop ()
GDB OUT No symbol table info available.
GDB OUT #6  0x00005628a0c07ef9 in main ()
GDB OUT No symbol table info available.
GDB OUT Saved corefile core.3826904.2
GDB OUT A debugging session is active.
GDB OUT 
GDB OUT   Inferior 1 [process 3826904] will be detached.
GDB OUT 
GDB OUT Quit anyway? (y or n) [answered Y; input not from terminal]
GDB OUT [Inferior 1 (process 3826904) detached]
***[TEST INFO 2023/12/01 11:22:47] executing kill -ABRT 3826904***
***[TEST INFO 2023/12/01 11:22:47] kill -ABRT signal sent***
Time spent executing: 360747 milliseconds
Test result: FAILED

>!threads
    !stack 0x0001aa00   !j9vmthread 0x0001aa00  !j9thread 0x7f7dc0007660    tid 0x3a5765 (3823461) // (main) <-----
    !stack 0x0001e600   !j9vmthread 0x0001e600  !j9thread 0x7f7dc00d81c0    tid 0x3a576b (3823467) // (JIT Compilation Thread-000)
    !stack 0x00022300   !j9vmthread 0x00022300  !j9thread 0x7f7dc00d8738    tid 0x3a576c (3823468) // (JIT Compilation Thread-001 Suspended)
    !stack 0x00025f00   !j9vmthread 0x00025f00  !j9thread 0x7f7dc00dc2f0    tid 0x3a576d (3823469) // (JIT Compilation Thread-002 Suspended)
    !stack 0x00029c00   !j9vmthread 0x00029c00  !j9thread 0x7f7dc00dc868    tid 0x3a576e (3823470) // (JIT Compilation Thread-003 Suspended)
    !stack 0x0002d800   !j9vmthread 0x0002d800  !j9thread 0x7f7dc00dd800    tid 0x3a576f (3823471) // (JIT Compilation Thread-004 Suspended)
    !stack 0x00031500   !j9vmthread 0x00031500  !j9thread 0x7f7dc00ddd78    tid 0x3a5770 (3823472) // (JIT Compilation Thread-005 Suspended)
    !stack 0x00035100   !j9vmthread 0x00035100  !j9thread 0x7f7dc00ded10    tid 0x3a5771 (3823473) // (JIT Compilation Thread-006 Suspended)
    !stack 0x00038e00   !j9vmthread 0x00038e00  !j9thread 0x7f7dc00df288    tid 0x3a5772 (3823474) // (JIT Compilation Thread-007 Suspended)
    !stack 0x0003ca00   !j9vmthread 0x0003ca00  !j9thread 0x7f7dc00e0220    tid 0x3a5773 (3823475) // (JIT Compilation Thread-008 Suspended)
    !stack 0x00040700   !j9vmthread 0x00040700  !j9thread 0x7f7dc00e0798    tid 0x3a5774 (3823476) // (JIT Compilation Thread-009 Suspended)
    !stack 0x00044300   !j9vmthread 0x00044300  !j9thread 0x7f7dc00e1730    tid 0x3a5775 (3823477) // (JIT Compilation Thread-010 Suspended)
    !stack 0x00048000   !j9vmthread 0x00048000  !j9thread 0x7f7dc00e1ca8    tid 0x3a5776 (3823478) // (JIT Compilation Thread-011 Suspended)
    !stack 0x0004bc00   !j9vmthread 0x0004bc00  !j9thread 0x7f7dc00e2c40    tid 0x3a5777 (3823479) // (JIT Compilation Thread-012 Suspended)
    !stack 0x0004f900   !j9vmthread 0x0004f900  !j9thread 0x7f7dc00e31b8    tid 0x3a5778 (3823480) // (JIT Compilation Thread-013 Suspended)
    !stack 0x00053500   !j9vmthread 0x00053500  !j9thread 0x7f7dc00e4150    tid 0x3a5779 (3823481) // (JIT Compilation Thread-014 Suspended)
    !stack 0x00057200   !j9vmthread 0x00057200  !j9thread 0x7f7dc00e46c8    tid 0x3a577a (3823482) // (JIT Diagnostic Compilation Thread-015 Suspended)
    !stack 0x0005ae00   !j9vmthread 0x0005ae00  !j9thread 0x7f7dc010c230    tid 0x3a577b (3823483) // (JIT-SamplerThread)
    !stack 0x00156c00   !j9vmthread 0x00156c00  !j9thread 0x7f7dc010c7a8    tid 0x3a577c (3823484) // (Common-Cleaner)
    !stack 0x0018b900   !j9vmthread 0x0018b900  !j9thread 0x7f7dc022cac8    tid 0x3a577f (3823487) // (Concurrent Mark Helper)
    !stack 0x0018f500   !j9vmthread 0x0018f500  !j9thread 0x7f7dc022d7e0    tid 0x3a5780 (3823488) // (GC Worker)
    !stack 0x00193200   !j9vmthread 0x00193200  !j9thread 0x7f7dc022dd58    tid 0x3a5781 (3823489) // (GC Worker)
    !stack 0x00196e00   !j9vmthread 0x00196e00  !j9thread 0x7f7dc022ea70    tid 0x3a5782 (3823490) // (GC Worker)
    !stack 0x0019ab00   !j9vmthread 0x0019ab00  !j9thread 0x7f7dc022efe8    tid 0x3a5783 (3823491) // (GC Worker)
    !stack 0x0019e700   !j9vmthread 0x0019e700  !j9thread 0x7f7dc022fd00    tid 0x3a5784 (3823492) // (GC Worker)
    !stack 0x001a2400   !j9vmthread 0x001a2400  !j9thread 0x7f7dc0230278    tid 0x3a5785 (3823493) // (GC Worker)
    !stack 0x001a6000   !j9vmthread 0x001a6000  !j9thread 0x7f7dc0230f90    tid 0x3a5786 (3823494) // (GC Worker)
    !stack 0x001dab00   !j9vmthread 0x001dab00  !j9thread 0x7f7d1c015d00    tid 0x3a5788 (3823496) // (Attach API wait loop)
    !stack 0x00212e00   !j9vmthread 0x00212e00  !j9thread 0x7f7cf8000cc0    tid 0x3a5793 (3823507) // (Finalizer thread)

> !stack 0x0001aa00
<1aa00>     !j9method 0x000000000007D230   java/lang/Thread.yield()V
<1aa00>     !j9method 0x000000000007D230   java/lang/Thread.yield()V
<1aa00>     !j9method 0x00000000001FC0A8   org/openj9/criu/DeadlockTest.methodTypeDeadlockTest()V <-----
<1aa00>     !j9method 0x00000000001FC048   org/openj9/criu/DeadlockTest.main([Ljava/lang/String;)V
<1aa00>                             JNI call-in frame
<1aa00>                             Native method frame

> !stackslots 0x0001aa00
<1aa00> *** BEGIN STACK WALK, flags = 00400001 walkThread = 109,056 ***
<1aa00>     ITERATE_O_SLOTS
<1aa00>     RECORD_BYTECODE_PC_OFFSET
<1aa00> Initial values: walkSP = 0x00000000001236C8, PC = 0x0000000000000006, literals = 0x0000000000000000, A0 = 0x0000000000123710, j2iFrame = 0x0000000000000000, decomp = 0x0000000000000000
<1aa00> JIT JNI call-out frame: bp = 0x00000000001236E8, sp = 0x00000000001236C8, pc = 0x0000000000000006, cp = 0x000000000007BF90, arg0EA = 0x0000000000123710, flags = 0x0000000020000000
<1aa00>     Method: java/lang/Thread.yield()V !j9method 0x000000000007D230
<1aa00> JIT frame: bp = 0x0000000000123708, pc = 0x00007F7DA3AE2C67, unwindSP = 0x00000000001236F0, cp = 0x000000000007BF90, arg0EA = 0x0000000000123708, jitInfo = 0x00007F7DA1DF5BB8
<1aa00>     Method: java/lang/Thread.yield()V !j9method 0x000000000007D230
<1aa00>     Bytecode index = -1, inlineDepth = 0, PC offset = 0x00000000000000AB
<1aa00>     stackMap=0x00007F7DA1DF5CB1, slots=I16(0x0000) parmBaseOffset=I16(0x0008), parmSlots=U16(0x0000), localBaseOffset=I16(0x0000)
<1aa00>     Described JIT temps starting at 0x0000000000123708 for IDATA(0x0000000000000000) slots
<1aa00>     JIT-RegisterMap = UDATA(0x0000000000000000)
<1aa00>     JIT-Frame-RegisterMap[0x00000000001236F8] = UDATA(0x000000000000000A) (jit_rbx)
<1aa00>     JIT-Frame-RegisterMap[0x0000000000123700] = UDATA(0x00000007FFFCB710) (jit_r9)
<1aa00> JIT frame: bp = 0x00000000001237B8, pc = 0x00007F7DA3AE5149, unwindSP = 0x0000000000123710, cp = 0x00000000001FB7A0, arg0EA = 0x00000000001237B8, jitInfo = 0x00007F7DA1DF4DF8
<1aa00>     Method: org/openj9/criu/DeadlockTest.methodTypeDeadlockTest()V !j9method 0x00000000001FC0A8
<1aa00>     Bytecode index = 87, inlineDepth = 0, PC offset = 0x0000000000000271 <-----
<1aa00>     stackMap=0x00007F7DA1DF4FDA, slots=I16(0x0000) parmBaseOffset=I16(0x0008), parmSlots=U16(0x0000), localBaseOffset=I16(0xFFB0)
<1aa00>     Described JIT temps starting at 0x0000000000123768 for IDATA(0x000000000000000A) slots
<1aa00>         I-Slot: : t9[0x0000000000123768] = 0x00000007FFFC3D38
<1aa00>         I-Slot: : t8[0x0000000000123770] = 0x00007F7DC74FCA68
<1aa00>         I-Slot: : t7[0x0000000000123778] = 0x00007F7DC7067790
<1aa00>         O-Slot: : t6[0x0000000000123780] = 0x0000000706138230
<1aa00>         I-Slot: : t5[0x0000000000123788] = 0x000000000001AA02
<1aa00>         I-Slot: : t4[0x0000000000123790] = 0x00000007FFFCB700
<1aa00>         O-Slot: : t3[0x0000000000123798] = 0x0000000706138250 <-----
<1aa00>         I-Slot: : t2[0x00000000001237A0] = 0x00000007FFFCB710
<1aa00>         I-Slot: : t1[0x00000000001237A8] = 0x00007F7DA3AE4B8D
<1aa00>         I-Slot: : t0[0x00000000001237B0] = 0x0000000000000202
<1aa00>     JIT-RegisterMap = UDATA(0x0000000000000000)
<1aa00>     JIT-Frame-RegisterMap[0x0000000000123750] = UDATA(0x0000002400000037) (jit_rbx)
<1aa00>     JIT-Frame-RegisterMap[0x0000000000123758] = UDATA(0x00007F7DC74FCA68) (jit_r9)
<1aa00> JIT frame: bp = 0x0000000000123808, pc = 0x00007F7DA3AE4B8D, unwindSP = 0x00000000001237C0, cp = 0x00000000001FB7A0, arg0EA = 0x0000000000123810, jitInfo = 0x00007F7DA1DF4BF8
<1aa00>     Method: org/openj9/criu/DeadlockTest.main([Ljava/lang/String;)V !j9method 0x00000000001FC048
<1aa00>     Bytecode index = 180, inlineDepth = 0, PC offset = 0x000000000000016D
<1aa00>     stackMap=0x00007F7DA1DF4D00, slots=I16(0x0001) parmBaseOffset=I16(0x0008), parmSlots=U16(0x0001), localBaseOffset=I16(0xFFF0)
<1aa00>     Described JIT args starting at 0x0000000000123810 for U16(0x0001) slots
<1aa00>         O-Slot: : a0[0x0000000000123810] = 0x0000000706138260
<1aa00>     Described JIT temps starting at 0x00000000001237F8 for IDATA(0x0000000000000002) slots
<1aa00>         O-Slot: : t1[0x00000000001237F8] = 0x0000000706138270
<1aa00>         O-Slot: : t0[0x0000000000123800] = 0x0000000706138270
<1aa00>     JIT-RegisterMap = UDATA(0x0000000000000000)
<1aa00>     JIT-Frame-RegisterMap[0x00000000001237E0] = UDATA(0x0000002400000037) (jit_rbx)
<1aa00>     JIT-Frame-RegisterMap[0x00000000001237E8] = UDATA(0x00007F7DC74FCA68) (jit_r9)
<1aa00> I2J values: PC = 0x00007F7DC7438834, A0 = 0x0000000000123840, walkSP = 0x0000000000123820, literals = 0x0000000000000000, JIT PC = 0x00007F7DC688E050, pcAddress = 0x00007F7DC74FCAC8, decomp = 0x0000000000000000
<1aa00> JNI call-in frame: bp = 0x0000000000123840, sp = 0x0000000000123820, pc = 0x00007F7DC7438834, cp = 0x0000000000000000, arg0EA = 0x0000000000123840, flags = 0x0000000000020000
<1aa00>     New ELS = 0x0000000000000000
<1aa00> JNI native method frame: bp = 0x00000000001238E8, sp = 0x0000000000123848, pc = 0x0000000000000007, cp = 0x0000000000000000, arg0EA = 0x00000000001238E8, flags = 0x0000000000000010
<1aa00>     Object pushes starting at 0x0000000000123848 for 16 slots
<1aa00>         Push[0x0000000000123848] = 0x0000000000000000
<1aa00>         Push[0x0000000000123850] = 0x0000000000000000
<1aa00>         Push[0x0000000000123858] = 0x0000000000000000
<1aa00>         Push[0x0000000000123860] = 0x0000000000000000
<1aa00>         Push[0x0000000000123868] = 0x0000000706138260
<1aa00>         Push[0x0000000000123870] = 0x00000007060F0BC8
<1aa00>         Push[0x0000000000123878] = 0x000000070611DE78
<1aa00>         Push[0x0000000000123880] = 0x000000070611DE78
<1aa00>         Push[0x0000000000123888] = 0x0000000706138280
<1aa00>         Push[0x0000000000123890] = 0x0000000000000000
<1aa00>         Push[0x0000000000123898] = 0x00000007061142C8
<1aa00>         Push[0x00000000001238A0] = 0x00000007060F4680
<1aa00>         Push[0x00000000001238A8] = 0x00000007060F45B0
<1aa00>         Push[0x00000000001238B0] = 0x00000007060F0340
<1aa00>         Push[0x00000000001238B8] = 0x00000007060F16F0
<1aa00>         Push[0x00000000001238C0] = 0x00000007060F0000
<1aa00> <end of stack>
<1aa00> *** END STACK WALK (rc = NONE) ***

> !bytecodes 0x00000000001FC0A8
  Name: methodTypeDeadlockTest
  Signature: ()V
  Access Flags (18260009): public static 
  Internal Attribute Flags:
  Max Stack: 7
  Caught Exceptions (1):
     start   end   handler   catch type
     -----   ---   -------   ----------
     176  196  199   org/eclipse/openj9/criu/JVMCheckpointException
  Argument Count: 0
  Temp Count: 10

    0 ldc 7 (java.lang.String) cpData
    2 iconst0 
    3 anewarray 27 java/lang/String
    6 invokestatic 28 java/nio/file/Paths.get(Ljava/lang/String;[Ljava/lang/String;)Ljava/nio/file/Path;
    9 astore0 
   10 newdup 30 org/openj9/criu/TestResult
   13 dup 
   14 iconst1 
   15 iconst0 
   16 invokespecial 31 org/openj9/criu/TestResult.<init>(ZI)V
   19 astore1 
   20 aload1 
   21 invokedynamic 3 bsm #3:run(Lorg/openj9/criu/TestResult;)Ljava/lang/Runnable;
   24 nop 
   25 nop 
   26 astore2 
   27 bipush 10
   29 anewarray 32 java/lang/Thread
   32 astore3 
   33 aload3 
   34 astore 4
   36 aload 4
   38 arraylength 
   39 istore 5
   41 iconst0 
   42 istore 6
   44 iload 6
   46 iload 5
   48 ificmpge 79
   51 aload 4
   53 iload 6
   55 aaload 
   56 astore 7
   58 newdup 32 java/lang/Thread
   61 dup 
   62 aload2 
   63 invokespecial 33 java/lang/Thread.<init>(Ljava/lang/Runnable;)V
   66 astore 7
   68 aload 7
   70 invokevirtual 34 java/lang/Thread.start()V
   73 iinc 6 1
   76 goto 44
   79 aload1 
   80 getfield 38 org/openj9/criu/TestResult.lockStatus I
   83 iconst5 
   84 ificmpge 93
   87 invokestatic 39 java/lang/Thread.yield()V <-----
   90 goto 79
   93 iconst1 
   94 anewarray 51 java/net/URL
   97 dup 
   98 iconst0 
   99 ldc 13 (java.lang.Class) org/openj9/criu/DeadlockTest$A
  101 invokevirtual 52 java/lang/Class.getProtectionDomain()Ljava/security/ProtectionDomain;
  104 invokevirtual 53 java/security/ProtectionDomain.getCodeSource()Ljava/security/CodeSource;
  107 invokevirtual 54 java/security/CodeSource.getLocation()Ljava/net/URL;
  110 aastore 
  111 astore 4
  113 newdup 55 java/net/URLClassLoader
  116 dup 
  117 aload 4
  119 invokespecial 56 java/net/URLClassLoader.<init>([Ljava/net/URL;)V
  122 astore 5
  124 ldc 13 (java.lang.Class) org/openj9/criu/DeadlockTest$A
  126 invokestatic 57 org/openj9/criu/DeadlockTest.getClassBytesFromResource(Ljava/lang/Class;)[B
  129 astore 6
  131 getstatic 58 org/openj9/criu/DeadlockTest.unsafe Ljdk/internal/misc/Unsafe;
  134 ldc 13 (java.lang.Class) org/openj9/criu/DeadlockTest$A
  136 invokevirtual 59 java/lang/Class.getName()Ljava/lang/String;
  139 aload 6
  141 iconst0 
  142 aload 6
  144 arraylength 
  145 aload 5
  147 aconstnull 
  148 invokevirtual 60 jdk/internal/misc/Unsafe.defineClass(Ljava/lang/String;[BIILjava/lang/ClassLoader;Ljava/security/ProtectionDomain;)Ljava/lang/Class;
  151 astore 7
  153 newdup 35 org/eclipse/openj9/criu/CRIUSupport
  156 dup 
  157 aload0 
  158 invokespecial 36 org/eclipse/openj9/criu/CRIUSupport.<init>(Ljava/nio/file/Path;)V
  161 astore 8
  163 aload 8
  165 aload 7
  167 invokedynamic 4 bsm #4:run(Ljava/lang/Class;)Ljava/lang/Runnable;
  170 nop 
  171 nop 
  172 invokevirtual 37 org/eclipse/openj9/criu/CRIUSupport.registerPreCheckpointHook(Ljava/lang/Runnable;)Lorg/eclipse/openj9/criu/CRIUSupport;
  175 pop 
  176 getstatic 40 java/lang/System.out Ljava/io/PrintStream;
  179 ldc 8 (java.lang.String) Pre-checkpoint
  181 invokevirtual 41 java/io/PrintStream.println(Ljava/lang/String;)V
  184 aload 8
  186 aload0 
  187 iconst1 
  188 invokestatic 42 org/openj9/criu/CRIUTestUtils.checkPointJVM(Lorg/eclipse/openj9/criu/CRIUSupport;Ljava/nio/file/Path;Z)V
  191 aload1 
  192 iconst1 
  193 putfield 43 org/openj9/criu/TestResult.testPassed Z
  196 goto 211
  199 astore 9
  201 aload1 
  202 iconst0 
  203 putfield 43 org/openj9/criu/TestResult.testPassed Z
  206 aload 9
  208 invokevirtual 48 org/eclipse/openj9/criu/JVMCheckpointException.printStackTrace()V
  211 aload1 
  212 getfield 43 org/openj9/criu/TestResult.testPassed Z
  215 ifeq 229
  218 getstatic 40 java/lang/System.out Ljava/io/PrintStream;
  221 ldc 10 (java.lang.String) TEST PASSED
  223 invokevirtual 41 java/io/PrintStream.println(Ljava/lang/String;)V
  226 goto 237
  229 getstatic 40 java/lang/System.out Ljava/io/PrintStream;
  232 ldc 11 (java.lang.String) TEST FAILED
  234 invokevirtual 41 java/io/PrintStream.println(Ljava/lang/String;)V
  237 iconst0 
  238 invokestatic 49 java/lang/System.exit(I)V
  241 return0 

> !j9object 0x0000000706138250
!J9Object 0x0000000706138250 {
  struct J9Class* clazz = !j9class 0x1FC600 // org/openj9/criu/TestResult
  Object flags = 0x00000020;
  I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
  Z testPassed = 0x00000001 (offset = 4) (org/openj9/criu/TestResult)
  I lockStatus = 0x00000003 (offset = 8) (org/openj9/criu/TestResult) <-----
}

@singh264 singh264 changed the title Increase the wait time for checkpoint safety Add options to configure the checkpoint safety wait time Dec 4, 2023
@singh264 singh264 marked this pull request as draft December 4, 2023 18:03
@singh264
Copy link
Contributor Author

singh264 commented Dec 5, 2023

@tajila I added options to configure the checkpoint safety wait time with -XX:maxRetryForNotCheckpointSafe= and -XX:sleepMillisecondsForNotCheckpointSafe=, configured the checkpoint safety wait time for the re-enabled MethodTypeDeadlockTest to be 2 s, and fixed the race condition in MethodTypeDeadlockTest for the shared resource testResult.lockStatus by modifying its type from volatile int to AtomicInteger.

Re-requesting your review.

@singh264 singh264 marked this pull request as ready for review December 5, 2023 01:19
@tajila
Copy link
Contributor

tajila commented Dec 5, 2023

Can you do some grinders on x, p and z to verify that the test will not fail. Do 50x for each and try to spread them out on different machines.

@singh264
Copy link
Contributor Author

singh264 commented Dec 5, 2023

Can you do some grinders on x, p and z to verify that the test will not fail. Do 50x for each and try to spread them out on different machines.

The grinder on x passed, the cause of the failures for the grinders on all of the p machines and on some of the z machines seems to be due to the setup of CRIU on the machines, and the grinders do not indicate failures that occur due to a timeout while code is being executed in @NotCheckpointSafe frames or failures that occur due to the test timing out by the race condition.

x: https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/36552
p: https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/36571
z: https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/36554

@singh264 singh264 changed the title Add options to configure the checkpoint safety wait time Configure the wait time for checkpoint safety Dec 5, 2023
@tajila
Copy link
Contributor

tajila commented Dec 5, 2023

okay, enable the test on X only for now.

Increase the wait time for checkpoint safety from 1 ms
to 2 s for threads initializing a class, and fix the
failure in the re-enabled MethodTypeDeadlockTest that
occurs due to an insufficient wait time for checkpoint
safety for threads initializing a class.

Issue: eclipse-openj9#15806
Signed-off-by: Amarpreet Singh <[email protected]>
Simplify the code that increases the wait time for
checkpoint safety from 1 ms to 2 s by sleeping for a
fixed time per checkpoint safety check.

Issue: eclipse-openj9#15806
Signed-off-by: Amarpreet Singh <[email protected]>
Add -XX:maxRetryForNotCheckpointSafe= and
-XX:sleepMillisecondsForNotCheckpointSafe= to control the
checkpoint safety wait time for MethodTypeDeadlockTest.

Issue: eclipse-openj9#15806
Signed-off-by: Amarpreet Singh <[email protected]>
Fix the race condition in MethodTypeDeadlockTest for the
shared resource testResult.lockStatus by modifying its
type from volatile int to AtomicInteger, and avoid the
test from timing out when the main thread is blocked
while waiting for testResult.lockStatus to reflect the
count of the worker threads that were started by the
main thread.

Issue: eclipse-openj9#15806
Signed-off-by: Amarpreet Singh <[email protected]>
Disable criu_nonPortableRestore tests on s390x due to the
limited number of s390x machines with proper CRIU setup.

Issue: eclipse-openj9#15806
Signed-off-by: Amarpreet Singh <[email protected]>
@singh264
Copy link
Contributor Author

singh264 commented Dec 5, 2023

okay, enable the test on X only for now.

I have disabled the test on z, it is now enabled only for x.

@tajila
Copy link
Contributor

tajila commented Dec 6, 2023

jenkins test sanity alinux64 jdk17

@tajila tajila merged commit b80ae24 into eclipse-openj9:master Dec 7, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:vm criu Used to track CRIU snapshot related work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants