Jenkins jobs lose contact with AArch64 macOS machine #6774

Closed
knn-k opened this issue Oct 19, 2022 · 25 comments

Comments

@knn-k
Contributor

knn-k commented Oct 19, 2022

PR #6637 added AArch64 macOS machines to the pipeline recently.
I see the Jenkins jobs on AArch64 macOS often fail with the following exception in the middle of running tests:

Cannot contact mac11-aarch64-08: java.lang.InterruptedException

@knn-k
Contributor Author

knn-k commented Oct 19, 2022

See the examples below.

https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/17/consoleText:

[2022-10-12T05:40:19.244Z] 14: [----------] 38 tests from PortSysinfoTest
[2022-10-12T05:40:19.244Z] 14: originalSoftLimit=10240
[2022-10-12T05:40:19.244Z] 14: originalHardLimit=24576
[2022-10-12T05:40:19.244Z] 14: soft set to hard limit=24576
[2022-10-12T05:40:46.322Z] Cannot contact mac11-aarch64-08: java.lang.InterruptedException

https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/20/consoleText

[2022-10-13T01:02:32.938Z] 19: [----------] 2 tests from ThreadExtendedTest
[2022-10-13T01:03:29.709Z] Cannot contact mac11-aarch64-08: java.lang.InterruptedException

https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/22/consoleText

[2022-10-19T06:41:56.172Z] 19: [----------] 2 tests from ThreadExtendedTest
[2022-10-19T06:42:50.333Z] Cannot contact mac11-aarch64-08: java.lang.InterruptedException

https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/23/consoleText

[2022-10-19T07:06:28.689Z] 19: [==========] Running 6 tests from 4 test cases.
[2022-10-19T07:06:28.689Z] 19: [----------] 2 tests from ThreadCpuTime
[2022-10-19T07:06:48.183Z] Cannot contact mac11-aarch64-08: java.lang.InterruptedException

Other recent jobs ran to the end of the tests on the same machine, mac11-aarch64-08.
https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/16/
https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/18/
https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/19/
https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/21/
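
To compare the failing runs, it may help to measure how long the agent was silent between the last test output and the "Cannot contact" message. The following is a minimal sketch, not part of the OMR or Jenkins tooling. The function names timestamp_to_ms and gap_ms are hypothetical, and it assumes both timestamps fall on the same UTC day and use the layout shown in the console logs above.

```c
#include <stdio.h>

/* Hypothetical helper: convert the time-of-day part of a Jenkins console
 * timestamp such as "2022-10-19T06:41:56.172Z" into milliseconds since
 * midnight. Returns -1 if the string does not match that layout. */
long timestamp_to_ms(const char *ts)
{
    int year, month, day, hour, min, sec, ms;

    if (sscanf(ts, "%d-%d-%dT%d:%d:%d.%dZ",
               &year, &month, &day, &hour, &min, &sec, &ms) != 7) {
        return -1;
    }
    return ((hour * 60L + min) * 60L + sec) * 1000L + ms;
}

/* Hypothetical helper: gap in milliseconds between the last test output
 * and the "Cannot contact" line. Assumes both are on the same UTC day.
 * Returns -1 if either timestamp cannot be parsed. */
long gap_ms(const char *last_output, const char *lost_contact)
{
    long a = timestamp_to_ms(last_output);
    long b = timestamp_to_ms(lost_contact);

    if (a < 0 || b < 0) {
        return -1;
    }
    return b - a;
}
```

Called with the two timestamps from job 22, gap_ms returns 54161, that is, about 54 seconds of silence. The four failing logs above give gaps between roughly 19 and 57 seconds.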

@knn-k
Contributor Author

knn-k commented Oct 19, 2022

fyi @AdamBrousseau

@knn-k changed the title from "Jenkins jobs lose contact with AArch64 macOS machines" to "Jenkins jobs lose contact with AArch64 macOS machine" on Oct 19, 2022
@knn-k
Contributor Author

knn-k commented Oct 24, 2022

@AdamBrousseau Is there any way to collect more information on what happens when the "Cannot contact mac11-aarch64-08" message appears?

@AdamBrousseau
Contributor

This is a machine that is shared with the OpenJ9 Jenkins farm. We use a different username to ssh to the machine.
I notice there are a lot of core dumps and related files in the jenkins user's (openj9) home dir. OpenJ9 jenkins has taken the machine offline at the moment because there is less than 1GB space left in $HOME/jenkins. That being said, I don't think this should be causing the error you are seeing (Cannot contact mac11-aarch64-08). OMR Jenkins should technically be taking the node offline too because omr user would share the same partition (/Users). But it should not be taking it offline when there is a job running.

Interesting. We had made the change to connect as omr user but the node was still connected as jenkins user. I restarted the agent and it is now using omr. Let's keep an eye on it to see if that solves the problem. There is a job on OpenJ9 farm that will kill all the processes owned by jenkins including the agent jar so I'm hoping this is what is happening.

I will follow up on the core files from OpenJ9 farm.

@knn-k
Contributor Author

knn-k commented Nov 1, 2022

Two failures in a row today.

https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/32/

[2022-11-01T13:04:07.018Z] Cannot contact mac11-aarch64-08: java.lang.InterruptedException

https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/33/

[2022-11-01T13:18:13.142Z] Cannot contact mac11-aarch64-08: java.lang.InterruptedException

@knn-k
Contributor Author

knn-k commented Nov 2, 2022

@AdamBrousseau
Contributor

@knn-k @0xdaryl How large of an impact would it be if we temporarily disable the aarch64 mac from the omr farm? We would need at least a few days to prove/disprove that the second java agent is causing us to run out of memory.

@0xdaryl
Contributor

0xdaryl commented Nov 2, 2022

The macOS on Apple Silicon build with OMR doesn't run cleanly at the moment. @knn-k is gradually fixing the functional problems though. I would say taking it offline for a few days won't have a big impact at this stage.

@knn-k
Contributor Author

knn-k commented Nov 2, 2022

It is acceptable to temporarily disable the AArch64 macOS (amac) builds.

@knn-k
Contributor Author

knn-k commented Nov 8, 2022

Still failing frequently: 4 of the 10 most recent jobs (37, 40, 44, and 46).
https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/37

@knn-k
Contributor Author

knn-k commented Nov 21, 2022

Since last week, 5 of 8 jobs (48, 49, 50, 53, 54) have failed with "Cannot contact mac11-aarch64-08".
https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/

@AdamBrousseau
Contributor

Somehow the SCC (shared classes cache) for the Java process running the Jenkins agent is getting corrupted. I would assume there's a problem with the JDK, or that it is caused by the OMR testing. Not sure. I can give dev(s) access to the machine if looking at the javacore etc. files will help.

@AdamBrousseau
Contributor

Tried running a test by hand. It hangs, then my SSH session gets dropped from the machine.

mac11-aarch64-8:~ omr$ /Users/omr/workspace/Build/build/fvtest/threadextendedtest/omrthreadextendedtest "--gtest_output=xml:/Users/omr/workspace/Build/build/fvtest/threadextendedtest/omrthreadextendedtest-results.xml"
[==========] Running 6 tests from 4 test cases.
[----------] 2 tests from ThreadCpuTime
[----------] 2 tests from ThreadCpuTime (495 ms total)

[----------] 1 test from CpuTimeTest
[----------] 1 test from CpuTimeTest (10 ms total)

[----------] 1 test from ApplicationCpuTimeTest
[----------] 1 test from ApplicationCpuTimeTest (9 ms total)

[----------] 2 tests from ThreadExtendedTest
client_loop: send disconnect: Broken pipe

Rebooting now....

@AdamBrousseau
Contributor

Same thing. Could there be something in those tests that is causing an issue?

@knn-k
Contributor Author

knn-k commented Feb 22, 2023

Thank you, @AdamBrousseau.
I reproduced the reboot of macOS in my local environment by running omrthreadextendedtest.
I ran it in the debugger, but I was not able to catch anything before the OS rebooted.

@knn-k
Contributor Author

knn-k commented Feb 24, 2023

ThreadExtendedTest contains two tests, and one of them, TestOtherThreadCputime, causes the OS reboot.
It creates 10 threads, each of which repeatedly calls omrtime_current_time_millis() in a busy loop in cpuLoad().
The test runs fine when I add usleep(1000); to the busy loop.

It could be a problem with how macOS tolerates high CPU load under certain conditions.

@knn-k
Contributor Author

knn-k commented Feb 24, 2023

Attached is a simplified standalone C test program for TestOtherThreadCputime.
threadLoadTest.c.txt

Running this program crashes an M1 Mac with macOS 11.7.4.
The program runs fine when usleep(1); in cpuLoad() is enabled.

How can I report it to Apple?

@knn-k
Contributor Author

knn-k commented Feb 24, 2023

I opened PR #6903 for disabling the test for the time being.

@knn-k
Contributor Author

knn-k commented Feb 24, 2023

I also tried running omrthreadextendedtest on another Mac (M1 Pro macOS 12.6.3), and it did not fail.
I don't know what makes the difference: the OS version (11.x / 12.x), the CPU (M1 / M1 Pro), or something else.

Which version of macOS does mac11-aarch64-08 run?

@AdamBrousseau
Contributor

> Which version of macOS does mac11-aarch64-08 run?

11.7.1

@knn-k
Contributor Author

knn-k commented Feb 28, 2023

#6903 was merged, and the Jenkins job runs without the OS reboot now.
https://ci.eclipse.org/omr/job/PullRequest-osx_aarch64/73
Socket test failure caused by #6516 is another story.

@AdamBrousseau
Contributor

AdamBrousseau commented Jul 18, 2023

Can this be closed now @knn-k ?
The machine has been upgraded to Mac13 fyi.

@knn-k
Contributor Author

knn-k commented Jul 18, 2023

My local test shows macOS 13 is more robust than macOS 11 against frequent calls to gettimeofday().
So I opened PR #7064 to enable the test again.

Let's close this issue after the PR is merged.

@knn-k
Contributor Author

knn-k commented Jul 21, 2023

#7064 has been merged. Closing.

@knn-k knn-k closed this as completed Jul 21, 2023
@knn-k
Contributor Author

knn-k commented Jul 21, 2023
