-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gforker: propagation of exit status #7115
Comments
Fantastic! Now we have phishing in GitHub comments? 🤦 |
The code that aggregates exit code is here : Lines 207 to 209 in 4f8e4a9
If you can reproduce it locally, I would try add debug printf there to see if anything is usual. PS: after that, I would try check here: Lines 407 to 427 in 4f8e4a9
|
Here you have a reproducer that at least show different behavior between hydra and gforker, and their interaction with the child process in case of SIGNIT vs no signal. code='raise KeyboardInterrupt'
#code='raise RuntimeError'
#code='import sys; sys.exit(42)'
python -c "$code" >/dev/null 2>&1
echo $?
mpiexec.hydra -n 1 python -c "$code" >/dev/null 2>&1
echo $?
mpiexec.gforker -n 1 python -c "$code" >/dev/null 2>&1
echo $? The code If you run the exact lines above in bash, you should get: 130
2
0 If you uncomment
If you uncomment
I'll detail an important Python-specific piece of information: Upon an unhandled setsig(SIGINT, SIG_DFL);
kill(getpid(), SIGINT);
return SIGINT + 128; |
I see. Thanks for the simple reproducer! I can make gforker to return the same as Lines 407 to 427 in 4f8e4a9
rc = prog_stat; .
|
Indeed, the following patch makes hydra and gforker behave identically with my reproducer. I think it would be a good idea to push these changes to make the two PMs behave consistently. diff --git a/src/pm/util/process.c b/src/pm/util/process.c
index b5b1ad2d1e..c0757144a2 100644
--- a/src/pm/util/process.c
+++ b/src/pm/util/process.c
@@ -411,6 +411,7 @@ void MPIE_ProcessSetExitStatus(ProcessState * pState, int prog_stat)
rc = WEXITSTATUS(prog_stat);
}
if (WIFSIGNALED(prog_stat)) {
+ rc = prog_stat;
sigval = WTERMSIG(prog_stat);
}
pState->exitStatus.exitStatus = rc; However, after further consideration, I do not think this is the root of problem in my CI failures. When executing mpiexec through Python's |
A few times a month, mpi4py's testsuite running daily in GitHub Actions fails the following way:
Full logs here
This is the test code is here.
What this test does is to execute some Python code with np=1,2 , and then one of the ranks (but not both in the np=2 case) signals itself with SIGINT. Upon an uncaught SIGINT, Python's signal/error handling prints to stderr and exits cleanly with return status=SIGINT+128 (that it, 130 on Linux). My test is indeed checking that exit statuses are propagated properly from the MPI process through mpiexec and back to the user. Very sporadically, on GitHub Actions, that is not the case for gforker.
I believe there must be either some sort of race condition, or perhaps a two short waiting time, that originates this non-reproducible issue. I looked at gforker's sources, but I'm afraid I do not understand it well enough as to guess where the problem may be. Any tips on how to further debug the issue?
The text was updated successfully, but these errors were encountered: