pkg/runtest: TestCover keeps timing out #4954
One more case: https://github.com/google/syzkaller/actions/runs/9758033237/job/26931740035 It was not the case up to this week, so I guess it's caused by some recent changes.
It happens not just under
That's here: syzkaller/pkg/runtest/run_test.go Line 390 in 780d1bc
That's here: syzkaller/pkg/vminfo/features.go Line 69 in 780d1bc
So some syz-executors just stop communicating and we wait forever for their responses? That's bad by itself; we probably somehow fail to react to their errors. As for why they fail -- could it be that the Docker container we use for testing somehow limits the number of simultaneous processes? After the recent change that made TestCover subtests create their own RPC servers, the load must have increased significantly.
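One quick way to test the process-limit hypothesis from inside the CI container is to count the processes it can actually see. The snippet below is a purely diagnostic sketch (not syzkaller code) that counts PID directories under `/proc`; it assumes a Linux environment:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// countProcs counts the processes visible to this container by
// listing the numeric entries under /proc (each PID has a directory).
func countProcs() (int, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return 0, err
	}
	n := 0
	for _, e := range entries {
		// Numeric directory names correspond to PIDs.
		if _, err := strconv.Atoi(e.Name()); err == nil {
			n++
		}
	}
	return n, nil
}

func main() {
	n, err := countProcs()
	if err != nil {
		fmt.Println("cannot read /proc:", err)
		return
	}
	fmt.Println("processes visible:", n)
}
```

Running this at the point where executors start dying would show whether the count plateaus near a suspicious round number (a typical symptom of a pids cgroup limit).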
https://github.com/google/syzkaller/actions/runs/9779999561/job/27000577263
Probably related?
Memory consumption doesn't look crazy:
In any case, I've tried to reproduce it many times locally and never managed to.
@a-nogikh what commit do you think triggered or scaled this problem?
I think we've begun to observe it after |
We may try to remove t.Parallel() from there; maybe there is some contention...
Have you tried with |
The test VMs should actually be quite big (according to @tarasmadan) -- up to 32 cores. If the tests were just getting slower, I think they would have still passed, only slower. But here something definitely goes very wrong, and in a weird way. I wonder if we're really staying careful w.r.t. the syzkaller/pkg/rpcserver/local.go Line 87 in ecfab6a
syzkaller/pkg/rpcserver/local.go Lines 96 to 97 in ecfab6a
And then eventually close the flatrpc server. But do the runner loops ever notice that? If we don't send new requests, we're just sleeping and not even interacting with the socket. syzkaller/pkg/rpcserver/runner.go Lines 166 to 170 in ecfab6a
(If yes, we have a separate problem -- no new requests mean we never read from the socket, so we never get results for the previous one(s).) Ideally,
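The hazard described above can be made concrete with a small sketch (names are illustrative, not syzkaller's actual code): if the runner loop only touches the connection when it has a new request to send, a reply to an already in-flight request sits unread forever. A dedicated reader goroutine decouples draining replies from sending requests:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	client, server := net.Pipe()

	// "Executor" side: sends a reply for an old request even though
	// no new request will ever arrive.
	go func() {
		fmt.Fprintln(server, "reply-for-old-request")
	}()

	// Fix sketch: a reader goroutine that always drains replies,
	// independent of whether the request loop is sending anything.
	replies := make(chan string)
	go func() {
		scanner := bufio.NewScanner(client)
		for scanner.Scan() {
			replies <- scanner.Text()
		}
	}()

	// The request loop can now select over replies, new requests,
	// and shutdown alike, instead of blocking on the socket.
	fmt.Println(<-replies) // prints "reply-for-old-request"
}
```

With this structure, closing the connection also makes the scanner loop terminate, so the runner notices the shutdown even when idle.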
Good point. There are sleeping goroutines in connectionLoop in the dump. |
2024/07/05 09:00:39 [FATAL] check failed: coverage is not supported: executor crashed |
https://github.com/google/syzkaller/actions/runs/9842015867/job/27170019883 Looks like after #4991 we've begun to fail early instead of hanging infinitely:
I've extracted and grouped the goroutines from one of the failed presubmit runs. I'll skip all the irrelevant stacks. We have some runners that sleep waiting for new requests. It's bad that there are so many of them, but these goroutines cannot block the test from exiting.
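For reference, grouped goroutine dumps like the ones analyzed below can be produced programmatically via the standard `runtime/pprof` package (debug=1 groups goroutines with identical stacks; debug=2 prints every goroutine separately). In CI, sending SIGQUIT to the hanging test process yields a similar dump. A minimal sketch:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
	"strings"
)

func main() {
	var buf bytes.Buffer
	// debug=1: group identical stacks together, like the dump below.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The dump starts with a "goroutine profile: total N" header.
	fmt.Println(strings.HasPrefix(buf.String(), "goroutine profile:")) // prints true
}
```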
One of the local rpcservers hung at the machine check stage.
Since the machine check never finished, we never got to execute the test program:
Here's our hanged rpcserver.
And here's the runner loop that is blocked. We were awaiting a reply from the executor, but never got it.
At the same time, the process is alive -- otherwise
Here's where we are hanging in the machine check implementation.
And this is rpcserver's listening goroutine. Nothing wrong here.
So it does look like the executor may sometimes just not reply to the server. I can imagine that happening during fuzzing (we may execute some dangerous stuff), but it's definitely not normal during runtest tests, and especially not at the machine check phase.
#5019 should hopefully shed some more light on the problem.
https://github.com/google/syzkaller/actions/runs/9890519329/job/27318935942
On a recent post-submit test run: https://github.com/google/syzkaller/actions/runs/9742478582/job/26883877667