
Add test suites for core:sync and core:sync/chan #4232

Merged (23 commits) on Sep 16, 2024

Conversation

Feoramund
Contributor

This patch required much tinkering to get right. In the process, I've fixed over a dozen synchronization bugs, including the failure to detect more than 1 core (!) on FreeBSD and NetBSD in the core library, and (for FreeBSD only) in the Odin compiler itself. That bugfix alone improves the test runner's speed on those two platforms.

Of note, I converted @laytan's hack in the POSIX thread startup proc to use a semaphore instead of a mutex/condition variable, which should be more robust since it depends on a non-zero state instead of needing to loop on the condition.
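
For illustration, here is a minimal sketch of that pattern with made-up names (the real code lives in the POSIX thread startup proc in core:thread): a semaphore's count persists, so a wait that begins after the post returns immediately, whereas a condition-variable signal sent before the wait begins is simply lost.

```odin
package example

import "core:sync"

Startup_Gate :: struct {
	ready: sync.Sema,
}

// Creator: finish setting up the new thread's bookkeeping, then post.
creator_side :: proc(gate: ^Startup_Gate) {
	// ...initialize the thread's state...
	sync.sema_post(&gate.ready) // the count becomes non-zero and stays so
}

// New thread: no predicate loop is needed; the wait depends only on the
// semaphore's internal count being non-zero.
new_thread_side :: proc(gate: ^Startup_Gate) {
	sync.sema_wait(&gate.ready)
	// ...run the user's thread proc...
}
```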

I went over the entire test suite with -sanitize:thread and eliminated every data race I could find, and with the test suite now exercising every primitive, I was able to eliminate several deadlocks. I rewrote the recursive benaphore because I was having trouble avoiding deadlocks with the particular order in which it modified state. I think it's more robust now, but I'm not 100% certain.

I've run these tests locally on Linux, FreeBSD, and NetBSD several hundred times, with and without -sanitize:thread on (for Linux only, of course). Given the nature of sync issues, it's necessary to run it multiple times just to be sure.

Unfortunately, one of the core:sync tests isn't completing on the Darwin CI, which stalls the entire test runner. I think the runner is having an issue with triggering the failure timeout via pthread cancel, whose unreliability on Darwin we've discussed previously. I'm not sure what the actual underlying issue is, though, with regard to which test or tests are causing a deadlock. I do not have a local Darwin machine to investigate, but if someone could take a look at it, that'd be great. For now, I've marked the test suite as disabled on Darwin.

To summarize, we now have basic coverage for every core:sync primitive as well as chan.Chan. There may still be some edge cases for chan, but I fixed the ones I could find.

The calls to `atomic_add*` return the value before adding, not after, so
the previous code was causing the occasional data race.
This was occurring about 1 in 100 times with the test runner's thread pool.
One less value to store, and it should be less of a hack too.
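
As a sketch of the `atomic_add*` pitfall mentioned above (names here are illustrative): fetch-and-add returns the pre-add value, so code that wants the post-add value must adjust the result itself.

```odin
package example

import "core:sync"

counter: int

// `atomic_add` is a fetch-and-add: it returns the value *before* the
// addition, so the post-add value is `old + 1`, not `old`.
count_up :: proc() -> (new_value: int) {
	old := sync.atomic_add(&counter, 1)
	return old + 1 // treating `old` itself as the new count caused the race
}
```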

Semaphores will not wait around if they have the go-ahead; they depend
on an internal value being non-zero, instead of whatever was loaded when
they started waiting, which is the case with a `Cond`.
This can prevent a data race on Linux with `Self_Cleanup`.
This will also keep messages from being sent to closed, buffered
channels in general.
`w_waiting` is the signal that says a caller is waiting to be able to
send something. It is incremented upon send, and in the case of an
unbuffered channel, it can only hold one message.

Therefore, check that `w_waiting` is zero instead.
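
A rough sketch of that check, using the field names from the commit message on a hypothetical pared-down struct; the real logic in core:sync/chan is more involved.

```odin
package example

import "core:sync"

// Hypothetical subset of the channel state, just for illustration; the
// real `Raw_Chan` in core:sync/chan carries more fields.
Raw_Chan_Sketch :: struct {
	closed:    bool,
	w_waiting: int,
}

// An unbuffered channel can hold at most one in-flight message, and
// `w_waiting` is non-zero while that slot is claimed, so a send may
// proceed only when both the channel is open and `w_waiting` is zero.
can_send_unbuffered :: proc(c: ^Raw_Chan_Sketch) -> bool {
	return !sync.atomic_load(&c.closed) && sync.atomic_load(&c.w_waiting) == 0
}
```
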
A thread made inside a test does not share the test index of its parent,
so any time one of those threads failed an assert, it would tell the
runner to shut down test index zero.
Previously, if the owner called this, it would fail.
@Feoramund
Contributor Author

Regarding the NetBSD CI failure:

[ERROR] --- [2024-09-11 19:10:43] [structs.odin:66:wait_for()] waitpid() failure: Interrupted system call

I saw this come up intermittently before in my repo's CI, and it only started happening after I fixed the CPU count detection in core. I haven't been able to replicate it locally, and I am not sure yet what the fix would be.

@laytan
Collaborator

laytan commented Sep 11, 2024

> I haven't been able to replicate it locally, and I am not sure yet what the fix would be.

Ah, looks like I wrote that wrong; the loop should check for EINTR and continue.
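
For reference, a minimal sketch of that retry loop, assuming the core:sys/posix bindings expose waitpid and errno roughly as used here; the runner's actual wait_for() in structs.odin does more than this.

```odin
package example

import "core:c"
import "core:sys/posix"

// Reap a child process, retrying whenever a signal interrupts the wait.
wait_for_child :: proc(pid: posix.pid_t) -> (status: c.int, ok: bool) {
	for {
		if posix.waitpid(pid, &status, {}) != -1 {
			return status, true
		}
		// EINTR only means a signal interrupted the system call, so
		// retry the wait instead of reporting a failure.
		if posix.errno() != .EINTR {
			return status, false
		}
	}
}
```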

@Feoramund
Contributor Author

Regarding the stalls on Darwin, I want to mention that, as a possible fix, I tried porting the SHARED flags introduced into the compiler by 0342617 over to the usage of ulock_* and os_sync_* in the Odin core library, but my repo's CI continued to stall.

@flysand7
Contributor

> I've run these tests locally on Linux, FreeBSD, and NetBSD several hundred times, with and without -sanitize:thread on (for Linux only, of course). Given the nature of sync issues, it's necessary to run it multiple times just to be sure.

Just a heads up: I'm not sure about the FreeBSD and NetBSD kernels, but on Linux, sometimes running multiple times does not help, since the scheduler seemed very deterministic. I remember when debugging some of my multithreaded code that if a bug occurred once, it was occurring every single time, and if it never occurred, I had trouble triggering it.

Not sure what the solution is, except running a shitton of times while inserting calls to sched_yield at some random places (usually before taking or releasing locks). But yeah, it's just a thing I've noticed; maybe Linux changed its scheduler since 5.0 or something, I'm not sure.
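
Something like the following illustrates the trick (all names here are made up): a define-gated yield just before taking a lock invites the scheduler to interleave threads at the worst possible moment, widening race windows an otherwise deterministic scheduler might never expose.

```odin
package example

import "core:sync"
import "core:thread"

// Enable with -define:SHAKE_SCHEDULER=true during debugging runs.
SHAKE_SCHEDULER :: #config(SHAKE_SCHEDULER, false)

guarded_increment :: proc(m: ^sync.Mutex, value: ^int) {
	when SHAKE_SCHEDULER {
		thread.yield() // invite another thread to interleave right here
	}
	sync.mutex_lock(m)
	defer sync.mutex_unlock(m)
	value^ += 1
}
```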

@Feoramund
Contributor Author

> Just a heads up, not sure about FreeBSD and NetBSD kernels, but on linux sometimes running multiple times does not help, since the scheduler seemed very deterministic.

I've been doing the bulk of my development and testing on Linux 6.10, and running multiple times often produced different results, especially for non-deterministic deadlocks. The deadlock I commented about for the Auto_Reset_Event test was one of these, and the recursive benaphore demonstrated another. Sometimes they would pass, sometimes not. Sometimes they would fail more often if run with -define:ODIN_TEST_THREADS=1 or with the complete normal.odin test suite. There was little in the way of predictability about it.

@@ -164,12 +164,17 @@ send_raw :: proc "contextless" (c: ^Raw_Chan, msg_in: rawptr) -> (ok: bool) {
 }
 if c.queue != nil { // buffered
 	sync.guard(&c.mutex)
-	for c.queue.len == c.queue.cap {
+	for !sync.atomic_load(&c.closed) &&
Contributor

Does it really need to be loaded atomically? Every place where c.closed is accessed is guarded by c.mutex, meaning access to the variable is exclusive, and acquire-release semantics should apply across other threads entering the same critical sections.

Help me understand the logic behind the change.

Contributor Author

On second look, we might be able to change all c.closed loads to non-atomic, not just this one. I was following convention on this change, since the other accesses are atomic.

Contributor

Yeah, this should save an extra lwsync operation on ARM, iirc

Contributor Author

In the end, I removed all atomic loads and stores, as well as the other two mutexes. Everything is under guard by mutex already.
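
As a sketch of that conclusion, with illustrative names: once every access happens under the mutex, the lock's own acquire/release ordering publishes the writes, so plain loads and stores suffice.

```odin
package example

import "core:sync"

Chan_State :: struct {
	mutex:  sync.Mutex,
	closed: bool,
}

close_chan :: proc(c: ^Chan_State) {
	sync.guard(&c.mutex) // locks now, unlocks at the end of this scope
	c.closed = true      // plain store; the mutex makes it visible
}

is_closed :: proc(c: ^Chan_State) -> bool {
	sync.guard(&c.mutex)
	return c.closed // plain load, for the same reason
}
```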

if recursion == 0 {
if atomic_sub_explicit(&b.counter, 1, .Relaxed) == 1 {
atomic_store_explicit(&b.owner, 0, .Release)
} else {
Contributor

Pretty sure you forgot to reset the owner to 0 in the other branch of this if statement.

Contributor Author

Because the semaphore post signals the other waiting thread to stop waiting at line 457, the next instruction after that wait sets the appropriate owner variable. It's only necessary to set the owner here if no one else is going to.
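
A simplified sketch of that handoff, with abbreviated fields; the real recursive benaphore in core:sync differs in detail:

```odin
package example

import "core:sync"

Recursive_Benaphore_Sketch :: struct {
	counter:   i32,
	owner:     int, // the owning thread's id; 0 means unowned
	recursion: int,
	sema:      sync.Sema,
}

unlock_sketch :: proc(b: ^Recursive_Benaphore_Sketch, my_id: int) {
	assert(b.owner == my_id)
	b.recursion -= 1
	if b.recursion != 0 {
		return // still recursively held by this thread
	}
	if sync.atomic_sub_explicit(&b.counter, 1, .Relaxed) == 1 {
		// No one else was waiting, so this thread must clear the owner.
		sync.atomic_store_explicit(&b.owner, 0, .Release)
	} else {
		// Someone is blocked in lock(); the post wakes that thread, and
		// its next step is to write its own id into `owner`. Clearing
		// `owner` here would only race with that write.
		sync.sema_post(&b.sema)
	}
}
```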

Everything was already guarded by `c.mutex`.
@laytan
Collaborator

laytan commented Sep 16, 2024

I can look into the Darwin issues with this.

@gingerBill gingerBill merged commit 017d6bd into odin-lang:master Sep 16, 2024
7 checks passed