
Add test suites for core:sync and core:sync/chan #4232

Merged (23 commits) on Sep 16, 2024

Conversation

Feoramund
Contributor

This patch required much tinkering to get right. In the process, I've fixed over a dozen synchronization bugs, including the failure to detect more than 1 core (!) on FreeBSD and NetBSD in the core library, and (for FreeBSD only) in the Odin compiler itself. That bugfix alone improves the test runner's speed on those two platforms.

Of note, I converted @laytan's hack in the POSIX thread startup proc to use a semaphore instead of a mutex/condition variable, which should be more robust since it depends on a non-zero state instead of needing to loop on the condition.
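
For illustration, here is a minimal sketch of that pattern with made-up names (the real code lives in the POSIX thread startup proc in core:thread): a semaphore's count persists, so a wait that begins after the post returns immediately, whereas a condition-variable signal sent before the wait begins is simply lost.

```odin
package example

import "core:sync"

Startup_Gate :: struct {
	ready: sync.Sema,
}

// Creator: finish setting up the new thread's bookkeeping, then post.
creator_side :: proc(gate: ^Startup_Gate) {
	// ...initialize the thread's state...
	sync.sema_post(&gate.ready) // the count becomes non-zero and stays so
}

// New thread: no predicate loop is needed; the wait depends only on the
// semaphore's internal count being non-zero.
new_thread_side :: proc(gate: ^Startup_Gate) {
	sync.sema_wait(&gate.ready)
	// ...run the user's thread proc...
}
```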

I went over the entire test suite with -sanitize:thread and eliminated every data race I could find, and with the test suite now exercising every primitive, I was able to eliminate several deadlocks. I rewrote the recursive benaphore because I was having trouble avoiding deadlocks with the particular order in which it modified state. I think it's more robust now, but I'm not 100% certain.

I've run these tests locally on Linux, FreeBSD, and NetBSD several hundred times, with and without -sanitize:thread on (for Linux only, of course). Given the nature of sync issues, it's necessary to run it multiple times just to be sure.

Unfortunately, one of the core:sync tests isn't completing on the Darwin CI, which stalls the entire test runner. I think the runner is having an issue with triggering the failure timeout via pthread cancel, whose unreliability on Darwin we've discussed previously. I'm not sure what the actual underlying issue is, though, with regard to which test or tests are causing a deadlock. I do not have a local Darwin machine to investigate, but if someone could take a look at it, that'd be great. For now, I've marked the test suite as disabled on Darwin.

To summarize, we now have basic coverage for every core:sync primitive as well as chan.Chan. There may still be some edge cases for chan, but I fixed the ones I could find.

The calls to `atomic_add*` return the value before adding, not after, so
the previous code was causing the occasional data race.
This was occurring about 1 in 100 times with the test runner's thread pool.
One less value to store, and it should be less of a hack too.
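
As a sketch of the `atomic_add*` pitfall mentioned above (names here are illustrative): fetch-and-add returns the pre-add value, so code that wants the post-add value must adjust the result itself.

```odin
package example

import "core:sync"

counter: int

// `atomic_add` is a fetch-and-add: it returns the value *before* the
// addition, so the post-add value is `old + 1`, not `old`.
count_up :: proc() -> (new_value: int) {
	old := sync.atomic_add(&counter, 1)
	return old + 1 // treating `old` itself as the new count caused the race
}
```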

Semaphores will not wait around if they have the go-ahead; they depend
on an internal value being non-zero, instead of whatever was loaded when
they started waiting, which is the case with a `Cond`.
This can prevent a data race on Linux with `Self_Cleanup`.
This will also keep messages from being sent to closed, buffered
channels in general.
`w_waiting` is the signal that says a caller is waiting to be able to
send something. It is incremented upon send, and in the case of an
unbuffered channel, it can only hold one message.

Therefore, check that `w_waiting` is zero instead.
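
A rough sketch of that check, using the field names from the commit message on a hypothetical pared-down struct; the real logic in core:sync/chan is more involved.

```odin
package example

import "core:sync"

// Hypothetical subset of the channel state, just for illustration; the
// real `Raw_Chan` in core:sync/chan carries more fields.
Raw_Chan_Sketch :: struct {
	closed:    bool,
	w_waiting: int,
}

// An unbuffered channel can hold at most one in-flight message, and
// `w_waiting` is non-zero while that slot is claimed, so a send may
// proceed only when both the channel is open and `w_waiting` is zero.
can_send_unbuffered :: proc(c: ^Raw_Chan_Sketch) -> bool {
	return !sync.atomic_load(&c.closed) && sync.atomic_load(&c.w_waiting) == 0
}
```
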
A thread made inside a test does not share the test index of its parent,
so any time one of those threads failed an assert, it would tell the
runner to shut down test index zero.
Previously, if the owner called this, it would fail.
@Feoramund
Contributor Author

Regarding the NetBSD CI failure:

[ERROR] --- [2024-09-11 19:10:43] [structs.odin:66:wait_for()] waitpid() failure: Interrupted system call

I saw this come up intermittently before in my repo's CI, and it only started happening after I fixed the CPU count detection in core. I haven't been able to replicate it locally, and I am not sure yet what the fix would be.

@laytan
Collaborator

laytan commented Sep 11, 2024

> I haven't been able to replicate it locally, and I am not sure yet what the fix would be.

Ah, looks like I wrote that wrong; the loop should check for EINTR and continue.
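
For reference, a minimal sketch of that retry loop, assuming the core:sys/posix bindings expose waitpid and errno roughly as used here; the runner's actual wait_for() in structs.odin does more than this.

```odin
package example

import "core:c"
import "core:sys/posix"

// Reap a child process, retrying whenever a signal interrupts the wait.
wait_for_child :: proc(pid: posix.pid_t) -> (status: c.int, ok: bool) {
	for {
		if posix.waitpid(pid, &status, {}) != -1 {
			return status, true
		}
		// EINTR only means a signal interrupted the system call, so
		// retry the wait instead of reporting a failure.
		if posix.errno() != .EINTR {
			return status, false
		}
	}
}
```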

@Feoramund
Contributor Author

Regarding the stalls on Darwin, I want to mention that, as a possible fix, I tried porting the SHARED flags introduced into the compiler by 0342617 over to the usage of ulock_* and os_sync_* in the Odin core library, but my repo's CI continued to stall.

@flysand7
Contributor

> I've run these tests locally on Linux, FreeBSD, and NetBSD several hundred times, with and without -sanitize:thread on (for Linux only, of course). Given the nature of sync issues, it's necessary to run it multiple times just to be sure.

Just a heads up: I'm not sure about the FreeBSD and NetBSD kernels, but on Linux, sometimes running multiple times does not help, since the scheduler seemed very deterministic. I remember when debugging some of my multithreaded code that if a bug occurred once, it was occurring every single time, and if it never occurred, I had trouble triggering it.

Not sure what the solution is, except running a shitton of times while inserting calls to sched_yield at some random places (usually before taking or releasing locks). But yeah, it's just a thing I've noticed; maybe Linux changed its scheduler since 5.0 or something, I'm not sure.
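
Something like the following illustrates the trick (all names here are made up): a define-gated yield just before taking a lock invites the scheduler to interleave threads at the worst possible moment, widening race windows an otherwise deterministic scheduler might never expose.

```odin
package example

import "core:sync"
import "core:thread"

// Enable with -define:SHAKE_SCHEDULER=true during debugging runs.
SHAKE_SCHEDULER :: #config(SHAKE_SCHEDULER, false)

guarded_increment :: proc(m: ^sync.Mutex, value: ^int) {
	when SHAKE_SCHEDULER {
		thread.yield() // invite another thread to interleave right here
	}
	sync.mutex_lock(m)
	defer sync.mutex_unlock(m)
	value^ += 1
}
```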

@Feoramund
Contributor Author

> Just a heads up, not sure about FreeBSD and NetBSD kernels, but on linux sometimes running multiple times does not help, since the scheduler seemed very deterministic.

I've been doing the bulk of my development and testing on Linux 6.10, and running multiple times often produced different results, especially for non-deterministic deadlocks. The deadlock I commented about for the Auto_Reset_Event test was one of these, and the recursive benaphore demonstrated another. Sometimes they would pass, sometimes not. Sometimes they would fail more often if run with -define:ODIN_TEST_THREADS=1 or with the complete normal.odin test suite. There was little in the way of predictability about it.

@@ -164,12 +164,17 @@ send_raw :: proc "contextless" (c: ^Raw_Chan, msg_in: rawptr) -> (ok: bool) {
 }
 if c.queue != nil { // buffered
 	sync.guard(&c.mutex)
-	for c.queue.len == c.queue.cap {
+	for !sync.atomic_load(&c.closed) &&
Contributor

Does it really need to be loaded atomically? Every place where c.closed is accessed is guarded by c.mutex, meaning access to the variable is exclusive, and acquire-release semantics should apply across other threads entering the same critical sections.

Help me understand the logic behind the change.

Contributor Author

On second look, we might be able to change all c.closed loads to non-atomic, not just this one. I was following convention on this change, since the other accesses are atomic.

Contributor

Yeah, this should save an extra lwsync operation on ARM, iirc

Contributor Author

In the end, I removed all atomic loads and stores, as well as the other two mutexes. Everything is under guard by mutex already.
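
As a sketch of that conclusion, with illustrative names: once every access happens under the mutex, the lock's own acquire/release ordering publishes the writes, so plain loads and stores suffice.

```odin
package example

import "core:sync"

Chan_State :: struct {
	mutex:  sync.Mutex,
	closed: bool,
}

close_chan :: proc(c: ^Chan_State) {
	sync.guard(&c.mutex) // locks now, unlocks at the end of this scope
	c.closed = true      // plain store; the mutex makes it visible
}

is_closed :: proc(c: ^Chan_State) -> bool {
	sync.guard(&c.mutex)
	return c.closed // plain load, for the same reason
}
```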

if recursion == 0 {
if atomic_sub_explicit(&b.counter, 1, .Relaxed) == 1 {
atomic_store_explicit(&b.owner, 0, .Release)
} else {
Contributor

Pretty sure you forgot to reset the owner to 0 in the other branch of this if statement.

Contributor Author

Because the semaphore post signals the other waiting thread to stop waiting at line 457, the next instruction after that wait sets the appropriate owner variable. It's only necessary to set the owner here if no one else is going to.
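
A simplified sketch of that handoff, with abbreviated fields; the real recursive benaphore in core:sync differs in detail:

```odin
package example

import "core:sync"

Recursive_Benaphore_Sketch :: struct {
	counter:   i32,
	owner:     int, // the owning thread's id; 0 means unowned
	recursion: int,
	sema:      sync.Sema,
}

unlock_sketch :: proc(b: ^Recursive_Benaphore_Sketch, my_id: int) {
	assert(b.owner == my_id)
	b.recursion -= 1
	if b.recursion != 0 {
		return // still recursively held by this thread
	}
	if sync.atomic_sub_explicit(&b.counter, 1, .Relaxed) == 1 {
		// No one else was waiting, so this thread must clear the owner.
		sync.atomic_store_explicit(&b.owner, 0, .Release)
	} else {
		// Someone is blocked in lock(); the post wakes that thread, and
		// its next step is to write its own id into `owner`. Clearing
		// `owner` here would only race with that write.
		sync.sema_post(&b.sema)
	}
}
```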

Everything was already guarded by `c.mutex`.
@laytan
Collaborator

laytan commented Sep 16, 2024

I can look into the Darwin issues with this.

@gingerBill gingerBill merged commit 017d6bd into odin-lang:master Sep 16, 2024
7 checks passed