Add test suites for `core:sync` and `core:sync/chan` #4232
Conversation
The calls to `atomic_add*` return the value before adding, not after, so the previous code was causing an occasional data race.
This was occurring about 1 in 100 runs with the test runner's thread pool.
One less value to store, and it should be less of a hack too. Semaphores will not wait around if they have the go-ahead; they depend on an internal value being non-zero, instead of whatever was loaded when they started waiting, which is the case with a `Cond`.
This can prevent a data race on Linux with `Self_Cleanup`.
This will also keep messages from being sent to closed, buffered channels in general.
`w_waiting` is the signal that says a caller is waiting to be able to send something. It is incremented upon send, and in the case of an unbuffered channel, the channel can only hold one message. Therefore, check that `w_waiting` is zero instead.
A thread created inside a test does not share the test index of its parent, so any time one of those threads failed an assert, it would tell the runner to shut down test index zero.
Previously, if the owner called this, it would fail.
Regarding the NetBSD CI failure:
I saw this come up intermittently before in my repo's CI, and it only started happening after I fixed the CPU count detection in
Ah, looks like I wrote that wrong; the loop should check for EINTR and continue.
Regarding the stalls on Darwin, I want to mention that I tried porting the usage of the
Just a heads up: I'm not sure about the FreeBSD and NetBSD kernels, but on Linux, running multiple times sometimes does not help, since the scheduler seemed very deterministic. I remember when debugging some of my multithreaded code, if a bug occurred once it was occurring every single time, and if it never occurred I had trouble triggering it. Not sure what the solution is, except running a shitton of times while inserting calls to
I've been doing the bulk of my development and testing on Linux 6.10, and running multiple times often produced different results, especially for non-deterministic deadlocks. The deadlock I commented about for the
core/sync/chan/chan.odin
Outdated

```diff
@@ -164,12 +164,17 @@ send_raw :: proc "contextless" (c: ^Raw_Chan, msg_in: rawptr) -> (ok: bool) {
 	}
 	if c.queue != nil { // buffered
 		sync.guard(&c.mutex)
-		for c.queue.len == c.queue.cap {
+		for !sync.atomic_load(&c.closed) &&
```
Does it really need to be atomically loaded? Every place where `c.closed` is accessed is guarded by `c.mutex`, meaning access to the variable is exclusive, and acquire-release semantics should apply across other threads accessing the same critical sections.
Help me understand the logic behind the change.
On second look, we might be able to change all `c.closed` loads to non-atomic, not just this one. I was following convention on this change, since the other accesses are atomic.
Yeah, this should save an extra `lwsync` operation on ARM, iirc.
In the end, I removed all atomic loads and stores, as well as the other two mutexes. Everything is under guard by `mutex` already.
```odin
if recursion == 0 {
	if atomic_sub_explicit(&b.counter, 1, .Relaxed) == 1 {
		atomic_store_explicit(&b.owner, 0, .Release)
	} else {
```
Pretty sure you forgot to reset the owner to 0 in the other branch of this if statement.
By virtue of the semaphore post signalling the waiting thread to wake at line 457, the next instruction sets the appropriate owner variable. It's only necessary to set the owner here if no one else is going to.
Everything was already guarded by `c.mutex`.
Force-pushed from 7db0474 to d38f5ff
I can look into the Darwin issues with this.
This patch required much tinkering to get right. In the process, I've fixed over a dozen synchronization bugs, including the failure to detect more than one core (!) on FreeBSD and NetBSD in the core library and, for FreeBSD only, in the Odin compiler itself. That bugfix alone improves the test runner's speed on those two platforms.

Of note, I converted @laytan's hack in the POSIX thread startup proc to use a semaphore instead of a mutex/condition variable, which should be more robust since it depends on a non-zero state instead of needing to loop on the condition.

I went over the entire test suite with `-sanitize:thread` and eliminated every data race I could find, and with the test suite now testing every primitive, I was able to eliminate several deadlocks. I rewrote the recursive benaphore because I was having trouble avoiding deadlocks with the particular order it took to modifying state. I think it's more robust now, but I'm not 100% certain.

I've run these tests locally on Linux, FreeBSD, and NetBSD several hundred times, with and without `-sanitize:thread` (on Linux only, of course). Given the nature of sync issues, it's necessary to run them multiple times just to be sure.

Unfortunately, one of the `core:sync` tests isn't completing on the Darwin CI, which stalls the entire test runner because (I think) the runner is having an issue triggering the failure timeout via pthread cancel, whose unreliability on Darwin we've discussed previously. I'm not sure what the actual underlying issue is, with regard to which test or tests are causing a deadlock. I don't have a local Darwin machine to investigate, but if someone could take a look at it, that'd be great. For now, I've marked the test suite as disabled on Darwin.

To summarize, we now have basic coverage for every `core:sync` primitive as well as `chan.Chan`. There may still be some edge cases for `chan`, but I fixed the ones I could find.