lib/pthread: Fix create/join race in thread lifecycle #57721

andyross · 2023-05-09T23:35:50Z

The pthread_join() implementation was done using a condition variable and not k_thread_join(), and that's actually fatal. It's not possible in Zephyr to guarantee a thread has exited in a race-free way without calling either k_thread_abort() or k_thread_join().

This opened a hole where an app could call pthread_join(), which would go to sleep waiting on the condition variable, which would be signaled from the exit path of the thread, and then return to call pthread_create() (and thus k_thread_create()) on the same pthread_t object (and thus the same struct k_thread) BEFORE THE EARLIER THREAD HAD ACTUALLY EXITED.

This makes the scheduler blow up, at least on SMP (I wasn't personally able to make it happen on uniprocessor configs, though the race is present always).

Rework to call k_thread_join(). There's no other correct way to do this.

Fixes #56163

andyross · 2023-05-09T23:39:27Z

Note that this may not be completely conformant to POSIX semantics. There's a pthread_detach() implementation using that same condition variable, and I suspect these two need to be unified. Hopefully someone with more POSIX-fu than I can explain the requirements.

But the race needs to be addressed by using k_thread_join().

andyross · 2023-05-09T23:44:04Z

Oh, and to be clear: I tested this with @cfriedt 's rig in #57637 which reverts some recent workarounds to make the bug more visible.

Also: the checkpatch failure is a false positive. POSIX does indeed use positive errno values.

hakehuang

nice shot.

cfriedt · 2023-05-10T15:02:10Z

I'm still working through this. Might need to be a bit patient..

andyross · 2023-05-10T18:03:55Z

I had a chance to look up pthread_detach(), btw. And indeed, the treatment here is insufficient. The documented behavior of pthread_detach() is that it wakes up any threads blocked in a pthread_join() call[1]. And now that that's a k_thread_join() we need some way to do that out of the scheduler (which is probably just as simple as a z_thread_wake_joiners() or something -- just a little lock around an unpend_all() call).

Let me stick a DNM on this for now, pending a little bit of thought.

[1] For the life of me, I can't understand why! The "meaning" of the call is that the OS can clean the thread up instead of holding on to things like its result code and process ID until something retrieves them (analogous to the way unix processes live as zombies until something calls wait()). But... that doesn't mean that you can't join it! This is just inventing a special edge case when none was needed. POSIX, I swear...

POSIX has a somewhat odd behavior that when you call pthread_detach()[1], it is supposed to synchronously wake up anyone currently joining the thread. That isn't a Zephyr kernel behavior[2], so it needs to be exposed. The implementation is as simple as it sounds. This is minimal: we don't try to keep any state around to inform the k_thread_join() call whether the thread actually exited or if this new function was used. That has to be provided by the outer layers (pthreads already do this). [1] Which tells the OS/platform that it can clean the thread up as soon as it exits and doesn't have to keep it around as a zombie until someone calls pthread_join() on it. [2] Because... why!? Not needing to keep a thread around for joiners isn't the same thing as disallowing joins! Signed-off-by: Andy Ross <andyross@google.com>

The pthread_join() implementation was done using a condition variable and not k_thread_join(), and that's actually fatal. It's not possible in Zephyr to guarantee a thread has exited in a race-free way without calling either k_thread_abort() or k_thread_join(). This opened a hole where an app could call pthread_join(), which would go to sleep waiting on the condition variable, which would be signaled from the exit path of the thread, and then return to call pthread_create() (and thus k_thread_create()) on the same pthread_t object (and thus the same struct k_thread) BEFORE THE EARLIER THREAD HAD ACTUALLY EXITED. This makes the scheduler blow up, at least on SMP (I wasn't personally able to make it happen on uniprocessor configs, though the race is present always). Rework to call k_thread_join(). There's no other correct way to do this. Fixes zephyrproject-rtos#56163 Signed-off-by: Andy Ross <andyross@google.com>

andyross · 2023-05-10T19:39:52Z

New version, updated as described. This one seems to be working for me both through a full twister and on the test rig. Please re-review.

cfriedt · 2023-05-11T12:36:34Z

New version, updated as described. This one seems to be working for me both through a full twister and on the test rig. Please re-review.

I unfortunately force-pushed on my branch, but I should be able to test it out without the additional commit.

Will give it a shot shortly.

andyross · 2023-05-12T17:48:00Z

Close this. Patches got folded into #57637 and discussion continues there

andyross requested a review from cfriedt as a code owner May 9, 2023 23:35

zephyrbot added the area: POSIX POSIX API Library label May 9, 2023

zephyrbot assigned cfriedt May 9, 2023

andyross requested review from hakehuang, evgeniy-paltsev and dleach02 May 9, 2023 23:36

hakehuang previously approved these changes May 10, 2023

View reviewed changes

andyross added the DNM This PR should not be merged (Do Not Merge) label May 10, 2023

andyross added 2 commits May 10, 2023 11:43

andyross dismissed hakehuang’s stale review via 48596a1 May 10, 2023 19:38

andyross force-pushed the pthread-join-race branch from ef0e424 to 48596a1 Compare May 10, 2023 19:38

andyross requested review from dcpleung and nashif as code owners May 10, 2023 19:38

andyross removed the DNM This PR should not be merged (Do Not Merge) label May 10, 2023

zephyrbot added the area: Kernel label May 10, 2023

zephyrbot requested review from ceolin and peter-mitsis May 10, 2023 19:39

cfriedt mentioned this pull request May 12, 2023

posix: pthread: correct race condition between pthread_create and pthread_join #57637

Closed

andyross closed this May 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lib/pthread: Fix create/join race in thread lifecycle #57721

lib/pthread: Fix create/join race in thread lifecycle #57721

andyross commented May 9, 2023

andyross commented May 9, 2023

andyross commented May 9, 2023

hakehuang left a comment

cfriedt commented May 10, 2023

andyross commented May 10, 2023

andyross commented May 10, 2023

cfriedt commented May 11, 2023

andyross commented May 12, 2023

lib/pthread: Fix create/join race in thread lifecycle #57721

lib/pthread: Fix create/join race in thread lifecycle #57721

Conversation

andyross commented May 9, 2023

andyross commented May 9, 2023

andyross commented May 9, 2023

hakehuang left a comment

Choose a reason for hiding this comment

cfriedt commented May 10, 2023

andyross commented May 10, 2023

andyross commented May 10, 2023

cfriedt commented May 11, 2023

andyross commented May 12, 2023