Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/pthread: Fix create/join race in thread lifecycle #57721

Closed
wants to merge 2 commits into from

Conversation

andyross
Copy link
Contributor

@andyross andyross commented May 9, 2023

The pthread_join() implementation was done using a condition variable and not k_thread_join(), and that's actually fatal. It's not possible in Zephyr to guarantee a thread has exited in a race-free way without calling either k_thread_abort() or k_thread_join().

This opened a hole where an app could call pthread_join(), which would go to sleep waiting on the condition variable, which would be signaled from the exit path of the thread, and then return to call pthread_create() (and thus k_thread_create()) on the same pthread_t object (and thus the same struct k_thread) BEFORE THE EARLIER THREAD HAD ACTUALLY EXITED.

This makes the scheduler blow up, at least on SMP (I wasn't personally able to make it happen on uniprocessor configs, though the race is present always).

Rework to call k_thread_join(). There's no other correct way to do this.

Fixes #56163

@andyross andyross requested a review from cfriedt as a code owner May 9, 2023 23:35
@zephyrbot zephyrbot added the area: POSIX POSIX API Library label May 9, 2023
@andyross
Copy link
Contributor Author

andyross commented May 9, 2023

Note that this may not be completely conformant to POSIX semantics. There's a pthread_detach() implementation using that same condition variable, and I suspect these two need to be unified. Hopefully someone with more POSIX-fu than I can explain the requirements.

But the race needs to be addressed by using k_thread_join().

@andyross
Copy link
Contributor Author

andyross commented May 9, 2023

Oh, and to be clear: I tested this with @cfriedt 's rig in #57637 which reverts some recent workarounds to make the bug more visible.

Also: the checkpatch failure is a false positive. POSIX does indeed use positive errno values.

hakehuang
hakehuang previously approved these changes May 10, 2023
Copy link
Collaborator

@hakehuang hakehuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice shot.

@cfriedt
Copy link
Member

cfriedt commented May 10, 2023

I'm still working through this. Might need to be a bit patient..

@andyross
Copy link
Contributor Author

I had a chance to look up pthread_detach(), btw. And indeed, the treatment here is insufficient. The documented behavior of pthread_detach() is that it wakes up any threads blocked in a pthread_join() call[1]. And now that that's a k_thread_join() we need some way to do that out of the scheduler (which is probably just as simple as a z_thread_wake_joiners() or something -- just a little lock around an unpend_all() call).

Let me stick a DNM on this for now, pending a little bit of thought.

[1] For the life of me, I can't understand why! The "meaning" of the call is that the OS can clean the thread up instead of holding on to things like its result code and process ID until something retrieves them (analogous to the way unix processes live as zombies until something calls wait()). But... that doesn't mean that you can't join it! This is just inventing a special edge case when none was needed. POSIX, I swear...

@andyross andyross added the DNM This PR should not be merged (Do Not Merge) label May 10, 2023
andyross added 2 commits May 10, 2023 11:43
POSIX has a somewhat odd behavior that when you call
pthread_detach()[1], it is supposed to synchronously wake up anyone
currently joining the thread.  That isn't a Zephyr kernel behavior[2],
so it needs to be exposed.

The implementation is as simple as it sounds.  This is minimal: we
don't try to keep any state around to inform the k_thread_join() call
whether the thread actually exited or if this new function was used.
That has to be provided by the outer layers (pthreads already do
this).

[1] Which tells the OS/platform that it can clean the thread up as
soon as it exits and doesn't have to keep it around as a zombie until
someone calls pthread_join() on it.

[2] Because... why!?  Not needing to keep a thread around for joiners
isn't the same thing as disallowing joins!

Signed-off-by: Andy Ross <andyross@google.com>
The pthread_join() implementation was done using a condition variable
and not k_thread_join(), and that's actually fatal.  It's not possible
in Zephyr to guarantee a thread has exited in a race-free way without
calling either k_thread_abort() or k_thread_join().

This opened a hole where an app could call pthread_join(), which would
go to sleep waiting on the condition variable, which would be signaled
from the exit path of the thread, and then return to call
pthread_create() (and thus k_thread_create()) on the same pthread_t
object (and thus the same struct k_thread) BEFORE THE EARLIER THREAD
HAD ACTUALLY EXITED.

This makes the scheduler blow up, at least on SMP (I wasn't personally
able to make it happen on uniprocessor configs, though the race is
present always).

Rework to call k_thread_join().  There's no other correct way to do
this.

Fixes zephyrproject-rtos#56163

Signed-off-by: Andy Ross <andyross@google.com>
@andyross andyross force-pushed the pthread-join-race branch from ef0e424 to 48596a1 Compare May 10, 2023 19:38
@andyross andyross requested review from dcpleung and nashif as code owners May 10, 2023 19:38
@andyross andyross removed the DNM This PR should not be merged (Do Not Merge) label May 10, 2023
@zephyrbot zephyrbot requested review from ceolin and peter-mitsis May 10, 2023 19:39
@andyross
Copy link
Contributor Author

New version, updated as described. This one seems to be working for me both through a full twister and on the test rig. Please re-review.

@cfriedt
Copy link
Member

cfriedt commented May 11, 2023

New version, updated as described. This one seems to be working for me both through a full twister and on the test rig. Please re-review.

I unfortunately force-pushed on my branch, but I should be able to test it out without the additional commit.

Will give it a shot shortly.

@andyross
Copy link
Contributor Author

Close this. Patches got folded into #57637 and discussion continues there

@andyross andyross closed this May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

posix: pthread: race condition between pthread_create() and pthread_join()
4 participants