-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kernel/queue: Remove interior use of k_poll() #25906
kernel/queue: Remove interior use of k_poll() #25906
Conversation
The k_queue data structure, when CONFIG_POLL was enabled, would inexplicably use k_poll() as its blocking mechanism instead of the original wait_q/pend() code. This was actually racy, see commit b173e43. The code was structured as a condition variable: using a spinlock around the queue data before deciding to block. But unlike pend_current_thread(), k_poll() cannot atomically release a lock. A workaround had been in place for this, and then accidentally reverted (both by me!) because the code looked "wrong". This is just fragile, there's no reason to have two implementations of k_queue_get(). Remove. Note that this also removes a test case in the work_queue test where (when CONFIG_POLL was enabled, but not otherwise) it was checking for the ability to immediately cancel a delayed work item that was submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately"). This DOES NOT work with the origina/non-poll queue backend, and has never been a documented behavior of k_delayed_work_submit_to_queue() under any circumstances. I don't know why we were testing this. Fixes zephyrproject-rtos#25904 Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross can you please add a test for this or is this already covered by other tests? |
I think we need an explicit test for the failing situation in #25904: two threads waiting on a queue, something's added, both threads get woken and the last one doesn't get a value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested this PR and the reported problem is now fixed.
When CONFIG_POLL was set, it was historically true that the queue could (if a higher priority thread "stole" an insert) return a spurious NULL instead of continuing to wait on a timeout. This deliberately exercises that race. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
OK, test added. There's obviously no way from within CI to validate that this does fail on master, but trust me it does and it's fixed with the first patch. :) Note that the nature of the fix is such that the race just doesn't make sense (the original loop in k_queue never had that behavior anyway). So at some point we'll want to yank this regression test too, I suspect. But it's low overhead. |
Right, but just in case we ever go back to the old implementation, doesn't hurt to have it for now :) |
Outside of CI the test doesn't fail for me on master. But neither does the case I was concerned about. So given the fix works, it's probably good. |
FWIW: the base tests/kernel/queue test always worked. It's not using CONFIG_POLL. There's a separate prj-poll.conf that enables it, and sanitycheck is configured to run both. Basically if you run "sanitycheck -T tests/kernel/queue" you should see this case fail on that half of the test (and, IIRC, some others that happen to enable CONFIG_POLL as part of the platform config). |
The k_queue data structure, when CONFIG_POLL was enabled, would
inexplicably use k_poll() as its blocking mechanism instead of the
original wait_q/pend() code. This was actually racy, see commit
b173e43. The code was structured as a condition variable: using
a spinlock around the queue data before deciding to block. But unlike
pend_current_thread(), k_poll() cannot atomically release a lock.
A workaround had been in place for this, and then accidentally
reverted (both by me!) because the code looked "wrong".
This is just fragile, there's no reason to have two implementations of
k_queue_get(). Remove.
Note that this also removes a test case in the work_queue test where
(when CONFIG_POLL was enabled, but not otherwise) it was checking for
the ability to immediately cancel a delayed work item that was
submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately").
This DOES NOT work with the origina/non-poll queue backend, and has
never been a documented behavior of k_delayed_work_submit_to_queue()
under any circumstances. I don't know why we were testing this.
Fixes #25904
Signed-off-by: Andy Ross andrew.j.ross@intel.com