kernel/queue: Remove interior use of k_poll() #25906

andyross · 2020-06-02T16:33:34Z

The k_queue data structure, when CONFIG_POLL was enabled, would
inexplicably use k_poll() as its blocking mechanism instead of the
original wait_q/pend() code. This was actually racy, see commit
b173e43. The code was structured as a condition variable: using
a spinlock around the queue data before deciding to block. But unlike
pend_current_thread(), k_poll() cannot atomically release a lock.

A workaround had been in place for this, and then accidentally
reverted (both by me!) because the code looked "wrong".

This is just fragile, there's no reason to have two implementations of
k_queue_get(). Remove.

Note that this also removes a test case in the work_queue test where
(when CONFIG_POLL was enabled, but not otherwise) it was checking for
the ability to immediately cancel a delayed work item that was
submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately").
This DOES NOT work with the origina/non-poll queue backend, and has
never been a documented behavior of k_delayed_work_submit_to_queue()
under any circumstances. I don't know why we were testing this.

Fixes #25904

Signed-off-by: Andy Ross andrew.j.ross@intel.com

The k_queue data structure, when CONFIG_POLL was enabled, would inexplicably use k_poll() as its blocking mechanism instead of the original wait_q/pend() code. This was actually racy, see commit b173e43. The code was structured as a condition variable: using a spinlock around the queue data before deciding to block. But unlike pend_current_thread(), k_poll() cannot atomically release a lock. A workaround had been in place for this, and then accidentally reverted (both by me!) because the code looked "wrong". This is just fragile, there's no reason to have two implementations of k_queue_get(). Remove. Note that this also removes a test case in the work_queue test where (when CONFIG_POLL was enabled, but not otherwise) it was checking for the ability to immediately cancel a delayed work item that was submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately"). This DOES NOT work with the origina/non-poll queue backend, and has never been a documented behavior of k_delayed_work_submit_to_queue() under any circumstances. I don't know why we were testing this. Fixes zephyrproject-rtos#25904 Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

nashif · 2020-06-02T17:09:00Z

@andyross can you please add a test for this or is this already covered by other tests?

carlescufi · 2020-06-02T17:09:54Z

@andyross Thanks for the patch. After discussing this PR in the release readiness meeting, we would like to have a test that verifies this doesn't reappear before merging the PR.
EDIT: Note that @joerchan can now cherry-pick this to proceed with the investigation of #23364

pabigot · 2020-06-02T17:11:36Z

I think we need an explicit test for the failing situation in #25904: two threads waiting on a queue, something's added, both threads get woken and the last one doesn't get a value.

joerchan

Tested this PR and the reported problem is now fixed.

When CONFIG_POLL was set, it was historically true that the queue could (if a higher priority thread "stole" an insert) return a spurious NULL instead of continuing to wait on a timeout. This deliberately exercises that race. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

andyross · 2020-06-02T18:19:29Z

OK, test added. There's obviously no way from within CI to validate that this does fail on master, but trust me it does and it's fixed with the first patch. :)

Note that the nature of the fix is such that the race just doesn't make sense (the original loop in k_queue never had that behavior anyway). So at some point we'll want to yank this regression test too, I suspect. But it's low overhead.

carlescufi · 2020-06-02T18:22:33Z

OK, test added. There's obviously no way from within CI to validate that this does fail on master, but trust me it does and it's fixed with the first patch. :)

Note that the nature of the fix is such that the race just doesn't make sense (the original loop in k_queue never had that behavior anyway). So at some point we'll want to yank this regression test too, I suspect. But it's low overhead.

Right, but just in case we ever go back to the old implementation, doesn't hurt to have it for now :)

pabigot · 2020-06-02T18:53:54Z

OK, test added. There's obviously no way from within CI to validate that this does fail on master, but trust me it does and it's fixed with the first patch. :)

Outside of CI the test doesn't fail for me on master. But neither does the case I was concerned about. So given the fix works, it's probably good.

andyross · 2020-06-02T19:25:43Z

FWIW: the base tests/kernel/queue test always worked. It's not using CONFIG_POLL.

There's a separate prj-poll.conf that enables it, and sanitycheck is configured to run both. Basically if you run "sanitycheck -T tests/kernel/queue" you should see this case fail on that half of the test (and, IIRC, some others that happen to enable CONFIG_POLL as part of the platform config).

andyross requested review from andrewboie and nashif as code owners June 2, 2020 16:33

andyross requested review from carlescufi and joerchan June 2, 2020 16:33

zephyrbot added area: API Changes to public APIs area: Tests Issues related to a particular existing or missing test area: Kernel labels Jun 2, 2020

joerchan approved these changes Jun 2, 2020

View reviewed changes

andyross mentioned this pull request Jun 2, 2020

kernel/queue: Re-introduce race workaround #25908

Closed

nashif approved these changes Jun 2, 2020

View reviewed changes

carlescufi approved these changes Jun 2, 2020

View reviewed changes

andrewboie approved these changes Jun 2, 2020

View reviewed changes

carlescufi merged commit 7ff3f8a into zephyrproject-rtos:master Jun 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kernel/queue: Remove interior use of k_poll() #25906

kernel/queue: Remove interior use of k_poll() #25906

andyross commented Jun 2, 2020

nashif commented Jun 2, 2020

carlescufi commented Jun 2, 2020 •

edited

Loading

pabigot commented Jun 2, 2020

joerchan left a comment

andyross commented Jun 2, 2020

carlescufi commented Jun 2, 2020

pabigot commented Jun 2, 2020

andyross commented Jun 2, 2020

kernel/queue: Remove interior use of k_poll() #25906

kernel/queue: Remove interior use of k_poll() #25906

Conversation

andyross commented Jun 2, 2020

nashif commented Jun 2, 2020

carlescufi commented Jun 2, 2020 • edited Loading

pabigot commented Jun 2, 2020

joerchan left a comment

Choose a reason for hiding this comment

andyross commented Jun 2, 2020

carlescufi commented Jun 2, 2020

pabigot commented Jun 2, 2020

andyross commented Jun 2, 2020

carlescufi commented Jun 2, 2020 •

edited

Loading