Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel/queue: Remove interior use of k_poll() #25906

Merged
merged 2 commits into from
Jun 2, 2020

Conversation

andyross
Copy link
Contributor

@andyross andyross commented Jun 2, 2020

The k_queue data structure, when CONFIG_POLL was enabled, would
inexplicably use k_poll() as its blocking mechanism instead of the
original wait_q/pend() code. This was actually racy, see commit
b173e43. The code was structured as a condition variable: using
a spinlock around the queue data before deciding to block. But unlike
pend_current_thread(), k_poll() cannot atomically release a lock.

A workaround had been in place for this, and then accidentally
reverted (both by me!) because the code looked "wrong".

This is just fragile, there's no reason to have two implementations of
k_queue_get(). Remove.

Note that this also removes a test case in the work_queue test where
(when CONFIG_POLL was enabled, but not otherwise) it was checking for
the ability to immediately cancel a delayed work item that was
submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately").
This DOES NOT work with the origina/non-poll queue backend, and has
never been a documented behavior of k_delayed_work_submit_to_queue()
under any circumstances. I don't know why we were testing this.

Fixes #25904

Signed-off-by: Andy Ross andrew.j.ross@intel.com

The k_queue data structure, when CONFIG_POLL was enabled, would
inexplicably use k_poll() as its blocking mechanism instead of the
original wait_q/pend() code.  This was actually racy, see commit
b173e43.  The code was structured as a condition variable: using
a spinlock around the queue data before deciding to block.  But unlike
pend_current_thread(), k_poll() cannot atomically release a lock.

A workaround had been in place for this, and then accidentally
reverted (both by me!) because the code looked "wrong".

This is just fragile, there's no reason to have two implementations of
k_queue_get().  Remove.

Note that this also removes a test case in the work_queue test where
(when CONFIG_POLL was enabled, but not otherwise) it was checking for
the ability to immediately cancel a delayed work item that was
submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately").
This DOES NOT work with the origina/non-poll queue backend, and has
never been a documented behavior of k_delayed_work_submit_to_queue()
under any circumstances.  I don't know why we were testing this.

Fixes zephyrproject-rtos#25904

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross andyross requested review from carlescufi and joerchan June 2, 2020 16:33
@zephyrbot zephyrbot added area: API Changes to public APIs area: Tests Issues related to a particular existing or missing test area: Kernel labels Jun 2, 2020
@nashif
Copy link
Member

nashif commented Jun 2, 2020

@andyross can you please add a test for this or is this already covered by other tests?

@carlescufi
Copy link
Member

carlescufi commented Jun 2, 2020

@andyross Thanks for the patch. After discussing this PR in the release readiness meeting, we would like to have a test that verifies this doesn't reappear before merging the PR.
EDIT: Note that @joerchan can now cherry-pick this to proceed with the investigation of #23364

@pabigot
Copy link
Collaborator

pabigot commented Jun 2, 2020

I think we need an explicit test for the failing situation in #25904: two threads waiting on a queue, something's added, both threads get woken and the last one doesn't get a value.

Copy link
Contributor

@joerchan joerchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this PR and the reported problem is now fixed.

When CONFIG_POLL was set, it was historically true that the queue
could (if a higher priority thread "stole" an insert) return a
spurious NULL instead of continuing to wait on a timeout.

This deliberately exercises that race.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross
Copy link
Contributor Author

andyross commented Jun 2, 2020

OK, test added. There's obviously no way from within CI to validate that this does fail on master, but trust me it does and it's fixed with the first patch. :)

Note that the nature of the fix is such that the race just doesn't make sense (the original loop in k_queue never had that behavior anyway). So at some point we'll want to yank this regression test too, I suspect. But it's low overhead.

@carlescufi
Copy link
Member

OK, test added. There's obviously no way from within CI to validate that this does fail on master, but trust me it does and it's fixed with the first patch. :)

Note that the nature of the fix is such that the race just doesn't make sense (the original loop in k_queue never had that behavior anyway). So at some point we'll want to yank this regression test too, I suspect. But it's low overhead.

Right, but just in case we ever go back to the old implementation, doesn't hurt to have it for now :)

@pabigot
Copy link
Collaborator

pabigot commented Jun 2, 2020

OK, test added. There's obviously no way from within CI to validate that this does fail on master, but trust me it does and it's fixed with the first patch. :)

Outside of CI the test doesn't fail for me on master. But neither does the case I was concerned about. So given the fix works, it's probably good.

@andyross
Copy link
Contributor Author

andyross commented Jun 2, 2020

FWIW: the base tests/kernel/queue test always worked. It's not using CONFIG_POLL.

There's a separate prj-poll.conf that enables it, and sanitycheck is configured to run both. Basically if you run "sanitycheck -T tests/kernel/queue" you should see this case fail on that half of the test (and, IIRC, some others that happen to enable CONFIG_POLL as part of the platform config).

@carlescufi carlescufi merged commit 7ff3f8a into zephyrproject-rtos:master Jun 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Kernel area: Tests Issues related to a particular existing or missing test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kernel: k_queue_get return NULL before timeout
7 participants