Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: add new work queue implementation #29618

Merged
merged 13 commits into from
Mar 4, 2021

Conversation

pabigot
Copy link
Collaborator

@pabigot pabigot commented Oct 28, 2020

This PR replaces #28891 as a candidate solution for #27356 by addressing the concerns identified in #28891 (comment).

It provides a complete reimplementation of the work queue infrastructure intended to eliminate the race conditions and feature gaps in the existing implementation. The general functionality is based on that provided by Linux.

Both bare and delayable work structures are supported. Items can be submitted; delayable items can be scheduled for submission at a future time. Items can be delayed, queued, and running all at the same time. A running item can also be canceling.

The new implementation:

  • replaces "pending" with "busy" which identifies the active states;
  • supports canceling delayed and submitted items;
  • prevents resubmission of a item being canceled until cancellation completes;
  • supports waiting for cancellation to complete;
  • supports flushing a work item (waiting for the last submission to complete without preventing resubmission);
  • supports waiting for a queue to drain (only allows resubmission from the work thread);
  • prevents handler-reentrancy during resubmission.

Details on the API changes are in this comment.

Fixes #27356

Plan for Merge

There is a strong desire to get this in to 2.5 so that the legacy API could be removed for 2.6 (LTS) if TSC approves an exception. IIRC discussion off-line suggested that this API could be restricted to supervisory mode use, however there is a large open question regarding how/whether user mode needs a similar capability.

Current steps required to merge this (rm DRAFT are conditions that block removing the DNM status from this)

  • Merge dependent PR kernel: const-qualify objects used to calculate delay values #30073: kernel: const-qualify objects used to calculate delay values
  • Remove the old implementation (currently unselected by Kconfig): NOTE This causes massive irrelevant changes to be observed in the start to finish diff of include/kernel.h. See the individual commits to confirm that only the old API was removed.
  • Document which parts of the legacy API will remain, and which parts are to be deprecated. See this comment.
  • Full test and code coverage
  • Determine the degree to which k_work_submit_to_user_queue must be supported. (Needs input from out-of-tree users: Zephyr itself does not use it). RESOLUTION: provide new API with the same function names and behavior, but different signatures since k_work and k_workq cannot be shared between supervisor and userspace interfaces.)
  • If necessary, provide a deprecated work around for legacy user queue API. RESOLUTION Not applicable; the replacement API will not be backward compatible in signature, but will be compatible in implementation.
  • Determine whether WIP kernel: define and use a formal scheduling API #29668 is going anywhere and finalize the method for handling scheduling in the implementation. RESOLUTION: Extracted the used functions as kernel private functions until we confirm they're suitable for public exposure.
  • Pass CI on all platforms (confirmed possible)
  • Confirm that deprecated usages in-tree (largely networking and bluetooth) can be successfully replaced by the new API, probably via draft PRs based on this PR.

@github-actions github-actions bot added area: API Changes to public APIs area: Kernel area: Tests Issues related to a particular existing or missing test labels Oct 28, 2020
@pabigot pabigot requested a review from andrewboie October 28, 2020 18:37
@pabigot
Copy link
Collaborator Author

pabigot commented Oct 28, 2020

For posterity here are the design notes prior to implementation, which may provide some high-level insight of intent. The material is obsolete as implementation often identified a better path or desired behavior.

Expand for content

Design questions:

  • Is there a need to be able to dynamically create and destroy workqueue objects and threads? Absent compelling justification the answer is "no".

Queue States and Actions

Design requirements:

Terminology

chained work is work that is submitted from a work queue thread to the same work queue.

States

A queue has a lock that protects the other members of the state.

A queue has a work list which is a sequence of items to be processed.

A queue has a drain waitq that holds threads that are waiting for it to drain.

A queue has a set of flags including:

  • draining which indicates that it has been instructed accept only chained work.

Actions

A submit action [any] provides an item. It is invoked when the global
work item lock is held. Atomically:

  • If the queue is not draining or the work item is chained:
    • the item is added to the work list;
    • the work item queued bit is set;
    • the work thread is notified
      The action result indicates whether the inner actions were taken.

A cancel action [external] provides an item. It is invoked when the global work item lock is held. Atomically:

  • If the item is removed from the queue:
    • the work item queued bit is cleared;
    • the work thread is notified

A check action [internal] is invoked whenever the work thread is notified or completes a process action. Atomically:

  • Any pending notification is cleared.
  • If no work item is available:
    • if the queue is draining the drain waitq is woken
  • If a work item is fetched then:
    • the work item queued bit is cleared;
    • the work item running bit is set;
      The lock is released. If a work item was fetched a process action is initiated, otherwise the work thread waits for notification. On completion of either the check action is re-initiated.

A process action [internal] executes the handler of a work item that has been marked running. When the handler returns, atomically:

  • The item running state is cleared;
  • The item cancelling state is cleared;
  • All processes waiting for completion of the work item are made ready.

A drain action [external] indicates that the queue needs to be drained to empty. Atomically:

  • If the work list is empty and the work thread is idle, return false; otherwise
  • Mark the queue draining and pend on the drain waitq.

A flush action [external] indicates that the caller must wait until completion of a given work item should be signalled.

Minutiae

Flush and synchronous cancel operations involve pushing a marker onto the work list immediately after whatever work item is being flushed or cancelled.

Item States and Actions

Design requirements:

  • It must be possible to synchronously cancel a non-delayable work item in the sense of Linux cancel_work_sync.
    (If queued, remove from queue. Otherwise block attempts to resubmit and flush.)

    Note: This differs from flush_work in that any attempts to resubmit the work before cancellation completes will be rejected.

  • {TBD} It must be possible to asynchronously (ISR) cancel a delayable work item in the sense of Linux cancel_delayed_work.
    (If scheduled, abort timeout. If queued, remove from queue.)

  • It must be possible to synchronously cancel a delayable work item in the sense of Linux cancel_delayed_work_sync.

  • It must be possible to synchronously flush a non-delayable work item in the sense of Linux flush_work. (Basically inserts a notifier in the work queue immediately after the
    work item.)

  • It must be possible to synchronously flush a delayable work item in the sense of Linux flush_delayed_work. (If delayed, the timeout is cancelled and the item immediately queued. Otherwise equivalent to flush_work().)

States

There is a global work item lock that protects the flag

A work item has a node that allows it to be linked into work lists.

A work item has a handler function that is invoked when it is de-queued by a work thread.

A work item has a set of flags including:

  • delayed if there is a timeout that will submit the item to a specific workqueue when the delay expires;
  • queued if it is on the queue for a work thread to process;
  • running (??) if a work thread is running its handler (this is required to ensure non-reentrancy when submitting. It may be derived from a non-null workq pointer in the work object.)
  • cancelling if it has been cancelled and is also running.
  • flusher if the item is a flush marker;

An item is owned if it is in at least one state of delayed, queued, or cancelling.

Note that idle is the only distinct state: a work item may be simultaneously delayed, queued, and running.

Events

A submit action (ISR-OK) provides an item and a workqueue. Atomically:

  • If the item is running, the provided workqueue is replaced with the workqueue running the item; then
  • If the item is not queued and not cancelling, it is submitted to the workqueue. If the submission is successful the item is marked queued.

The action result indicates whether the item was queued.

A process action from the work thread mutates the queued and running flags.

A complete action from the work thread mutates the running and cancelling flags.

A schedule action (ISR-OK) provides a work item, a workqueue, and a delay. Atomically:

  • If the item is not owned:
    • if the delay is zero, a submit action is initiated for the item and the workqueue; otherwise
    • a timeout is scheduled to perform a submit operation for the item and the workqueue after the delay expires
      The action result indicates whether the item was owned.

A reschedule action (ISR-OK) provides a work item, a workqueue, and a delay. Atomically:

  • if the item is delayed, the timeout is cancelled and the delayed flag is cleared;
  • evaluate the body of the schedule action when the item is not owned.
    The action result indicates whether the item was originally owned.

A flush action (thread-only) provides a work item.

  • If the item is running, the calling thread blocks on the waitq for notification when the item completes.
    Note: Nothing prevents scheduling or submitting the item again; the notification occurs only when then the currently running instance completes.

A cancel action (ISR-OK) provides a work item. Atomically:

  • If the item is delayed, its timeout is cancelled and the delayed state cleared;
  • If the item is queued, it is removed from its queue and the queued state cleared;
  • If the item is running, it is marked as cancelling.

A cancel-sync action (thread-only) provides a work item. Atomically:

  • a cancel action is executed; then
  • if the item is marked cancelling a flush action is initiated.

@pabigot

This comment has been minimized.

@eanderlind
Copy link

@pabigot Rather than redefine all the k_*work structs (and thus all the apis), could you retain
most of the existing struct definitions and use CONTAINER_OF() macro (if permitted in kernel) to find
an encapsulating new structure that contains any new fields?
Eg rather than adding struct k_work2_queue *queue; to k_work (forcing lots of api redfinitions), can one use CONTAINER_OF to find the queue pointer in k_work2_delayable (or even better k_work_delayable?
I suspect you could use this technique to limit number of new structs to one or two. This would simplify migration to the new api. (Eg k_work2_queue could be defined using pre-existing structs k_queue, etc

@pabigot
Copy link
Collaborator Author

pabigot commented Nov 2, 2020

@pabigot Rather than redefine all the k_*work structs (and thus all the apis), could you retain
most of the existing struct definitions and use CONTAINER_OF() macro (if permitted in kernel) to find
an encapsulating new structure that contains any new fields?

Probably not. At the moment my goal is confirming that this meets the desired capabilities (i.e. avoids the breakage of the race conditions) and is acceptable to the kernel folks. Next step is seeing whether the existing API can be implemented in terms of the new one. I'm not particularly interested in preserving the actual content of existing structures.

@eanderlind
Copy link

@pabigot Another attempt to explain. Due to the definition of a new k_work2 struct, it looks like a system that supports the new and old work queue apis simultaneously would require two system work queues in parallel with separate threads for each (since k_work2_queue_start() creates the new thread.) This situation is very likely during a multi-release transition period. A better architecture would be to have a single shared system queue, since it is much easier to reason about and debug. It also avoids frequent swapping between the two threads. To discuss whether there is an internal flag to check that caller doesn't mix and match new and old api calls. Keeping k_work as is would permit reducing the number of new apis and reduce the amount of effort for application/driver maintainers to migrate to the new api. Apis such as k_work_handler and k_work_init could be shared between both api versions.

@pabigot
Copy link
Collaborator Author

pabigot commented Nov 3, 2020

Another attempt to respond:

The existing API does not meet community needs. There is no way to precisely determine the state of any work item, nor to be informed when it's safe to shut down a work queue, nor to submit another work item without risking duplicate handling because of race conditions.

This PR provides the work queue capabilities that I believe Zephyr needs, with many desirable functional improvements, grounded on an architectural approach from Linux that's proven to be generally satisfactory.

The first milestone is whether this implementation is acceptable to the kernel folks on technical grounds. The second milestone is whether the existing API can be implemented in terms of it without significant behavioral differences. The third milestone is whether the new capabilities are exposed in a way that fits how applications will want to use them.

We're at the first milestone. That's why this is in draft.

Yes, in an ideal world we would have only one workqueue implementation: there would not be a k_work2*, just k_work* with different internals and new API adjacent to the existing API. That's a goal. It is not yet clear that can be reached (there are very weird behaviors that application code may be depending on), but if it can, it isn't going to be accomplished by extending the existing implementation. The data structures and state management it uses are the wrong ones; for example k_queue is a locked data-structure, but precise tracking of work state requires a lock with wider scope, so k_queue gives no benefit over sys_slist_t.

If we're stuck at the first milestone for more than a week it'll probably work back to the top of my queue and I'll start on checking compatibility, but for now I'm working other issues while waiting for kernel review.

@pabigot

This comment has been minimized.

@pabigot pabigot force-pushed the nordic/work2 branch 4 times, most recently from 211b5cd to 551aba9 Compare November 19, 2020 17:18
@nashif
Copy link
Member

nashif commented Nov 19, 2020

@pabigot not sure what you did to kernel.h, but it is almost changing every line and makes it unreviewable here, it looks like it was completely rewritten:

include/kernel.h       | 2687 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------

@pabigot
Copy link
Collaborator Author

pabigot commented Nov 19, 2020

@pabigot not sure what you did to kernel.h, but it is almost changing every line and makes it unreviewable here, it looks like it was completely rewritten:

It's actually only moving a large block that had k_work down below k_mutex and others, then adding the block that has `k_work2, but because there's so much similarity across the functions the diff ends up being horrifying. Before this comes out of draft it'll get cleaned up.

@nashif nashif requested a review from andyross November 20, 2020 13:07
@pabigot pabigot force-pushed the nordic/work2 branch 7 times, most recently from 5dba9cf to 711c89e Compare November 21, 2020 19:14
@pabigot pabigot added the DNM This PR should not be merged (Do Not Merge) label Mar 2, 2021
@github-actions github-actions bot added the area: Timer Timer label Mar 2, 2021
@pabigot
Copy link
Collaborator Author

pabigot commented Mar 2, 2021

Can you fix the missing blank line in the compliance check?

I didn't introduce or even touch change that line, but yes, I've added a commit that fixes it in the apparent aggregate changes so checkpatch shuts up.

I've also added DNM commit to revert the commit that #32724 suggests is the cause of the CI failures, to see if they go away (they do locally). PR is tagged DNM until that can be removed, but IMO should not be merged until the problem is fixed, as any other PRs that happen to trigger the work queue tests would risk getting CI failures that aren't related to the PR content.

@pabigot
Copy link
Collaborator Author

pabigot commented Mar 2, 2021

Revert submitted as #32790.

tejlmand and others added 13 commits March 2, 2021 14:53
This is needed because Zephyr libraries created after
`find_package(Zephyr)` are not pure Zephyr libraries, as they are not
in the `whole-archive` linking grouping.

Signed-off-by: Torsten Rasmussen <Torsten.Rasmussen@nordicsemi.no>
The work poll API is defined in terms of the k_work API.  Shift a
structure definition around so it's not within the details of a
specific k_work API implementation.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
These functions are a subset of proposed public APIs to clean up
several issues related to safely handling waking of threads.  They
have been made private as they interface may change, but their use
will simplify the reimplementation of the k_work functionality.

See: zephyrproject-rtos#29668

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Attempts to reimplement the existing work API using a new work
implementation failed, primarily due to heavy use of whitebox testing
in validating the original API.  Add a temporary Kconfig that will
select between the two implementations so we can use the same
identifiers but select which implementation they reference.

This commit just adds the selection infrastructure and uses it to
conditionalize the existing implementation in anticipation of the new
one in the next commit.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
This commit provides a complete reimplementation of the work queue
infrastructure intended to eliminate the race conditions and feature
gaps in the existing implementation.

Both bare and delayable work structures are supported.  Items can be
submitted; delayable items can be scheduled for submission at a future
time.  Items can be delayed, queued, and running all at the same time.
A running item can also be canceling.

The new implementation:
* replaces "pending" with "busy" which identifies the active states;
* supports canceling delayed and submitted items;
* prevents resubmission of a item being canceled until cancellation
  completes;
* supports waiting for cancellation to complete;
* supports flushing a work item (waiting for the last submission to
  complete without preventing resubmission);
* supports waiting for a queue to drain (only allows resubmission from
  the work thread);
* supports stopping a work queue in conjunction with draining it;
* prevents handler-reentrancy during resubmission.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Revise the description of queues, work items, and delayable work items
to reflect the terminology and API provided by the new implementation.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Covers all lines that can be reached in controlled conditions.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
The tcp2 infrastructure is using the legacy delayed work API, and
relies heavily on the transient state indicated by an estimate of
delayed time remaining to determine whether a delayed work item is
still active.  While the wrappers for this work in most cases, one use
is unsanctioned: directly accessing the fields of k_delayed_work
structure to satisfy the calling parameters of the handler when
invoked directly.

The chosen solution for this specific need in the new API is to use a
schedule (rather than reschedule) operation, which leaves any previous
timer unchanged but allows immediate submission if the work is idle.
This changes behavior in that the resend is delegated to the work
queue, rather than done immediately.  The former behavior can be
supported by further refactoring that turns the work handler into a
wrapper around a function that takes a connection reference, and
invoking that here, while the handler invokes it after reconstructing
the connection from the contained work item.

For now put in a hack that also uses the non-public fields of the
delayed work structure to implement the required behavior.  The
complete fix if this solution is used requires replacing all use of
k_delayed_work in this module with k_work_delayable, leveraging the
new functionality of the API to avoid having to guess about the true
state of a work item based on its transient timer or flag states.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Switch the default and clean up some test workarounds.  This will enable
final conversions necessary to transition to the new API.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Replace legacy API with new API.  Note that this driver uses the
schedule, not reschedule, API, since triggers for delay never overlap.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
The new API cannot be used from userspace because it is not merely a
wrapper around existing userspace-capable objects (threads and
queues), but instead requires much more complex and lower-level access
to memory that can't be touched from userspace.  The vast majority of
work queue users are operating from privileged mode, so there's little
motivation to go through the pain and complexity of converting all
functions to system calls.

Copy the necessary pieces of the existing userspace work queue API out
and expose them with new names and types:

* k_work_handler_t becomes k_work_user_handler_t
* k_work becomes k_work_user
* k_work_q becomes k_work_user_q

etc.  Because the replacement API cannot use the same types new API
names are also introduced to make it more clear that the userspace
work queue API is a separate functionality.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
Now that the old API has been reimplemented with the new API remove
the old implementation and its tests.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
checkpatch misinterprets the macro that generates event data
structures as being code rather than more data.  This code has not
been changed, but rearrangement of code around it causes a false
positive when the aggregate changes are checked for style.

Add an extra line to eliminate the warning.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
@nashif
Copy link
Member

nashif commented Mar 3, 2021

is the DNM still valid here?

@pabigot pabigot removed the DNM This PR should not be merged (Do Not Merge) label Mar 4, 2021
@pabigot
Copy link
Collaborator Author

pabigot commented Mar 4, 2021

is the DNM still valid here?

No, removed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Base OS Base OS Library (lib/os) area: Bluetooth Mesh area: Bluetooth area: Documentation area: Kernel area: Networking area: Samples Samples area: Tests Issues related to a particular existing or missing test area: Timer Timer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

deep review and redesign of API for work queue functionality