Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: timeout: Ensure that there are no premature timeouts #75885

Closed
wants to merge 8 commits into from

Conversation

nordic-krch
Copy link
Contributor

I've noticed that it is possible that timer will expire prematurely if it is started from the timer expiration handler. I assume that it is unacceptable as timer must never expire before requested timeout (absolute or relative). The reason for that is that when z_add_timeout is called from the expiration handler then requested ticks are not adjusted by ticks elapsed from the latest announcement so essentially it becomes relative to the announcement and not to the moment when user calls k_timer_start in the handler so timer will expire earlier than expected depending on the distance between announcement and k_timer_start and it may vary depending on higher interrupt preemption, multiple timers expiring simultaneously or some processing done in the handler before staring next timer.

Let me try to explain current status.
Current timeouts are "floating" because they do not have common anchor. They will expire more or less when expected but IMO due to this floating it requires more calculation and can result in premature expiration. It is floating because system_clock API sys_clock_set_timeout(ticks) is taking ticks which are relative to now. Flow is following:

  • z_add_timeout is called with relative ticks
  • ticks are incremented by the number of ticks elapsed from the latest announcement (to allow comparison with other timeouts). This step is not performed if called from within the expiration handler (due to that timer may expire prematurely).
  • when next pending timeout is scheduled we take this delta ticks value and decrement it by number of ticks elapsed from latest announcement to get ticks relative to now. Note that elapsed time here and in the previous step may differ. We need to calculate it twice and do some rounding to ticks.
  • Timer driver is setting timeout for request ticks from now (may need to round it to the tick boundary).

There are 3 different nows used for ticks calculation and that is what is meant by "floating" without anchor. The only anchor known to the kernel/timeout and system timer is the last announcement then why not to use it.

My proposal is to change sys_clock_set_timeout(ticks) to use ticks relative to the last announcement. That simplifies the driver (because it knows when last announcement occurred) and kernel timeout (because ticks calculated during z_add_timeout don't need to be adjusted any more).

I've added a tests for premature timeouts to tests/kernel/timer/timer_behavior.
I've also modified test with jitter drift because it was actually validating current behavior by expected that calling a chain of timeouts restarted from the expiration handler will take n * timeout without any aggregated error. This assumption is true for periodic timer where we can expect no aggregated error but when relative timer is started from the expiration handler then there will be a cumulative delay.

PR is mainly opened for discussion since it is a change that would require all system clock timers adjustment and currently it is only done for Nordic RTC and GRTC drivers (where it proved to work better by passing all tests and simplified driver implementation).

@nordic-krch nordic-krch added RFC Request For Comments: want input from the community DNM This PR should not be merged (Do Not Merge) labels Jul 15, 2024
@npitre
Copy link
Collaborator

npitre commented Jul 15, 2024

My first impression: does your timer driver have to be that complicated?
It just feels wrong somehow.

I've simplified a few timer drivers so far:

  • apic_timer.c
  • apic_tsc.c
  • arm_arch_timer.c
  • hpet.c
  • riscv_machine_timer.c

None of them exhibits the floating behavior you describe as everything is
based off of the cycle value synchronized to the last announce tick value.
Can you perhaps adopt a similar strategy in your driver?

When timer is not periodic but next timer is started from expiration
callback of the previous timer then it is expected that there will
be aggregated error. Check for aggregated error only if test is running
with periodic timers.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
Adjust test to handle any system clock frequency. Depending on system
clock frequency it might not be possible to set exact requested timeout
(e.g. 1ms timeout when system timer is 8192 Hz). In that case there would
always be a fixed error which may cause test to fail when timer is working
correctly.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
When k_timer_start() is called with relative timeout then it shall
never expire before current time + that relative requested timeout.
When k_timer_start() is called with absolute timeout then it shall
never expire before that absolute timeout is reached.

Add test suite which checks if timer do not expire before requested
time. Test is checking 3 cases:
- starting a timer from thread
- starting a timer from timer expiration callback with variable delay
  before timer is started.
- starting a timer from a timer expiration callback with variable delay
  after timer is started.
In all 3 cases it is expected that timer will never expire before
requested relative timeout (timeout is relative to k_timer_start call).

Same set of scenarios are tested with absolute timers.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
When 64bit timeouts are used then periodic timer is using absolute
timeouts. In that case there were unnecessary steps performed where
ticks were subtracted by 1 and then 1 tick was added. Reworking code to
fix that.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
Relative timeouts (default, non-absolute) could expire prematurely if
started from timer callback (interrupt handler). That is because they
were not scheduled relative to the moment where z_add_timeout is called
but relative to the last announcement. It breaks the rule that timeout
must never expire earlier than requested (except for periodic timeouts
which may expire earlier to avoid aggregating error).

Additionally, optimized and rework timeouts. Timeouts scheduling were
working in a way that there was no fixed anchor, everything was floating
around expected value. When timeout was added it was adjusted by
elapsed ticks (elapsed from latest announcemnt) then when scheduled to
the driver it was adjust again by elapsed ticks which can be different
since some time has passed. Finally, timer driver was requested to
schedule timeout which is relative to now (which is also a different
point in time). Changed API of the system_timer to work with ticks
relative to the last announcement. After this change kernel timeouts
and driver use the same anchor (last announcement). It simplifies
kernel timeout and driver implementation and fixes early timeouts.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
Adjust sys_clock_elapsed to return rounded up value.
Adjust sys_clock_set_timeout to set timeout value which is
relative to the last announcement.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
Adjust sys_clock_elapsed to return rounded up value.
Adjust sys_clock_set_timeout to set timeout value which is
relative to the last announcement.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
Adjust sys_clock_elapsed to return rounded up value.
Adjust sys_clock_set_timeout to set timeout value which is
relative to the last announcement.

Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
@nordic-krch
Copy link
Contributor Author

@npitre when i tried to run the test that I added here on qemu_x86 it fails (I run it only with test adding commits and without changes in the kernel timeout). I had to increase ticks per seconds as default is 100 Hz: west build -p -b qemu_x86 tests/kernel/timer/timer_behavior/ -- -DCONFIG_SYS_CLOCK_TICKS_PER_SEC=10000.

I see that drivers that you reworked has one less "floating" point because new timeout is set taking last elapsed value and not current now. When I adapted hpet driver to that change then test is passing.

@andyross
Copy link
Contributor

Relative timeouts (default, non-absolute) could expire prematurely if started from timer callback (interrupt handler). That is because they were not scheduled relative to the moment where z_add_timeout is called but relative to the last announcement. It breaks the rule that timeout must never expire earlier than requested (except for periodic timeouts which may expire earlier to avoid aggregating error).

This is actually by-design. The circumstances where this happens are a late interrupt (i.e. one that arrives or is still executing more than a tick after it was scheduled), which is a latency failure already. The desire was for a naively-implemented polling loop that did nothing but set a new timeout in the ISR of the previous one would never slip or lose time due to interrupt latency glitches.

There's a lot of room for argument there, and blood was spilled when this went in the first time. But I continue to believe it's the best choice. Again, the circumstance you're trying to "fix" is basically broken to begin with, and the use case is pretty reasonable.

That said, that policy predates absolute interrupts, which don't have this failure mode. And so maybe they constitute something like an "official API" for latency-glitch-safe timer usage, and maybe we could relax the rule for timer ISR[1] "now" so that it reflects real time and not scheduled time.

But my gut says that this is just churn and we're swapping one hard-to-understand policy for another, and gaining nothing. -1 for now, but if people shout loud enough I can probably be made to bend.

[1] It's worth pointing out that this interacts with SMP such that code that sets a timeout from another core sees the "frozen/stale" time of the late-arriving[2] interrupt on the other CPU and not what it would calculate itself from k_uptime_get(), which is again not "wrong" but probably surprising.

[2] To repeat a third time: you only hit this case when something has broken with timer interrupt delivery and it missed a tick. We're arguing about how best to recover from an already-failed situation.

@npitre
Copy link
Collaborator

npitre commented Jul 17, 2024

Several comments:

First, there are too many issues covered at the same time in this PR e.g.
the "Do not check drift for non-periodic case", "Align test to clock
capabilities", and "early timeouts" are somewhat unrelated. Would be nicer
to review them separately.

Next, in hpet.c you did:

-       uint64_t now = hpet_counter_get();
-       uint32_t ret = (uint32_t)((now - last_count) / cyc_per_tick);
+       uint64_t elapsed_cyc = hpet_counter_get() - last_count;
+       uint32_t ret = (uint32_t)DIV_ROUND_UP(elapsed_cyc, cyc_per_tick);

The commit log says: "Adjust sys_clock_elapsed to return rounded up value" but
you did a little more than that... like some code styling modifications which
incidentally turned a 32-bit division into a 64-bit one. Divisions, especially
64-bit ones, are costly, even more so when they're unnecessary.

Yet I don't understand the rationale for this change in the first place. You
don't want the system to believe more time has actually elapsed than reality.
But by rounding up the result, sys_clock_elapsed() will actually pretend that
more time than reality has elapsed statistically once every 2 calls. Please
justify that change.

Next, you did "Adjust sys_clock_set_timeout to set timeout value which is
relative to the last announcement." This, too, should be a separate commit.
And if you do such an API change, you ought to adjust all drivers at the
same time or some platforms will be broken within some commit range.

Yet this change (that I might like a lot by the way) could have unexpected
consequences. What if a timer has a large timeout period value, and it is set
after a lot of time has passed since the last announcement. The timeout
argument value would be potentially much larger than it used to be.
Can this be a problem? CAn this e.g. overflow the int32_t argument type?
Maybe nnot. Ideally it would be nice to discuss that in the commit log.

Next, the "Minor code optimization" is probalby a no-op. The compiler will
combine the -1 and +1 and optimize them away I'm sure. So this needs to be
judged on code clarity merits only. Again I would have approved it if it
were a separate PR.

Next, about the actual test. CAn you explain why you think there is a
problem? All your arguments about a lack of "fixed anchor" would rather
delay a timer expiration unnecessarily, not make it sooner.

Looking in kernel/timer.c there is this comment:

        /* z_add_timeout() always adds one to the incoming tick count
         * to round up to the next tick (by convention it waits for
         * "at least as long as the specified timeout"), but the
         * period interval is always guaranteed to be reset from
         * within the timer ISR, so no round up is desired and 1 is
         * subtracted in there.
         *
         * Note that the duration (!) value gets the same treatment
         * for backwards compatibility.  This is unfortunate
         * (i.e. k_timer_start() doesn't treat its initial sleep
         * argument the same way k_sleep() does), but historical.  The
         * timer_api test relies on this behavior.
         */

In other words, there is an unfortunate backward compatibility wart with
the timer duration which is, perhaps, exactly what you're running into.

What if you get rid of the duration.ticks = MAX(duration.ticks - 1, 0);
that immediately follows the comment above? Does your test pass without any
other changes?

I personally don't understand why this is there and what would break if
the duration was left as is without that -1. The backwards compatibility claim
is rather dubious given that timers may trigger at or after the provided
duration. With this wart, timers could trigger up to almost one tick too
early. IMHO this should go.

Oh, and the test should probably have a .c file of its own, just to follow
the existing pattern from the other tests in that directory.

@nordic-krch
Copy link
Contributor Author

nordic-krch commented Jul 17, 2024

@andyross

you only hit this case when something has broken with timer interrupt delivery

That's not true if system tick frequency is higher. On nordic SoC we have 32kHz tick rate and no new families system clock is running on low power 1MHz clock (currently tick rate is 10kHz but we might increase it 1MHz to get better timer precision, we are running tickless always so we are not limited here). In that circumstances it is normal that handling of the interrupt (especially if more than one timer expires at the same time or higher priority interrupt preempts) will take few system ticks.

When system clock runs at 100Hz then it might not be that important when timeout will arrive but when clock has 1us precision then it becomes more visible and imo it is critical that scheduled timer does not arrive earlier than requested (e.g. we are s using timer to shut down high precision high frequency clock used by the radio transmission, we don't want to do that before transmission ends).

@npitre
This is marked as RFC/Draft. I did not adjust all drivers before discussion concludes also things may get split into multiple PRs.
Regarding rounding up elapsed time. Elapsed time in the current approach is used in timeout handling only for getting diff between last announcement and current time. If we want to ensure that timeout never expires prematurely then round up is needed. Let say that we are 0.5 tick after the announcement and 100 tick timeout is scheduled. When rounding down, sys_clock_elapsed will return 0 and will schedule timeout for 100 ticks from last announcement which is 99.5 ticks from now. Timeout will expire prematurely. Rounding up means that it is scheduled 100.5 ticks from now. I am aware that if tick rate is slow (e.g. 100 Hz) then it will usually mean that 100 tick timeout is scheduled for 100.9 ticks from now instead of 99.9 and that may look bad. In my opinion premature timeout is bad always but it is more obvious if tick rate is high.

What if you get rid of the duration.ticks = MAX(duration.ticks - 1, 0);

I will check it. Though, this may only remove the need for rounding up in sys_clock_elapsed.

@npitre
Copy link
Collaborator

npitre commented Jul 17, 2024

This is marked as RFC/Draft.

Yeah, sorry. I didn't notice before sending my comments.

Regarding rounding up elapsed time. Elapsed time in the current approach is
used in timeout handling only for getting diff between last announcement and
current time. If we want to ensure that timeout never expires prematurely
then round up is needed. Let say that we are 0.5 tick after the announcement
and 100 tick timeout is scheduled. When rounding down, sys_clock_elapsed
will return 0 and will schedule timeout for 100 ticks from last announcement
which is 99.5 ticks from now. Timeout will expire prematurely.

That's why there is a +1 in z_add_timeout():

            to->dticks = timeout.ticks + 1 + elapsed();

It compensates for that lack of round-up. The queued timeout will be 101 ticks
from last announcement.

What if you get rid of the duration.ticks = MAX(duration.ticks - 1, 0);

I will check it. Though, this may only remove the need for rounding up in
sys_clock_elapsed.

Right. This is IMHO much better than "elapsed" lying about actual elapsed time.

Taking this into account (which is a real problem to be fixed), I still fail
to see how timers can fire early otherwise.

@andyross
Copy link
Contributor

That's not true if system tick frequency is higher.

It is, though; more or less by definition. The tick rate is a promise: the system can handle timing resolution at that frequency. If you blow the deadline and handle the tick late, that's a failure because anyone who registered the timeout with the expectation that their ISR will run on time hasn't received what they wanted.

If your hardware/driver/app can't handle a 32768 Hz tick rate without slipping deadlines, then use a 16384 Hz tick rate, etc... It's a design flaw at some level of the stack. Make the promises you can keep, don't design the system around trying to "handle" the promises you couldn't.

(FWIW: this is all recapitulating fights we had back when this feature went in. I felt strongly then and I feel strongly now that the "faked back-dated timer" handling is the best choice for a bad situation, and that there are no good solutions to slipped tick ISRs.)

@andyross
Copy link
Contributor

Just noticed this bit:

When system clock runs at 100Hz then it might not be that important when timeout will arrive but when clock has 1us precision then it becomes more visible and imo it is critical that scheduled timer does not arrive earlier than requested

This seems confused. Cycle rate and tick rate are not the same thing (though they have been historically on nRF becuase the 32kHz rate was slow enough to make it work). You absolutely can't have a system with a 1MHz tick rate, nothing is going to be able to handle interrupts that fast. Fast-cycle systems in Zephyr have been using ~10kHz for a long time now, and that's a pretty good resolution for general system timing. Subsystems that really need cycle-precise interrupt delivery want to be using counters, not the general timer.

@andyross
Copy link
Contributor

Stating things differently yet again, in the hopes that one variant will clarify:

  1. All this only matters in the case where we have a late/slipped timer ISR. If we meet our deadlines, the two designs work the same.

  2. Right now, we have a design that prioritizes the needs of a periodic k_timer (or similar code) that wants to schedule regular interrupts (for sensor sampling, say). We guarantee that even if we have a late interrupt, the cadence of the interrupts does not change over time and that if we come back in 73 days expecting exactly 4M samples, we'll discover we took 4M samples exactly.

  3. Your code prioritizes the need of timer-driven code that needs to set a new timer for a future time, unrelated to the scheduled time of the current callback, and needs it to be relative to the "true" real time moment as reported by the hardware.

So merging your fix breaks case #1: now we can't guarantee regular timer cadences over long periods in the presence of late interrupts. I think that's bad, because that kind of code is very common (as in, we literally have an API for it in k_timer!). Code that you're trying to fix has alternatives like absolute timeouts and counters that it can use instead.

@nordic-krch
Copy link
Contributor Author

@andyross

You absolutely can't have a system with a 1MHz tick rate

Why? If tickless mode is used then higher tick rate will only mean that you might have more timer interrupts in case timers expire around the same time (in lower rate they would expire at the same tick). Higher tick rate gives better precision. Of course it may fail if somebody will set periodic timer in ticks (e.g. 5 ticks) not being aware that tick frequency varies between targets.

If you blow the deadline and handle the tick late

Exactly, we cannot promise user that ISR will occur exactly on time but we should promise that it will occur not earlier than requested.

All this only matters in the case where we have a late/slipped timer ISR. If we meet our deadlines, the two designs work the same.

No. Let assume that tick rate is 32768 Hz (~30us tick) and we are calling k_timer_start(t, K_TICKS(100), K_NO_WAIT) from the context of timer expiration handler.

  1. z_add_timeout is called.
    to->dticks = timeout.ticks + 1 + elapsed();
    where dticks is set. We can ignore +1 which is compensated by -1 in k_timer_start. elapsed() is using sys_clock_elapsed() if announce_remaining == 0 which is true if we are outside of sys_clock_announce. That's not the case here so dticks=100. dticks is delta ticks aligned to the latest announcement which occurred just now but if we were preempted or processing took more than 30us (quite reasonable to assume that) then our 100 tick from now won't be met.
  2. Timeout is added to queue and not started from z_add_timeout because announce_remaining != 0
    if (to == first() && announce_remaining == 0) {
    it would only be started here if we are not in sys_clock_announce().
  3. sys_clock_announce() completes and sets first pending timeout
    sys_clock_set_timeout(next_timeout(), false);
  4. next_timeout takes dticks (100) and decrement by elapsed(). announce_remaining is already 0 so sys_clock_elapsed() is called. Let say that 2 ticks (60us) already passed so sys_clock_set_timeout() is called with 98 ticks from now. The time difference between user call to k_timer_start and calling sys_clock_set_timeout() can be short (e.g. 0.5tick) if it was the last timeout that expired on the given tick and nothing preempted in between so user will get timeout after 98.5 ticks when 100 ticks were requested.

Right now, we have a design that prioritizes the needs of a periodic k_timer (or similar code) that wants to schedule regular interrupts

For that we have periodic timers. Single shot timers shall not care about previous expiration. Issue will occur if you will start the timer2 from expiration handler of timer1. IMO, only periodic timers should ensure that n expiration handlers will be called in n * period.

Primary root cause for earlier timeouts is this:

return announce_remaining == 0 ? sys_clock_elapsed() : 0U;
because sys_clock_elapsed() is not added to dticks which makes start of timeout relative to tick announcement and not the current moment. I tried to apply sys_clock_elapsed() always but what i don't like there is that sys_clock_elapsed() is applied twice (added z_add_timeout and decremented in next_timeout()). Due to rounding (up or down) we can still end up earlier than expected (though error shall be within single tick). That's why i proposed this change where sys_clock_elapsed is only applied once (when converting ticks to dticks) and sys_clock_set_timeout gets dticks instead of ticks (dticks means delta ticks from last announcment, ticks means ticks from now).

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Higher tick rate gives better precision.

Not if you can't guarantee the interrupt delivery is on time!

Yeah, I think I'm a pretty firm -1 on this. You're asking for a radical change to the way Zephyr system timing works, and I don't think that's appropriate. We already have APIs that can provide higher precision for apps that really need it. Kernel timeouts is for general purpose code that doesn't want to deal with nonsense like handling slipped interrupts.

@peter-mitsis
Copy link
Collaborator

For what it is worth, I am also in agreement with Andy on this.
Although we try to have each tick be uniform in length, we can not guarantee it--there is always some deviation or jitter from one source or another. In what is being described, I view the timer ISR itself as a source of jitter—though a rather extreme one when it consumes more than one tick. In that context, ticks may be out of alignment with respect to the expected points in wall time for a short duration, but the current algorithm auto-corrects soon enough while maintaining event ordering.

@npitre
Copy link
Collaborator

npitre commented Jul 19, 2024

Primary root cause for earlier timeouts is this:

return announce_remaining == 0 ? sys_clock_elapsed() : 0U;

because sys_clock_elapsed() is not added to dticks which makes start
of timeout relative to tick announcement and not the current moment.

Well. the comment just above that line makes it pretty clear that this is
actually the wanted behavior whereas you consider it as a bug. "Fixing" it
for you will certainly break others for whom the current behavior was worth
commenting.

Suggestion: you could hddave a function that does this:

#ifdef CONFIG_TIMEOUT_64BIT
k_timeout_t sys_really_absolute_timeout(k_timeout_t timeout)
{
        int64_t t;

        if (!K_TIMEOUT_EQ(timeout, K_FOREVER) &&
            !K_TIMEOUT_EQ(timeout, K_NO_WAIT) &&
            Z_TICK_ABS(timeout.ticks) < 0) {
                K_SPINLOCK(&timeout_lock) {
                        t = curr_tick + sys_clock_elapsed();
                }
                timeout.ticks = Z_TICK_ABS(t + timeout.ticks);
        }
        return timeout;
}
#endif

Then you just need to invoke your timer with
k_timer_start(&timer, sys_really_absolute_timeout(timeout), 0).
Problem solved without altering current behavior for existing users.

Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Sep 18, 2024
@github-actions github-actions bot closed this Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DNM This PR should not be merged (Do Not Merge) RFC Request For Comments: want input from the community Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants