kernel: timeout: Ensure that there are no premature timeouts #75885
My first impression: does your timer driver have to be that complicated? I've simplified a few timer drivers so far:
None of them exhibits the floating behavior you describe as everything is |
When timer is not periodic but next timer is started from expiration callback of the previous timer then it is expected that there will be aggregated error. Check for aggregated error only if test is running with periodic timers. Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
Adjust test to handle any system clock frequency. Depending on system clock frequency it might not be possible to set exact requested timeout (e.g. 1ms timeout when system timer is 8192 Hz). In that case there would always be a fixed error which may cause test to fail when timer is working correctly. Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
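The fixed error mentioned in this commit message can be worked out by hand. The following is a minimal sketch in plain C, not Zephyr code: the helper names `us_to_ticks_ceil` and `ticks_to_ns_floor` are hypothetical, and rounding up to the next tick is an assumption made here so that the timeout is never shorter than requested.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of ticks that covers at least `us` microseconds
 * at a tick rate of `hz` (hypothetical helper, rounds up).
 */
uint64_t us_to_ticks_ceil(uint64_t us, uint64_t hz)
{
	return (us * hz + 999999ULL) / 1000000ULL;
}

/* Duration of `ticks` ticks in nanoseconds, rounded down. */
uint64_t ticks_to_ns_floor(uint64_t ticks, uint64_t hz)
{
	return ticks * 1000000000ULL / hz;
}

/* The example from the commit message: 1 ms at 8192 Hz. */
void tick_rounding_example(void)
{
	/* 1 ms is 8.192 ticks, so 9 ticks are needed. */
	uint64_t ticks = us_to_ticks_ceil(1000, 8192);

	assert(ticks == 9);
	/* 9 ticks last about 1.0986 ms, a fixed error of roughly 98.6 us. */
	assert(ticks_to_ns_floor(ticks, 8192) == 1098632);
}
```

With a 1000 Hz tick rate the same request maps to exactly 1 tick and the fixed error disappears, which is why a test with hard-coded tolerances can pass on one target and fail on another.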
When k_timer_start() is called with a relative timeout, the timer shall never expire before the current time plus the requested relative timeout. When k_timer_start() is called with an absolute timeout, it shall never expire before that absolute timeout is reached. Add a test suite which checks that a timer does not expire before the requested time. The test checks 3 cases:
- starting a timer from a thread
- starting a timer from a timer expiration callback, with a variable delay before the timer is started
- starting a timer from a timer expiration callback, with a variable delay after the timer is started
In all 3 cases it is expected that the timer will never expire before the requested relative timeout (the timeout is relative to the k_timer_start call). The same set of scenarios is tested with absolute timers. Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
When 64bit timeouts are used then periodic timer is using absolute timeouts. In that case there were unnecessary steps performed where ticks were subtracted by 1 and then 1 tick was added. Reworking code to fix that. Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
Relative timeouts (default, non-absolute) could expire prematurely if started from a timer callback (interrupt handler). That is because they were not scheduled relative to the moment where z_add_timeout is called but relative to the last announcement. It breaks the rule that a timeout must never expire earlier than requested (except for periodic timeouts, which may expire earlier to avoid aggregating error).

Additionally, optimize and rework timeouts. Timeout scheduling worked in a way that had no fixed anchor; everything was floating around the expected value. When a timeout was added it was adjusted by elapsed ticks (elapsed since the latest announcement), then when scheduled to the driver it was adjusted again by elapsed ticks, which can be different since some time has passed. Finally, the timer driver was requested to schedule a timeout which is relative to now (which is also a different point in time).

Change the API of the system_timer to work with ticks relative to the last announcement. After this change kernel timeouts and the driver use the same anchor (the last announcement). It simplifies the kernel timeout and driver implementation and fixes early timeouts. Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
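The premature expiry described in this commit message can be illustrated with a small model. This is only a sketch in plain C with tick values as plain integers; the function names are hypothetical and nothing here is the actual kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Expiry tick when the delay is counted from the last announcement,
 * which is how the commit message describes timeouts started from a
 * timer callback.
 */
uint64_t expiry_from_announcement(uint64_t last_announce, uint64_t delay)
{
	return last_announce + delay;
}

/* Expiry tick when the ticks elapsed since the last announcement are
 * added once, so the delay is effectively counted from "now".
 */
uint64_t expiry_from_now(uint64_t last_announce, uint64_t elapsed,
			 uint64_t delay)
{
	return last_announce + elapsed + delay;
}

/* Number of ticks by which `expiry` precedes now + delay (0 if not early). */
uint64_t ticks_early(uint64_t now, uint64_t delay, uint64_t expiry)
{
	uint64_t requested = now + delay;

	return expiry < requested ? requested - expiry : 0;
}

void premature_expiry_example(void)
{
	/* Announcement at tick 1000; the handler restarts the timer
	 * 3 ticks later (now = 1003) with a 10 tick delay.
	 */
	assert(ticks_early(1003, 10, expiry_from_announcement(1000, 10)) == 3);
	assert(ticks_early(1003, 10, expiry_from_now(1000, 3, 10)) == 0);
}
```

In this model the timeout anchored to the announcement fires 3 ticks before the caller's deadline, while the one that accounts for elapsed ticks does not.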
Adjust sys_clock_elapsed to return rounded up value. Adjust sys_clock_set_timeout to set timeout value which is relative to the last announcement. Signed-off-by: Krzysztof Chruściński <krzysztof.chruscinski@nordicsemi.no>
@npitre when I tried to run the test that I added here on I see that the drivers that you reworked have one less "floating" point because the new timeout is set taking last |
This is actually by-design. The circumstances where this happens are a late interrupt (i.e. one that arrives or is still executing more than a tick after it was scheduled), which is a latency failure already. The desire was for a naively-implemented polling loop that did nothing but set a new timeout in the ISR of the previous one would never slip or lose time due to interrupt latency glitches.

There's a lot of room for argument there, and blood was spilled when this went in the first time. But I continue to believe it's the best choice. Again, the circumstance you're trying to "fix" is basically broken to begin with, and the use case is pretty reasonable.

That said, that policy predates absolute interrupts, which don't have this failure mode. And so maybe they constitute something like an "official API" for latency-glitch-safe timer usage, and maybe we could relax the rule for timer ISR[1] "now" so that it reflects real time and not scheduled time. But my gut says that this is just churn and we're swapping one hard-to-understand policy for another, and gaining nothing.

-1 for now, but if people shout loud enough I can probably be made to bend.

[1] It's worth pointing out that this interacts with SMP such that code that sets a timeout from another core sees the "frozen/stale" time of the late-arriving[2] interrupt on the other CPU and not what it would calculate itself from k_uptime_get(), which is again not "wrong" but probably surprising.

[2] To repeat a third time: you only hit this case when something has broken with timer interrupt delivery and it missed a tick. We're arguing about how best to recover from an already-failed situation.
Several comments: First, there are too many issues covered at the same time in this PR e.g. Next, in hpet.c you did:
The commit log says: "Adjust sys_clock_elapsed to return rounded up value" but Yet I don't understand the rationale for this change in the first place. You Next, you did "Adjust sys_clock_set_timeout to set timeout value which is Yet this change (that I might like a lot by the way) could have unexpected Next, the "Minor code optimization" is probably a no-op. The compiler will Next, about the actual test. Can you explain why you think there is a Looking in kernel/timer.c there is this comment:
In other words, there is an unfortunate backward compatibility wart with What if you get rid of the I personally don't understand why this is there and what would break if Oh, and the test should probably have a .c file of its own, just to follow |
That's not true if the system tick frequency is higher. On Nordic SoCs we have a 32kHz tick rate, and on new families the system clock is running on a low power 1MHz clock (currently the tick rate is 10kHz but we might increase it to 1MHz to get better timer precision; we always run tickless so we are not limited here). In those circumstances it is normal that handling of the interrupt (especially if more than one timer expires at the same time or a higher priority interrupt preempts) will take a few system ticks. When the system clock runs at 100Hz it might not be that important when the timeout arrives, but when the clock has 1us precision it becomes more visible, and imo it is critical that a scheduled timer does not arrive earlier than requested (e.g. we are using a timer to shut down a high precision, high frequency clock used by the radio transmission, and we don't want to do that before the transmission ends). @npitre
I will check it. Though, this may only remove the need for rounding up in |
Yeah, sorry. I didn't notice before sending my comments.
That's why there is a +1 in
It compensates for that lack of round-up. The queued timeout will be 101 ticks
Right. This is IMHO much better than "elapsed" lying about actual elapsed time. Taking this into account (which is a real problem to be fixed), I still fail |
It is, though; more or less by definition. The tick rate is a promise: the system can handle timing resolution at that frequency. If you blow the deadline and handle the tick late, that's a failure because anyone who registered the timeout with the expectation that their ISR will run on time hasn't received what they wanted. If your hardware/driver/app can't handle a 32768 Hz tick rate without slipping deadlines, then use a 16384 Hz tick rate, etc... It's a design flaw at some level of the stack. Make the promises you can keep, don't design the system around trying to "handle" the promises you couldn't. (FWIW: this is all recapitulating fights we had back when this feature went in. I felt strongly then and I feel strongly now that the "faked back-dated timer" handling is the best choice for a bad situation, and that there are no good solutions to slipped tick ISRs.) |
Just noticed this bit:
This seems confused. Cycle rate and tick rate are not the same thing (though they have been historically on nRF because the 32kHz rate was slow enough to make it work). You absolutely can't have a system with a 1MHz tick rate, nothing is going to be able to handle interrupts that fast. Fast-cycle systems in Zephyr have been using ~10kHz for a long time now, and that's a pretty good resolution for general system timing. Subsystems that really need cycle-precise interrupt delivery want to be using counters, not the general timer. |
Stating things differently yet again, in the hopes that one variant will clarify:
So merging your fix breaks case #1: now we can't guarantee regular timer cadences over long periods in the presence of late interrupts. I think that's bad, because that kind of code is very common (as in, we literally have an API for it in k_timer!). Code that you're trying to fix has alternatives like absolute timeouts and counters that it can use instead. |
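The trade-off under discussion can be made concrete with a small model of a chain of one-shot timeouts, each restarted from the handler of the previous one. This is a sketch in plain C under stated assumptions: `chain_end`, `example_latency`, and the latency values are hypothetical, and times are plain tick counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler latencies, in ticks, for four expirations. */
const uint64_t example_latency[4] = {2, 0, 3, 0};

/* Expiry tick of the n-th timeout in a chain where each handler
 * restarts the next timeout. latency[i] is the number of ticks between
 * the scheduled expiry i and the moment the handler re-arms.
 * If anchored_to_schedule is nonzero, the next delay is counted from
 * the scheduled expiry; otherwise it is counted from the re-arm moment.
 */
uint64_t chain_end(uint64_t start, uint64_t period, const uint64_t *latency,
		   size_t n, int anchored_to_schedule)
{
	uint64_t expiry = start + period; /* first timeout, started at `start` */

	for (size_t i = 0; i + 1 < n; i++) {
		uint64_t base = anchored_to_schedule ? expiry
						     : expiry + latency[i];

		expiry = base + period;
	}
	return expiry;
}

void chain_example(void)
{
	/* Anchored to the schedule: 4 * 10 ticks, no accumulated error. */
	assert(chain_end(0, 10, example_latency, 4, 1) == 40);
	/* Anchored to "now": latencies 2 + 0 + 3 accumulate to 5 ticks. */
	assert(chain_end(0, 10, example_latency, 4, 0) == 45);
}
```

In the first variant the cadence is preserved, but the second timeout fires only 8 ticks after the handler asked for 10. In the second variant no timeout is early, but the chain drifts by the sum of the latencies.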
Why? If tickless mode is used then higher tick rate will only mean that you might have more timer interrupts in case timers expire around the same time (in lower rate they would expire at the same tick). Higher tick rate gives better precision. Of course it may fail if somebody will set periodic timer in ticks (e.g. 5 ticks) not being aware that tick frequency varies between targets.
Exactly, we cannot promise user that ISR will occur exactly on time but we should promise that it will occur not earlier than requested.
No. Let assume that tick rate is 32768 Hz (~30us tick) and we are calling
For that we have periodic timers. Single shot timers shall not care about previous expiration. Issue will occur if you will start the timer2 from expiration handler of timer1. IMO, only periodic timers should ensure that Primary root cause for earlier timeouts is this: Line 80 in 3828c8b
sys_clock_elapsed() is not added to dticks, which makes the start of the timeout relative to the tick announcement and not to the current moment. I tried applying sys_clock_elapsed() always, but what I don't like there is that sys_clock_elapsed() is applied twice (added in z_add_timeout and subtracted in next_timeout()). Due to rounding (up or down) we can still end up earlier than expected (though the error shall be within a single tick). That's why I proposed this change, where sys_clock_elapsed is applied only once (when converting ticks to dticks) and sys_clock_set_timeout gets dticks instead of ticks (dticks means delta ticks from the last announcement; ticks means ticks from now).
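The effect of rounding when the elapsed time is folded into dticks once can be sketched as follows. This is a model in plain C working in hardware cycles, not the kernel code: the function names are hypothetical, `cpt` stands for cycles per tick, and the announcement is assumed to sit on a tick boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed since the last announcement, rounded down. */
uint64_t elapsed_floor(uint64_t announce_cyc, uint64_t now_cyc, uint64_t cpt)
{
	return (now_cyc - announce_cyc) / cpt;
}

/* Ticks elapsed since the last announcement, rounded up. */
uint64_t elapsed_ceil(uint64_t announce_cyc, uint64_t now_cyc, uint64_t cpt)
{
	return (now_cyc - announce_cyc + cpt - 1) / cpt;
}

/* Expiry in cycles when dticks (delta ticks from the last announcement)
 * is computed once as elapsed + requested ticks.
 */
uint64_t expiry_cyc(uint64_t announce_cyc, uint64_t elapsed_ticks,
		    uint64_t ticks, uint64_t cpt)
{
	uint64_t dticks = elapsed_ticks + ticks;

	return announce_cyc + dticks * cpt;
}

void dticks_example(void)
{
	/* 100 cycles per tick, announcement at cycle 0, timer started at
	 * cycle 250 for 10 ticks: the caller expects cycle 1250 or later.
	 */
	assert(expiry_cyc(0, elapsed_floor(0, 250, 100), 10, 100) == 1200);
	assert(expiry_cyc(0, elapsed_ceil(0, 250, 100), 10, 100) == 1300);
}
```

With elapsed rounded down the expiry lands 50 cycles before the requested moment; with elapsed rounded up it never precedes it, at the cost of up to one extra tick of delay.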
|
Higher tick rate gives better precision.
Not if you can't guarantee the interrupt delivery is on time!
Yeah, I think I'm a pretty firm -1 on this. You're asking for a radical change to the way Zephyr system timing works, and I don't think that's appropriate. We already have APIs that can provide higher precision for apps that really need it. Kernel timeouts are for general purpose code that doesn't want to deal with nonsense like handling slipped interrupts.
For what it is worth, I am also in agreement with Andy on this. |
Well, the comment just above that line makes it pretty clear that this is Suggestion: you could have a function that does this:
Then you just need to invoke your timer with |
This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time. |
I've noticed that a timer can expire prematurely if it is started from the timer expiration handler. I assume that is unacceptable, as a timer must never expire before the requested timeout (absolute or relative). The reason is that when z_add_timeout is called from the expiration handler, the requested ticks are not adjusted by the ticks elapsed since the latest announcement. The timeout essentially becomes relative to the announcement and not to the moment when the user calls k_timer_start in the handler, so the timer expires earlier than expected. How much earlier depends on the distance between the announcement and k_timer_start, which may vary with preemption by higher priority interrupts, multiple timers expiring simultaneously, or processing done in the handler before starting the next timer.

Let me try to explain the current status.

Current timeouts are "floating" because they do not have a common anchor. They will expire more or less when expected, but IMO this floating requires more calculation and can result in premature expiration. It is floating because the system_clock API sys_clock_set_timeout(ticks) takes ticks which are relative to "now". The flow is as follows:
- z_add_timeout is called with relative ticks, which are adjusted by the ticks elapsed since the latest announcement.
- When the timeout is scheduled to the driver, it is adjusted again by elapsed ticks so that it is relative to "now". Note that the elapsed time here and in the previous step may differ. We need to calculate it twice and do some rounding to ticks.
- The timer driver schedules the timeout relative to its own "now" (may need to round it to the tick boundary).

There are 3 different "now"s used for the tick calculation, and that is what is meant by "floating" without an anchor. The only anchor known to both the kernel timeout code and the system timer is the last announcement, so why not use it?

My proposal is to change sys_clock_set_timeout(ticks) to use ticks relative to the last announcement. That simplifies the driver (because it knows when the last announcement occurred) and the kernel timeout code (because ticks calculated during z_add_timeout don't need to be adjusted any more).

I've added tests for premature timeouts to tests/kernel/timer/timer_behavior.

I've also modified the jitter drift test because it was actually validating the current behavior, expecting that a chain of timeouts restarted from the expiration handler takes n * timeout without any aggregated error. This assumption is true for a periodic timer, where we can expect no aggregated error, but when a relative timer is started from the expiration handler there will be a cumulative delay.

The PR is mainly opened for discussion, since the change would require adjusting all system clock timer drivers, and currently that is done only for the Nordic RTC and GRTC drivers (where it proved to work better, passing all tests and simplifying the driver implementation).