Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: timer_api: more tweaks for inexact conversions and poor clock tolerance #25539

Merged
merged 4 commits into from
May 24, 2020

Conversation

pabigot
Copy link
Collaborator

@pabigot pabigot commented May 21, 2020

A couple failures observed in #24001 are a consequence of precision loss converting between milliseconds and 32 KiHz ticks, compounded by clock rate variation. Other failures are due to skew between the clock used for busywait and the clock used for system time. Also asynchronous updates to the tick counter can occur before or after the value it's compared with has been captured.

In short, tweak the test until it runs reliably on problematic hardware identified in #24001 (where "reliably" means >100 consecutive runs without failure using CONFIG_ZTEST_RETEST_IF_PASSED=y).

Fixes #24001
Fixes #24784 through a documentation fix clarifying that k_busy_wait() delays are not necessarily aligned to sysclock.

@zephyrbot zephyrbot added area: Tests Issues related to a particular existing or missing test area: Kernel labels May 21, 2020
@zephyrbot
Copy link
Collaborator

zephyrbot commented May 21, 2020

All checks passed.

Tip: The bot edits this comment instead of posting a new one, so you can check the comment's history to see earlier messages.

@pabigot pabigot force-pushed the nordic/20200521d branch from 78773a9 to 31775e8 Compare May 21, 2020 22:58
@pabigot
Copy link
Collaborator Author

pabigot commented May 22, 2020

This passed 2160 consecutive runs before failing with:

    Assertion failed at ../src/main.c:571: test_timer_remaining: abs((s32_t)(rem_ticks - target_rem_tic
ks)) <= 3 is false

on a board that had a fairly accurate (< 1000 ppm error) HFINT (64 MHz internal oscillator).

On another board with -7200 ppm HFINT error and +61 ppm LFXO error the tick error on that test is -13 (expected tick duration is 1640) consistently. I don't see any point in increasing the threshold higher: some boards will simply be out of tolerance.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitpicks. The only issue I feel strongly about is that we can't change documented behavior via a tweak to a test. If the clocks are now allowed to be skewed by policy now, then we should document that in the k_busy_wait() definition.

@@ -18,6 +19,7 @@ struct timer_data {
#define EXPIRE_TIMES 4
#define WITHIN_ERROR(var, target, epsilon) \
(((var) >= (target)) && ((var) <= (target) + (epsilon)))
#define INEXACT_MS_CONVERT (k_ticks_to_ms_ceil64(k_ms_to_ticks_ceil64(1)) != 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a little more robust if you checked that they weren't divisible directly. Something like:

#define INEXACT_MS_CONVERT                       \
    (MAX(CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC,     \
         CONFIG_SYS_CLOCK_TICKS_PER_SEC) %       \
     MIN(CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC,     \
         CONFIG_SYS_CLOCK_TICKS_PER_SEC) == 0)   \

Honestly I had to think really hard to figure out if that double conversion was guaranteed to detect the case you want, and I'm still not completely convinced it does for all non-divisible rates.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that does at all the same thing; cycles per second isn't relevant.

We need to know that for 1 ms we get an integral number of ticks. If we get an integral number of ticks, then when we convert that factor back to ms it'll be the same value we started with.

For the original approach think of it as: the ceil operation will increase the value only if there's a remainder for the corresponding scaling. The only way you put in 1, convert, reverse convert, and get 1 back is if neither conversion produced a remainder.

But the original condition was stronger than it needed to be, at least for CYCLES_PER_SEC >= 1000, so the required condition can be expressed more simply which I've done.

I've also tried to explain why this is needed.

TIMER_ASSERT(interval >= DURATION, timer);
TIMER_ASSERT((interval >= DURATION)
|| (INEXACT_MS_CONVERT
&& (interval == DURATION - 1)), timer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inexact version is a tighter bound (looking for an exact interval value) than the old one (make sure it's higher than or equal to duration). That seems wrong. It's OK to add slop where needed, but we should be testing the same behavior on all platforms.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this statement. If the original test passes nothing changes. If it fails, then if we know we might have an error in converting ms to ticks, and in that case we relax the lower bound to account for a tick that wasn't added because accumulated error exceeded a tick.

&periodicity_timer);
TIMER_ASSERT(WITHIN_ERROR(delta, period_ms, 1)
|| (INEXACT_MS_CONVERT
&& (delta == period_ms - 1)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar deal: it used to allow slop, now it doesn't. We should do the same thing with a selectable offset so we don't end up accidentally testing different behavior on different systems.

* or smaller than expected, so relax the tolerance. 3 ticks
* variance has been observed on hardware where the busywait
* clock is 0.7% slow and the 32 KiHz tick clock is 60 ppm
* fast relative to a stable external clock.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... what this does is basically change the Zephyr tested promise that the busy wait and system clock times are compatible. I think if we're going to do this as policy it needs to be spelled out in the docs explicitly somewhere, not just silently waved away in a test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@carlescufi
Copy link
Member

@andyross Perhaps we could discuss the relationship between the system timer and k_busy_wait() also in the context of #24784, and settle this as part of this release. I added a recent comment there:
#24784 (comment)

@andyross
Copy link
Contributor

Honestly I'd be happy with a documentation fix. Right now the clear implication of the way k_timer and k_busy_wait are defined is that they work in the same units. If they're allowed to be different, we should say they can be different (and maybe for extra credit have a framework for characterizing that skew, etc...).

But we can't just patch a test. Real users are not unlikely to make the same assumption the test author did.

@carlescufi
Copy link
Member

Honestly I'd be happy with a documentation fix. Right now the clear implication of the way k_timer and k_busy_wait are defined is that they work in the same units. If they're allowed to be different, we should say they can be different (and maybe for extra credit have a framework for characterizing that skew, etc...).

But we can't just patch a test. Real users are not unlikely to make the same assumption the test author did.

I agree. @pabigot how about proposing a documentation patch in this very PR, and deal with #24784 at the same time?

pabigot added 4 commits May 22, 2020 15:12
Some tests assumed that a delay with k_busy_wait() could be precisely
timed by a tick or cycle count, and vice versa.  This is not true for
all clock configurations on all Zephyr targets, so highlight the
potential variation.

As an example: on some platforms busy-wait is performed by a loop with
known duration in cycles of a known CPU frequency.  The CPU clock may
be unrelated to the clock that drives the system timer.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
When one millisecond is not an integral number of ticks measuring
durations between tick events will sometimes be less than expected to
correct for error that was accumulated between other events.  Allow
for that in the duration and period comparisons.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
A fast timer clock can advance before or after the remaining time
until an event is captured.  Verify the expected relationship between
current and remaining time holds for at least one captured current
time.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
When the timer frequency is not a multiple of 1000 converting between
ticks and milliseconds introduces error.  Avoid propagating the error
by converting divided values rather than dividing converted values.

Also compensate for observed rate differences between the busywait
clock and the timeout clock.

Signed-off-by: Peter Bigot <peter.bigot@nordicsemi.no>
@pabigot pabigot force-pushed the nordic/20200521d branch from 31775e8 to ae90254 Compare May 22, 2020 20:12
@carlescufi carlescufi requested a review from nordic-krch May 24, 2020 18:28
@carlescufi carlescufi merged commit 8fbc83c into zephyrproject-rtos:master May 24, 2020
@pabigot pabigot deleted the nordic/20200521d branch May 26, 2020 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel area: Tests Issues related to a particular existing or missing test
Projects
None yet
4 participants