Fix unstable tests involving nRF system timer #35575

anangl · 2021-05-24T13:18:29Z

kernel: timeout: Fix adding of an absolute timeout

Correct the way the relative ticks value is calculated for an absolute
timeout. Previously, elapsed() was called twice and the returned value
was first subtracted from and then added to the ticks value. It could
happen that the HW counter value read by elapsed() changed between the
two calls to this function. This caused the test_timeout_abs test case
from the timer_api test suite to occasionally fail, e.g. on certain nRF
platforms.
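To illustrate why the two reads do not cancel, here is a minimal sketch in plain C. It is a simplified model, not the code from kernel/timeout.c: the function names and the parameters `e1`, `e2`, and `e` are hypothetical and stand for the values returned by successive elapsed() calls.

```c
#include <stdint.h>

/*
 * Simplified model of the old pattern (hypothetical helper, not Zephyr code):
 * the value from the first elapsed() read (e1) is subtracted, and the value
 * from a second, later read (e2) is added back.
 */
static int64_t dticks_two_reads(int64_t abs_ticks, int64_t curr_tick,
				int64_t e1, int64_t e2)
{
	int64_t ticks = abs_ticks - (curr_tick + e1); /* first read */

	return ticks + e2;                            /* second read */
}

/*
 * Same arithmetic with a single read: the subtracted and added values are
 * guaranteed to be identical, so they cancel exactly.
 */
static int64_t dticks_one_read(int64_t abs_ticks, int64_t curr_tick, int64_t e)
{
	int64_t ticks = abs_ticks - (curr_tick + e);

	return ticks + e;
}
```

With an absolute target of 150, a current tick of 100, and the counter advancing from 3 to 4 between the reads, the two-read variant yields 51 instead of 50, i.e. the timeout lands one tick late.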

drivers: nrf_rtc_timer: Remove unnecessary locking

As per description of the sys_clock_elapsed() function, "the kernel
will call this with appropriate locking, the driver needs only provide
an instantaneous answer". Therefore, remove the unnecessary locking from the
function, as it only adds an undesirable delay.

The above delay combined with the disabled instruction cache (the issue
fixed initially by #35455, then by #35510) caused the test_posix_realtime
test case from the posix_apis test suite to fail on some nRF platforms
(see #35509).

Signed-off-by: Andrzej Głąbek <andrzej.glabek@nordicsemi.no>

andyross

Bug fix looks great. Optimization seems wrong, unless I missed something.

andyross · 2021-05-24T16:20:39Z

kernel/timeout.c

-
-	if (IS_ENABLED(CONFIG_TIMEOUT_64BIT) && Z_TICK_ABS(ticks) >= 0) {
-		ticks = Z_TICK_ABS(timeout.ticks) - (curr_tick + elapsed());
-	}


Good catch. Indeed, this computation should always have happened inside the lock, anyway.

andyross · 2021-05-24T16:29:19Z

drivers/timer/nrf_rtc_timer.c

-	k_spinlock_key_t key = k_spin_lock(&lock);
-	uint32_t ret = counter_sub(counter(), last_count) / CYC_PER_TICK;
-
-	k_spin_unlock(&lock, key);


This looks wrong. That comment isn't saying you don't need synchronization; it's saying that this function doesn't need to be reentrant: you don't need to worry about other users of the timeout system asking for the current time simultaneously.

In fact that last_count variable is mutated from an ISR, so if you aren't locking interrupts this can race, e.g.:

1. Call counter(), which returns a time 1 cycle before the expiration value "T".
2. The counter advances and the interrupt preempts, then advances last_count to be equal to T.
3. The ISR returns and this subtracts last_count (now "T") from counter()'s result from earlier ("T - 1"), giving 0xffffff.
4. The world blows up due to the overflowed value.

Maybe I should remove that sentence in the docs.
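The wraparound in the race described above can be reproduced with a small sketch. It assumes a 24-bit wrapping counter (which is what the 0xffffff result implies) and a hypothetical CYC_PER_TICK of 1. The names mirror those in the diff, but this is an illustration and not the driver source: the counter snapshot and last_count are passed in as plain arguments so that the stale-snapshot case can be shown without real interrupts.

```c
#include <stdint.h>

#define COUNTER_MAX  0x00ffffffu /* assumption: 24-bit wrapping counter */
#define CYC_PER_TICK 1u          /* hypothetical value, chosen for clarity */

/* Modular subtraction within the 24-bit counter range. */
static uint32_t counter_sub(uint32_t a, uint32_t b)
{
	return (a - b) & COUNTER_MAX;
}

/*
 * Models the body of the elapsed calculation: counter_snapshot is the value
 * read from the hardware, last_count is whatever the ISR last stored. If the
 * ISR runs between the read and the subtraction, last_count can be ahead of
 * the snapshot and the result wraps.
 */
static uint32_t elapsed_ticks(uint32_t counter_snapshot, uint32_t last_count)
{
	return counter_sub(counter_snapshot, last_count) / CYC_PER_TICK;
}
```

For example, with T = 1000, `elapsed_ticks(999, 900)` gives 99 when last_count is still the old value, but `elapsed_ticks(999, 1000)` gives 0xffffff once the ISR has advanced last_count to T.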

andyross · 2021-05-24T19:40:45Z

Ah, but this is a single-CPU system and in fact all existing usage (after the fix in the first patch here) is done with a spinlock held and interrupts masked. I'm not completely sure that's safe to document, but this isn't actually a bug as it stands. Might be worth a comment explaining, and maybe an assertion to check that interrupts are masked, but I don't see why this can't go in. I'll remove my -1.

anangl added the bug, area: Timer, and platform: nRF labels May 24, 2021

anangl requested review from andyross, galak and PerMac May 24, 2021 13:18

anangl requested review from dcpleung and nashif as code owners May 24, 2021 13:18

github-actions bot added the area: Kernel label May 24, 2021

anangl added 2 commits May 24, 2021 15:24

anangl force-pushed the fix_absolute_timeouts branch from fe93ff7 to 5221bb7 Compare May 24, 2021 13:25

andyross requested changes May 24, 2021

andyross approved these changes May 24, 2021

nashif approved these changes May 25, 2021

nashif merged commit 457a28b into zephyrproject-rtos:main May 25, 2021

anangl deleted the fix_absolute_timeouts branch May 25, 2021 05:03
