Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unstable tests involving nRF system timer #35575

Merged
merged 2 commits into from
May 25, 2021

Conversation

anangl
Copy link
Member

@anangl anangl commented May 24, 2021

Fixes #35509.

kernel: timeout: Fix adding of an absolute timeout

Correct the way the relative ticks value is calculated for an absolute
timeout. Previously, elapsed() was called twice and the returned value
was first subtracted from and then added to the ticks value. It could
happen that the HW counter value read by elapsed() changed between the
two calls to this function. This caused the test_timeout_abs test case
from the timer_api test suite to occasionally fail, e.g. on certain nRF
platforms.

drivers: nrf_rtc_timer: Remove unnecessary locking

As per description of the sys_clock_elapsed() function, "the kernel
will call this with appropriate locking, the driver needs only provide
an instantaneous answer". Remove then the unnecessary locking from the
function, as it only adds an undesirable delay.

The above delay combined with the disabled instruction cache (the issue
fixed initially by #35455, then by #35510) caused the test_posix_realtime
test case from the posix_apis test suite to fail on some nRF platforms
(see #35509).

@anangl anangl added bug The issue is a bug, or the PR is fixing a bug area: Timer Timer platform: nRF Nordic nRFx labels May 24, 2021
@anangl anangl requested review from andyross, galak and PerMac May 24, 2021 13:18
@anangl anangl requested review from dcpleung and nashif as code owners May 24, 2021 13:18
anangl added 2 commits May 24, 2021 15:24
Correct the way the relative ticks value is calculated for an absolute
timeout. Previously, elapsed() was called twice and the returned value
was first subtracted from and then added to the ticks value. It could
happen that the HW counter value read by elapsed() changed between the
two calls to this function. This caused the test_timeout_abs test case
from the timer_api test suite to occasionally fail, e.g. on certain nRF
platforms.

Signed-off-by: Andrzej Głąbek <andrzej.glabek@nordicsemi.no>
As per description of the sys_clock_elapsed() function, "the kernel
will call this with appropriate locking, the driver needs only provide
an instantaneous answer". Remove then the unnecessary locking from the
function, as it only adds an undesirable delay.

Signed-off-by: Andrzej Głąbek <andrzej.glabek@nordicsemi.no>
@anangl anangl force-pushed the fix_absolute_timeouts branch from fe93ff7 to 5221bb7 Compare May 24, 2021 13:25
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug fix looks great. Optimization seems wrong, unless I missed something.


if (IS_ENABLED(CONFIG_TIMEOUT_64BIT) && Z_TICK_ABS(ticks) >= 0) {
ticks = Z_TICK_ABS(timeout.ticks) - (curr_tick + elapsed());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Indeed, this computation should always have happened inside the lock, anyway.

k_spinlock_key_t key = k_spin_lock(&lock);
uint32_t ret = counter_sub(counter(), last_count) / CYC_PER_TICK;

k_spin_unlock(&lock, key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong. That comment isn't saying you don't need synchronization, it's saying that this function isn't reentrant: you don't need to worry about other users of the timeout system asking for the current time simultaneously.

In fact that last_count variable is mutated from an ISR, so if you aren't locking interrupts this can race, e.g.:

  • Call counter(), which returns a time 1 cycle before the expiration value "T"
  • The counter advances and the interrupt preempts, then advances last_count to be equal to T
  • The ISR returns and this subtracts last_count (now "T") from counter()'s result from earlier ("T - 1"), giving 0xffffff
  • The world blows up due to the overflowed value

Maybe I should remove that sentence in the docs.

@andyross
Copy link
Contributor

Ah, but this is a single-CPU system and in fact all existing usage (after the fix in the first patch here) is done with a spinlock held and interrupts masked. I'm not completely sure that's safe to document, but this isn't actually a bug as it stands. Might be worth a comment explaining, and maybe an assertion to check that interrupts are masked, but I don't see why this can't go in. I'll remove my -1.

@nashif nashif merged commit 457a28b into zephyrproject-rtos:main May 25, 2021
@anangl anangl deleted the fix_absolute_timeouts branch May 25, 2021 05:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel area: Timer Timer bug The issue is a bug, or the PR is fixing a bug platform: nRF Nordic nRFx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tests: timer: Unstable tests using timer at nrf platforms
3 participants