Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xtimer: improve performance #7053

Closed

Conversation

immesys
Copy link
Contributor

@immesys immesys commented May 12, 2017

This PR is @Hyungsin's work, I just cleaned it up and rebased it for pushing upstream.

The PR attempts to minimize the number of times that xtimer_now is called, as this operation is far more expensive than you would expect.

On SAMR21, as part of their energy saving design, Atmel has the timer counter registers on a different clock domain in the chip, it is necessary to perform a synchronized read to get the current value of the counter (e.g here ). The nature of this clock sync is that it takes a few cycles of the clock driving the counter to synchronize. We are using a 32768 hz clock for xtimer as it operates during sleep (OSCULP) and in practice it takes several hundred microseconds to read the timer counter register. This does not appear to be a bug, it is just the cost of reading a register that is on a different clock domain (likely some metastability fix in the silicon).

What this means is that a program that simply wakes up and immediately sleeps takes roughly 2ms to do so. Initially we attributed this to PLL startup etc, but after digging in we found that is actually all negligible, the real cost is reading the timer register six times (four on wakeup, two on setting a new timer when going to sleep). A simple refactor of the xtimer code allows this to happen only three times (two on wakeup, one to set a new timer when going to sleep). Certain xtimer callbacks like unlocking a mutex can be assumed to take negligable time, so we allow them to be flagged as being trivial, so that now after the callback can be assumed to be within XTIMER_ISR_BACKOFF of now before the callback, removing a register read. Other changes include passing now as a parameter to internal functions instead of re-reading from the timer register.

To illustrate, look at this plot of a program just waking up and immediately going to sleep again using xtimer_usleep in a loop.

image

The first thing to note is that the overhead of xtimer_now is not affected by the primary CPU clock frequency, because it is on the timer's clock domain (both 8Mhz and 48Mhz take the same amount of time). This also confirms that the delay is not PLL, because the 8Mhz is a direct RC with no startup delay.

Secondly you can see with this PR the time is reduced from ~1.9ms to ~0.8ms. This is very useful for low power applications wanting to spend as little time as possible in high-power modes, and brings RIOT-OS more in line with Contiki and TinyOS in timer overhead.

@immesys
Copy link
Contributor Author

immesys commented May 12, 2017

I guess good reviewers would be @kaspar030 @gebart

@Hyungsin
Copy link

Hyungsin commented May 13, 2017

Just for comparison,
TinyOS calls 'timer now' twice when waking up through a timer interrupt and twice when falling asleep after setting up a timer.

@smlng
Copy link
Member

smlng commented May 13, 2017

Nice work!

@smlng smlng added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Area: timers Area: timer subsystems labels May 13, 2017
@smlng smlng requested a review from kaspar030 May 13, 2017 14:05
@lebrush
Copy link
Member

lebrush commented May 16, 2017

@Hyungsin @immesys Very nice work!
One question and one suggestion :-)

mutex_unlock behaves "differently" depending on the mutex status (unlocked, locked(1), locked(1+)). Did you test if it can be considered trivial as well for the last (1+) case? I say it because it could change the current behaviour.

It would be great if you could document in which cases a developer should mark the callback as trivial and when not.

@smlng smlng mentioned this pull request May 18, 2017
@kaspar030
Copy link
Contributor

Nice find!

I've been working on a timer rework (with the aim of making multiple timers, e.g., high speed and RTC, possible). Now I have to go back to the drawing board, to minimize now()-calls, which I assumed to be cheap...

Adding another field to xtimer_t bloats it even more. Having to "annotate" timer ISRs increases complexity. Maybe there's another way.

If I understand it correctly, this is only an issue for getting a low-power timer's value, because of the clock domain difference. We could use a same-clock-domain timer (e.g., systick on cortex-m) to measure the actual time taken by an ISR and use that to deduce "trivial_callback".

@immesys
Copy link
Contributor Author

immesys commented May 18, 2017

Oh that's a nice idea! @Hyungsin we should try that

Copy link
Member

@jnohlgard jnohlgard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the xtimer_drift test application completely breaks down with this PR applied. Both native and frdm-kw41z are broken.
xtimer_drift on native spews output for about 1 second, then just stops. I don't know where the problem is, only that it works without this PR, but not with it.

@immesys
Copy link
Contributor Author

immesys commented Jun 5, 2017 via email


/* time update after executing callback */
if (!timer->trivial_callback) {
now = _xtimer_lltimer_now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assume you have 10 timers with non trivial callbacks, this will result in 10x _xtimer_lltimer_now() instead of just once with the current implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for instance, keep track (e.g. boolean non-trivial = false) if at least one callback was non-trivial and than simplify below to

if (non-trivial && (reference > _xtimer_lltimer_now()) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Please note that the current implementation calls _xtimer_lltimer_now() in _timer_left(xx) function, which calls _xtimer_lltimer_now() every time regardless of trivial or non-trivial callbacks. In contrast, this PR calls _xtimer_lltimer_now() only for non trivial callbacks.

_xtimer_set_absolute(timer, target);
uint32_t now = _xtimer_now();
uint32_t target = now + offset;
_xtimer_set_absolute(timer, target, now);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if an interrupt or context switch happens between _xtimer_now() and _xtimer_set_absolute, IMHO this would mess up calculations in latter function or not?

Copy link

@Hyungsin Hyungsin Jun 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it. If something happens in the middle, between calling _xtimer_now() and actually using this value, "now" would be outdated and it can mess up the calculation in _xtimer_set_absolute. Is this your point? But on the other hand, I think that "now" can be outdated always wherever it is used... not just here... isn't it?

Copy link
Member

@smlng smlng Jul 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes basically you're right - depends on where/when parts are protected by irq_disable() and irq_restore(state).

@Hyungsin
Copy link

@kaspar030 @lebrush , thank you guys for constructive comments!
I removed trivial_callback and modified the code so that the main clock measures the actual time period for executing _shoot(timer). This time period determines if a callback is trivial or not.

@aabadie aabadie modified the milestone: Release 2017.07 Jun 26, 2017
@immesys
Copy link
Contributor Author

immesys commented Jun 29, 2017

Ok we have changed our approach. The complexity has gone up a bit but it is less hacky.

The synopsis is like this:

  • We can't use high freq timers for xtimer directly because that means operating them during sleep which is unacceptable
  • We can't use only low freq timers for xtimer because they are very slow to query for current value
  • High freq timers are essentially free while the CPU is awake
  • Therefore a hybrid approach where low freq timers are used for timekeeping while asleep and high freq timers are used while awake

This PR is not ready for merge as is (I am sure there are style problems) but it illustrates the approach. Instead of a hacky heuristic like "trivial callback" we actually measure the time the callback takes on a high frequency timer

EDIT: credit for this idea goes to @kaspar030

@Hyungsin
Copy link

Hyungsin commented Jun 29, 2017

The recent commit has also fixed some bugs of the previous version and now becomes more stable.

@kaspar030, your understanding is right indeed. Reading the time info from a fast timer is cheap, while that from a slow timer is expensive. So, we provide a room for user configuration. If xtimer is fed by a fast clock, we can disable this feature by not defining STIMER_DEV. Note that STIMER is a fast timer that helps to improve slow xtimer's performance. When xtimer is slow, we define STIMER_DEV to use it for minimizing xtimer now() calls.

@immesys immesys closed this Oct 27, 2017
@immesys immesys deleted the forupstream_xtimer_perf branch October 27, 2017 22:44
@jnohlgard
Copy link
Member

@immesys Why did you close this?

@immesys immesys restored the forupstream_xtimer_perf branch October 28, 2017 04:58
@immesys
Copy link
Contributor Author

immesys commented Oct 28, 2017

argh. It was auto closed when I accidentally cleaned up a live branch. my bad.

@immesys immesys reopened this Oct 28, 2017
@smlng
Copy link
Member

smlng commented Jan 15, 2018

inconclusive discussion, remove milestone

@smlng smlng removed this from the Release 2018.01 milestone Jan 15, 2018
@immesys
Copy link
Contributor Author

immesys commented Jan 15, 2018

Yes, don't put this in the release, we are still working on it.

@immesys
Copy link
Contributor Author

immesys commented Jan 15, 2018

@Hyungsin can you ensure the forupstream_xtimer_perf branch has the latest improvements, in case this becomes a merge candidate again?

@Hyungsin
Copy link

@immesys, I rebased and updated.
You can just clone the following branch.
https://github.com/Hyungsin/RIOT-OS/tree/forupstream_xtimer_perf

@immesys immesys force-pushed the forupstream_xtimer_perf branch from d14e97a to 7acf521 Compare January 23, 2018 18:04
@Hyungsin
Copy link

Hyungsin commented Jan 23, 2018

@gebart, @kaspar030, @smlng, we updated this PR. Please check.

Again, this PR is to minimize the number of slow clock (e.g., 32 kHz) domain accesses from the fast main clock domain (e.g., 48 MHz), when xtimer is fed by a slow clock. This cross clock domain access takes a long time, which increases energy consumption. Specifically, we aim to avoid
(1) accessing the slow clock's "current time info" and
(2) setting a timer with the slow clock.

1. Cooperative clocking to minimize accessing a different clock domain
When calling xtimer_now(), the slow clock's "current time info" is directly accessed only when the main clock (STIMER) and the slow clock (XTIMER) are not synchronized (i.e., xtimer_sync = false). "xtimer_sync" becomes false whenever the main clock is turned off (i.e., low power mode). If "xtimer_sync" is false, both the main clock and the slow clock are accessed and synchronized, resulting in "xtimer_sync = true". If "xtimer_sync" is true, the slow clock's "current time info" is indirectly given by using a main clock access and frequency conversion.

2. Remove redundant timer setting
To check overflow, the current xtimer implementation sets a default timer which expires at 0xFFFFFFFF whenever timer_list_head is NULL. But if overflow_list_head and long_list_head are NULL also (no long timer), this overflow check is not necessary, only increasing energy consumption. To resolve the problem, this PR sets the overflow-check timer only when overflow_list_head and/or long_list_head are valid.

@@ -104,9 +121,28 @@ static inline uint32_t _xtimer_now(void)
} while (_xtimer_high_cnt != latched_high_cnt);

return latched_high_cnt | now;
#else
#if (XTIMER_HZ < 1000000ul) && (STIMER_HZ >= 1000000ul)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is confusing that the conditional above is around xtimer width (XTIMER_MASK is set for timers less than 32 bits wide), while this conditional has to do with the frequency of the underlying timers.
Also, using #elif saves you the second #endif below.

@jnohlgard
Copy link
Member

I agree with the idea and the intention of this PR, reducing the number of clock domain synchronizations should be reduced if possible. However, the implementation in this PR is a bit messy with lots of preprocessor conditionals strewn all over the xtimer code, even more than before, and I haven't really grasped the whole configuration yet with STIMER_HZ vs XTIMER_HZ etc. (docs are missing on the new functions and macros).

I have also been working on a concept (just ideas, no code), around combining fast timers and slow timers which may be useful for the discussion here. It is too long to write as a comment here, it would get lost in the thread, but I will write up an email to the developers mailing list later in the week or next week. The basic idea is to add a layer between the xtimer implementation and the periph/timer implementation which acts as a single timer but uses two underlying hardware timers, using the low frequency timer only for targets which are more than a few low frequency ticks into the future. By keeping the shim separate from xtimer it will be easier to review and easier to maintain. xtimer can also be simplified a bit if we could get rid of the 16 bit overflow timer list for timers which are less than 32 bits wide.
The synchronization between timers would occur on a scheduled tick of the low frequency timer which is automated by the new layer.

@immesys
Copy link
Contributor Author

immesys commented Jan 24, 2018

@gebart we are willing to do the power profiling of your implementation if you want the help. All of our work has been driven by power consumption, so at times we have placed elegance second. We can probably get this approach cleaner but if there is a more elegant solution at the same power budget then lets go for that rather.

@stale
Copy link

stale bot commented Aug 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Aug 10, 2019
@stale stale bot closed this Sep 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: timers Area: timer subsystems State: stale State: The issue / PR has no activity for >185 days Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants