xtimer: improve performance #7053

immesys · 2017-05-12T23:34:15Z

This PR is @Hyungsin's work; I just cleaned it up and rebased it for pushing upstream.

The PR attempts to minimize the number of times that xtimer_now is called, as this operation is far more expensive than you would expect.

On SAMR21, as part of its energy-saving design, Atmel places the timer counter registers on a different clock domain in the chip, so a synchronized read is necessary to get the current value of the counter (e.g. here). This clock synchronization takes a few cycles of the clock driving the counter. We are using a 32768 Hz clock (OSCULP) for xtimer because it keeps running during sleep, and in practice it takes several hundred microseconds to read the timer counter register. This does not appear to be a bug; it is just the cost of reading a register that is on a different clock domain (likely some metastability fix in the silicon).

What this means is that a program that simply wakes up and immediately sleeps takes roughly 2 ms to do so. Initially we attributed this to PLL startup etc., but after digging in we found that this is all negligible; the real cost is reading the timer register six times (four on wakeup, two when setting a new timer before going to sleep). A simple refactor of the xtimer code reduces this to three reads (two on wakeup, one to set a new timer when going to sleep). Certain xtimer callbacks, like unlocking a mutex, can be assumed to take negligible time, so we allow them to be flagged as trivial: "now" after the callback can then be assumed to be within XTIMER_ISR_BACKOFF of "now" before the callback, which removes a register read. Other changes include passing now as a parameter to internal functions instead of re-reading it from the timer register.
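To make the "pass now as a parameter" idea concrete, here is a minimal sketch that counts expensive reads with a simulated counter. All names (slow_timer_read, set_relative_before, set_relative_after, ticks_until) are hypothetical and only for illustration; they are not the functions touched by this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the slow, cross-clock-domain counter read. */
static uint32_t fake_counter;   /* simulated hardware counter value */
static unsigned slow_reads;     /* number of expensive reads performed */

static uint32_t slow_timer_read(void)
{
    slow_reads++;
    return fake_counter;
}

/* Before: the helper re-reads the counter although its caller just did. */
static uint32_t ticks_until_rereading(uint32_t target)
{
    return target - slow_timer_read();
}

static uint32_t set_relative_before(uint32_t offset)
{
    uint32_t target = slow_timer_read() + offset;
    return ticks_until_rereading(target);
}

/* After: the caller's snapshot of "now" is passed down and reused. */
static uint32_t ticks_until(uint32_t target, uint32_t now)
{
    return target - now;
}

static uint32_t set_relative_after(uint32_t offset)
{
    uint32_t now = slow_timer_read();
    return ticks_until(now + offset, now);
}

/* Usage: both variants compute the same delay, with two reads vs. one. */
static void compare_read_counts(void)
{
    fake_counter = 1000;
    slow_reads = 0;
    assert(set_relative_before(50) == 50);
    assert(slow_reads == 2);

    slow_reads = 0;
    assert(set_relative_after(50) == 50);
    assert(slow_reads == 1);
}
```

The trade-off is that the reused snapshot is slightly older by the time it is consumed, which is acceptable only when the code in between is short.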

To illustrate, look at this plot of a program just waking up and immediately going to sleep again using xtimer_usleep in a loop.

The first thing to note is that the overhead of xtimer_now is not affected by the primary CPU clock frequency, because it is on the timer's clock domain (both 8 MHz and 48 MHz take the same amount of time). This also confirms that the delay is not caused by the PLL, because the 8 MHz clock is a direct RC oscillator with no startup delay.

Secondly, you can see that with this PR the time is reduced from ~1.9 ms to ~0.8 ms. This is very useful for low-power applications that want to spend as little time as possible in high-power modes, and it brings RIOT-OS more in line with Contiki and TinyOS in timer overhead.
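As a rough sanity check on these numbers, the sketch below estimates the cost of a single counter read. It assumes that the whole saving (about 1900 µs down to about 800 µs) comes from the three reads that were removed, which is a simplification. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLOW_CLOCK_HZ 32768u   /* clock driving the timer counter */

/* Average cost of one read, in microseconds, if the whole saving is
 * attributed to the reads that were removed (an assumption). */
static uint32_t per_read_us(uint32_t before_us, uint32_t after_us,
                            uint32_t reads_removed)
{
    assert(reads_removed > 0 && before_us >= after_us);
    return (before_us - after_us) / reads_removed;
}

/* Convert microseconds to slow-clock ticks, rounded to nearest. */
static uint32_t us_to_slow_ticks(uint32_t us)
{
    return (uint32_t)(((uint64_t)us * SLOW_CLOCK_HZ + 500000u) / 1000000u);
}
```

With the approximate figures above, per_read_us(1900, 800, 3) gives 366 µs, which corresponds to about 12 ticks of the 32768 Hz clock. That is in line with "several hundred microseconds" per read, but it is an estimate, not a measurement.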

immesys · 2017-05-12T23:36:30Z

I guess good reviewers would be @kaspar030 @gebart

Hyungsin · 2017-05-13T00:18:30Z

Just for comparison,
TinyOS calls 'timer now' twice when waking up through a timer interrupt and twice when falling asleep after setting up a timer.

smlng · 2017-05-13T14:04:20Z

Nice work!

lebrush · 2017-05-16T06:59:41Z

@Hyungsin @immesys Very nice work!
One question and one suggestion :-)

mutex_unlock behaves "differently" depending on the mutex status (unlocked, locked(1), locked(1+)). Did you test if it can be considered trivial as well for the last (1+) case? I say it because it could change the current behaviour.

It would be great if you could document in which cases a developer should mark the callback as trivial and when not.

kaspar030 · 2017-05-18T09:02:46Z

Nice find!

I've been working on a timer rework (with the aim of making multiple timers, e.g., high speed and RTC, possible). Now I have to go back to the drawing board, to minimize now()-calls, which I assumed to be cheap...

Adding another field to xtimer_t bloats it even more. Having to "annotate" timer ISRs increases complexity. Maybe there's another way.

If I understand it correctly, this is only an issue for getting a low-power timer's value, because of the clock domain difference. We could use a same-clock-domain timer (e.g., systick on cortex-m) to measure the actual time taken by an ISR and use that to deduce "trivial_callback".
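A minimal sketch of that idea, using simulated counters: the callback duration is measured on a cheap same-domain timer, and the slow timer is re-read only if the callback ran longer than a threshold. The names fast_now, slow_now, run_callback and the threshold BACKOFF_FAST_TICKS are assumptions for illustration, not existing xtimer API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BACKOFF_FAST_TICKS 30u   /* hypothetical threshold, in fast ticks */

static uint32_t fast_counter;    /* simulated cheap, same-domain counter */
static uint32_t slow_counter;    /* simulated expensive counter */
static unsigned slow_reads;      /* number of expensive reads performed */

static uint32_t fast_now(void) { return fast_counter; }
static uint32_t slow_now(void) { slow_reads++; return slow_counter; }

typedef void (*timer_cb_t)(void *arg);

/* Run one callback and return the value of "now" to use afterwards.
 * The slow timer is re-read only when the callback was not trivial. */
static uint32_t run_callback(timer_cb_t cb, void *arg, uint32_t now)
{
    assert(cb != NULL);

    uint32_t start = fast_now();
    cb(arg);
    uint32_t elapsed = fast_now() - start;

    if (elapsed > BACKOFF_FAST_TICKS) {
        now = slow_now();        /* callback took long: refresh "now" */
    }
    return now;
}

/* Two simulated callbacks that advance the counters while they "run". */
static void short_cb(void *arg) { (void)arg; fast_counter += 5; }
static void long_cb(void *arg)
{
    (void)arg;
    fast_counter += 500;
    slow_counter += 16;
}
```

This avoids both the extra field in the timer struct and the need to annotate callbacks, at the price of two cheap reads per callback.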

immesys · 2017-05-18T15:07:33Z

Oh that's a nice idea! @Hyungsin we should try that

jnohlgard

The xtimer_drift test application completely breaks down with this PR applied. Both native and frdm-kw41z are broken.
xtimer_drift on native spews output for about 1 second, then just stops. I don't know where the problem is, only that it works without this PR but not with it.

immesys · 2017-06-05T01:15:44Z

Interesting, we can investigate

smlng · 2017-06-07T14:16:09Z

sys/xtimer/xtimer_core.c

+
+        /* time update after executing callback */
+        if (!timer->trivial_callback) {
+          now = _xtimer_lltimer_now();


Assume you have 10 timers with non-trivial callbacks: this will result in 10 calls to _xtimer_lltimer_now() instead of just one with the current implementation.

For instance, keep track (e.g. bool non_trivial = false) of whether at least one callback was non-trivial, and then simplify the check below to

if (non_trivial && (reference > _xtimer_lltimer_now())) {

Hyungsin · Jun 23, 2017

I don't think so. Please note that the current implementation calls _xtimer_lltimer_now() in the _timer_left(xx) function, which calls it every time regardless of whether the callbacks are trivial or not. In contrast, this PR calls _xtimer_lltimer_now() only for non-trivial callbacks.

smlng · 2017-06-07T14:25:14Z

sys/xtimer/xtimer_core.c

-        _xtimer_set_absolute(timer, target);
+        uint32_t now = _xtimer_now();
+        uint32_t target = now + offset;
+        _xtimer_set_absolute(timer, target, now);


What if an interrupt or context switch happens between _xtimer_now() and _xtimer_set_absolute()? IMHO this would mess up the calculations in the latter function, wouldn't it?

Hyungsin · Jun 23, 2017

I got it. If something happens between calling _xtimer_now() and actually using its value, "now" would be outdated, which can mess up the calculation in _xtimer_set_absolute. Is this your point? On the other hand, I think "now" can be outdated wherever it is used, not just here, can't it?

smlng · Jul 13, 2017

Yes, basically you're right. It depends on where and when parts are protected by irq_disable() and irq_restore(state).
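To illustrate the point about protection, here is a minimal sketch in which the read of "now" and its use sit inside one critical section. mock_irq_disable() and mock_irq_restore() are hypothetical stand-ins that only track a flag; the other names are made up for this example as well.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for irq_disable()/irq_restore(): they only
 * track an "interrupts enabled" flag so the sketch runs anywhere. */
static unsigned irq_enabled = 1;

static unsigned mock_irq_disable(void)
{
    unsigned previous = irq_enabled;
    irq_enabled = 0;
    return previous;
}

static void mock_irq_restore(unsigned state)
{
    irq_enabled = state;
}

static uint32_t counter;            /* simulated timer counter */
static uint32_t programmed_target;  /* last target that was set */

static uint32_t timer_now(void) { return counter; }

static void set_absolute(uint32_t target, uint32_t now)
{
    /* The caller must hold the critical section, so that no ISR or
     * context switch can run between reading `now` and using it. */
    assert(irq_enabled == 0);
    (void)now;
    programmed_target = target;
}

static void set_relative(uint32_t offset)
{
    unsigned state = mock_irq_disable();
    uint32_t now = timer_now();
    set_absolute(now + offset, now);
    mock_irq_restore(state);        /* restores the previous state */
}
```

Note that a critical section only rules out preemption. The hardware counter keeps advancing, so "now" is still a snapshot, just one whose age is bounded by the code inside the section.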

Hyungsin · 2017-06-23T04:08:06Z

@kaspar030 @lebrush , thank you guys for constructive comments!
I removed trivial_callback and modified the code so that the main clock measures the actual time period for executing _shoot(timer). This time period determines if a callback is trivial or not.

immesys · 2017-06-29T19:04:00Z

OK, we have changed our approach. The complexity has gone up a bit, but it is less hacky.

The synopsis is:

- We can't use high-frequency timers for xtimer directly, because that means running them during sleep, which is unacceptable.
- We can't use only low-frequency timers for xtimer, because they are very slow to query for the current value.
- High-frequency timers are essentially free while the CPU is awake.
- Therefore we take a hybrid approach: low-frequency timers are used for timekeeping while asleep, and high-frequency timers are used while awake.

This PR is not ready for merge as is (I am sure there are style problems), but it illustrates the approach. Instead of a hacky heuristic like "trivial callback", we actually measure the time the callback takes on a high-frequency timer.

EDIT: credit for this idea goes to @kaspar030

Hyungsin · 2017-06-29T19:10:41Z

The recent commit also fixes some bugs in the previous version, so the PR is now more stable.

@kaspar030, your understanding is indeed right. Reading the time from a fast timer is cheap, while reading it from a slow timer is expensive. So we leave room for user configuration. If xtimer is fed by a fast clock, this feature can be disabled by not defining STIMER_DEV. Note that STIMER is a fast timer that helps to improve a slow xtimer's performance. When xtimer is slow, we define STIMER_DEV and use it to minimize xtimer now() calls.

jnohlgard · 2017-10-28T04:55:33Z

@immesys Why did you close this?

immesys · 2017-10-28T04:58:45Z

argh. It was auto closed when I accidentally cleaned up a live branch. my bad.

smlng · 2018-01-15T21:58:48Z

inconclusive discussion, remove milestone

immesys · 2018-01-15T21:59:41Z

Yes, don't put this in the release, we are still working on it.

immesys · 2018-01-15T22:27:28Z

@Hyungsin can you ensure the forupstream_xtimer_perf branch has the latest improvements, in case this becomes a merge candidate again?

Hyungsin · 2018-01-17T08:24:17Z

@immesys, I rebased and updated.
You can just clone the following branch.
https://github.com/Hyungsin/RIOT-OS/tree/forupstream_xtimer_perf

Hyungsin · 2018-01-23T18:38:14Z

@gebart, @kaspar030, @smlng, we updated this PR. Please check.

Again, this PR is to minimize the number of slow clock (e.g., 32 kHz) domain accesses from the fast main clock domain (e.g., 48 MHz), when xtimer is fed by a slow clock. This cross clock domain access takes a long time, which increases energy consumption. Specifically, we aim to avoid
(1) accessing the slow clock's "current time info" and
(2) setting a timer with the slow clock.

1. Cooperative clocking to minimize accessing a different clock domain
When xtimer_now() is called, the slow clock's "current time info" is accessed directly only when the main clock (STIMER) and the slow clock (XTIMER) are not synchronized (i.e., xtimer_sync = false). "xtimer_sync" becomes false whenever the main clock is turned off (i.e., in low-power mode). If "xtimer_sync" is false, both the main clock and the slow clock are read and synchronized, which sets "xtimer_sync = true". If "xtimer_sync" is true, the slow clock's "current time info" is derived indirectly from a main clock read and a frequency conversion.

2. Remove redundant timer setting
To check for overflow, the current xtimer implementation sets a default timer that expires at 0xFFFFFFFF whenever timer_list_head is NULL. But if overflow_list_head and long_list_head are also NULL (no long timers), this overflow check is unnecessary and only increases energy consumption. To resolve this, the PR sets the overflow-check timer only when overflow_list_head and/or long_list_head are non-NULL.
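The two points above can be sketched as follows, with simulated counters instead of hardware. This is an illustration under assumptions, not the code in this PR: the frequencies, the in_sync flag (playing the role of xtimer_sync), and the names coop_now, on_sleep and need_overflow_timer are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOW_HZ 32768u      /* slow clock feeding xtimer (assumed) */
#define FAST_HZ 1000000u    /* fast main-domain timer (assumed) */
static_assert(FAST_HZ >= SLOW_HZ, "fast timer must not be slower");

static uint32_t slow_hw, fast_hw;   /* simulated hardware counters */
static unsigned slow_reads;         /* number of expensive reads */

static bool     in_sync;            /* plays the role of xtimer_sync */
static uint32_t slow_ref, fast_ref; /* values captured at the last sync */

static uint32_t slow_read(void) { slow_reads++; return slow_hw; }
static uint32_t fast_read(void) { return fast_hw; }

/* Entering low-power mode stops the fast timer, so the sync is lost. */
static void on_sleep(void) { in_sync = false; }

/* 1. Cooperative clocking: touch the slow clock only when out of sync. */
static uint32_t coop_now(void)
{
    if (!in_sync) {
        slow_ref = slow_read();
        fast_ref = fast_read();
        in_sync = true;
        return slow_ref;
    }
    /* In sync: derive slow-clock time from the cheap fast counter. */
    uint32_t fast_elapsed = fast_read() - fast_ref;
    uint32_t slow_elapsed =
        (uint32_t)(((uint64_t)fast_elapsed * SLOW_HZ) / FAST_HZ);
    return slow_ref + slow_elapsed;
}

/* 2. Overflow-check timer: needed only if a long timer is pending. */
struct timer_node { struct timer_node *next; };

static bool need_overflow_timer(const struct timer_node *overflow_list_head,
                                const struct timer_node *long_list_head)
{
    return (overflow_list_head != NULL) || (long_list_head != NULL);
}
```

In this sketch the converted time is rounded down to whole slow ticks, so it can lag the real slow counter by up to one tick between synchronizations.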

jnohlgard · 2018-01-24T09:56:54Z

sys/include/xtimer/implementation.h

@@ -104,9 +121,28 @@ static inline uint32_t _xtimer_now(void)
    } while (_xtimer_high_cnt != latched_high_cnt);

    return latched_high_cnt | now;
+#else
+#if (XTIMER_HZ < 1000000ul) && (STIMER_HZ >= 1000000ul)


It is confusing that the conditional above is about the xtimer width (XTIMER_MASK is set for timers less than 32 bits wide), while this conditional is about the frequency of the underlying timers.
Also, using #elif saves you the second #endif below.
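A small sketch of the #elif layout, assuming default values for the three macros so that it compiles on its own. The function now_variant and the enum are hypothetical and only show which branch the preprocessor selects; the real _xtimer_now() bodies are not reproduced here.

```c
#include <assert.h>

/* Assumed defaults so the sketch compiles stand-alone. */
#ifndef XTIMER_MASK
#define XTIMER_MASK 0            /* 0: timer is a full 32 bits wide */
#endif
#ifndef XTIMER_HZ
#define XTIMER_HZ 32768ul
#endif
#ifndef STIMER_HZ
#define STIMER_HZ 1000000ul
#endif

enum now_variant {
    NOW_NARROW_TIMER,    /* width-extension path for narrow timers */
    NOW_VIA_FAST_TIMER,  /* slow xtimer assisted by a fast timer */
    NOW_PLAIN_READ       /* direct read of the low-level timer */
};

static inline enum now_variant now_variant(void)
{
#if XTIMER_MASK
    return NOW_NARROW_TIMER;
#elif (XTIMER_HZ < 1000000ul) && (STIMER_HZ >= 1000000ul)
    return NOW_VIA_FAST_TIMER;
#else
    return NOW_PLAIN_READ;
#endif
}

/* Usage: with the defaults above, the fast-timer branch is selected. */
static void check_selected_variant(void)
{
    assert(now_variant() == NOW_VIA_FAST_TIMER);
}
```

A single #if/#elif/#else chain needs one #endif and keeps the three cases visibly exclusive, although it still mixes a width test with a frequency test.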

jnohlgard · 2018-01-24T10:26:20Z

I agree with the idea and the intention of this PR: the number of clock domain synchronizations should be reduced if possible. However, the implementation in this PR is a bit messy, with lots of preprocessor conditionals strewn all over the xtimer code, even more than before, and I haven't really grasped the whole configuration yet with STIMER_HZ vs XTIMER_HZ etc. (docs are missing on the new functions and macros).

I have also been working on a concept (just ideas, no code), around combining fast timers and slow timers which may be useful for the discussion here. It is too long to write as a comment here, it would get lost in the thread, but I will write up an email to the developers mailing list later in the week or next week. The basic idea is to add a layer between the xtimer implementation and the periph/timer implementation which acts as a single timer but uses two underlying hardware timers, using the low frequency timer only for targets which are more than a few low frequency ticks into the future. By keeping the shim separate from xtimer it will be easier to review and easier to maintain. xtimer can also be simplified a bit if we could get rid of the 16 bit overflow timer list for timers which are less than 32 bits wide.
The synchronization between timers would occur on a scheduled tick of the low frequency timer which is automated by the new layer.
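One way to picture the selection rule of such a shim is the sketch below. It is only an illustration under assumptions: the frequencies, the threshold LF_MIN_TICKS for "a few low frequency ticks", and the name choose_timer are all hypothetical, and the scheduled resynchronization is left out.

```c
#include <assert.h>
#include <stdint.h>

#define LF_HZ 32768u        /* low-frequency timer (assumed) */
#define HF_HZ 1000000u      /* high-frequency timer (assumed) */
#define LF_MIN_TICKS 4u     /* hypothetical "a few ticks" threshold */
static_assert(HF_HZ >= LF_HZ, "HF timer must not be slower than LF");

typedef enum { USE_HF_TIMER, USE_LF_TIMER } timer_choice_t;

/* Pick the hardware timer for a target that lies `offset_hf_ticks`
 * high-frequency ticks in the future. */
static timer_choice_t choose_timer(uint32_t offset_hf_ticks)
{
    uint32_t offset_lf_ticks =
        (uint32_t)(((uint64_t)offset_hf_ticks * LF_HZ) / HF_HZ);

    /* Far targets go to the LF timer; near ones stay on the HF timer. */
    return (offset_lf_ticks > LF_MIN_TICKS) ? USE_LF_TIMER : USE_HF_TIMER;
}
```

With these assumed values, a target 100 µs away stays on the high-frequency timer (3 low-frequency ticks), while a target 1 ms away (32 ticks) is handed to the low-frequency timer.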

immesys · 2018-01-24T17:54:54Z

@gebart we are willing to do the power profiling of your implementation if you want the help. All of our work has been driven by power consumption, so at times we have placed elegance second. We can probably make this approach cleaner, but if there is a more elegant solution at the same power budget, then let's go for that instead.

stale · 2019-08-10T07:07:57Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

smlng added the Type: enhancement and Area: timers labels May 13, 2017

smlng requested a review from kaspar030 May 13, 2017 14:05

smlng mentioned this pull request May 18, 2017

ps: fix schedstatistics #6975

Merged

miri64 assigned kaspar030 May 30, 2017

jnohlgard requested changes Jun 5, 2017

smlng reviewed Jun 7, 2017

aabadie modified the milestone: Release 2017.07 Jun 26, 2017

aabadie modified the milestones: Release 2017.07, Release 2017.10 Jun 30, 2017

immesys closed this Oct 27, 2017

immesys deleted the forupstream_xtimer_perf branch October 27, 2017 22:44

immesys restored the forupstream_xtimer_perf branch October 28, 2017 04:58

immesys reopened this Oct 28, 2017

smlng modified the milestones: Release 2017.10, Release 2018.01 Nov 16, 2017

smlng removed this from the Release 2018.01 milestone Jan 15, 2018

sys/xtimer: add cooperative clocking feature

7acf521

immesys force-pushed the forupstream_xtimer_perf branch from d14e97a to 7acf521 Compare January 23, 2018 18:04

jnohlgard reviewed Jan 24, 2018

Hyungsin mentioned this pull request Aug 9, 2018

sys: xtimer concurrency/robustness improvement #9530

Merged

stale bot added the State: stale label Aug 10, 2019

stale bot closed this Sep 10, 2019
