-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xtimer: improve performance #7053
Conversation
I guess good reviewers would be @kaspar030 @gebart |
Just for comparison, |
Nice work! |
@Hyungsin @immesys Very nice work!
It would be great if you could document in which cases a developer should mark the callback as trivial and when not. |
Nice find! I've been working on a timer rework (with the aim of making multiple timers, e.g., high speed and RTC, possible). Now I have to go back to the drawing board, to minimize now()-calls, which I assumed to be cheap... Adding another field to xtimer_t bloats it even more. Having to "annotate" timer ISRs increases complexity. Maybe there's another way. If I understand it correctly, this is only an issue for getting a low-power timer's value, because of the clock domain difference. We could use a same-clock-domain timer (e.g., systick on cortex-m) to measure the actual time taken by an ISR and use that to deduce "trivial_callback". |
Oh that's a nice idea! @Hyungsin we should try that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the xtimer_drift test application completely breaks down with this PR applied. Both native and frdm-kw41z are broken.
xtimer_drift on native spews output for about 1 second, then just stops. I don't know where the problem is, only that it works without this PR, but not with it.
Interesting, we can investigate
…On Jun 4, 2017 6:08 PM, "Joakim Nohlgård" ***@***.***> wrote:
***@***.**** requested changes on this pull request.
the xtimer_drift test application completely breaks down with this PR
applied. Both native and frdm-kw41z are broken.
xtimer_drift on native spews output for about 1 second, then just stops. I
don't know where the problem is, only that it works without this PR, but
not with it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7053 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AECbk0jS_X6heIjTWHBqDsB7rmP2zGIQks5sA1TwgaJpZM4NZ4fY>
.
|
sys/xtimer/xtimer_core.c
Outdated
|
||
/* time update after executing callback */ | ||
if (!timer->trivial_callback) { | ||
now = _xtimer_lltimer_now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assume you have 10 timers with non trivial callbacks, this will result in 10x _xtimer_lltimer_now()
instead of just once with the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for instance, keep track (e.g. boolean non-trivial = false
) if at least one callback was non-trivial and than simplify below to
if (non-trivial && (reference > _xtimer_lltimer_now()) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Please note that the current implementation calls _xtimer_lltimer_now() in _timer_left(xx) function, which calls _xtimer_lltimer_now() every time regardless of trivial or non-trivial callbacks. In contrast, this PR calls _xtimer_lltimer_now() only for non trivial callbacks.
_xtimer_set_absolute(timer, target); | ||
uint32_t now = _xtimer_now(); | ||
uint32_t target = now + offset; | ||
_xtimer_set_absolute(timer, target, now); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if an interrupt or context switch happens between _xtimer_now()
and _xtimer_set_absolute
, IMHO this would mess up calculations in latter function or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it. If something happens in the middle, between calling _xtimer_now() and actually using this value, "now" would be outdated and it can mess up the calculation in _xtimer_set_absolute. Is this your point? But on the other hand, I think that "now" can be outdated always wherever it is used... not just here... isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes basically you're right - depends on where/when parts are protected by irq_disable()
and irq_restore(state)
.
@kaspar030 @lebrush , thank you guys for constructive comments! |
Ok we have changed our approach. The complexity has gone up a bit but it is less hacky. The synopsis is like this:
This PR is not ready for merge as is (I am sure there are style problems) but it illustrates the approach. Instead of a hacky heuristic like "trivial callback" we actually measure the time the callback takes on a high frequency timer EDIT: credit for this idea goes to @kaspar030 |
The recent commit has also fixed some bugs of the previous version and now becomes more stable. @kaspar030, your understanding is right indeed. Reading the time info from a fast timer is cheap, while that from a slow timer is expensive. So, we provide a room for user configuration. If xtimer is fed by a fast clock, we can disable this feature by not defining STIMER_DEV. Note that STIMER is a fast timer that helps to improve slow xtimer's performance. When xtimer is slow, we define STIMER_DEV to use it for minimizing xtimer now() calls. |
@immesys Why did you close this? |
argh. It was auto closed when I accidentally cleaned up a live branch. my bad. |
inconclusive discussion, remove milestone |
Yes, don't put this in the release, we are still working on it. |
@Hyungsin can you ensure the forupstream_xtimer_perf branch has the latest improvements, in case this becomes a merge candidate again? |
@immesys, I rebased and updated. |
d14e97a
to
7acf521
Compare
@gebart, @kaspar030, @smlng, we updated this PR. Please check. Again, this PR is to minimize the number of slow clock (e.g., 32 kHz) domain accesses from the fast main clock domain (e.g., 48 MHz), when xtimer is fed by a slow clock. This cross clock domain access takes a long time, which increases energy consumption. Specifically, we aim to avoid 1. Cooperative clocking to minimize accessing a different clock domain 2. Remove redundant timer setting |
@@ -104,9 +121,28 @@ static inline uint32_t _xtimer_now(void) | |||
} while (_xtimer_high_cnt != latched_high_cnt); | |||
|
|||
return latched_high_cnt | now; | |||
#else | |||
#if (XTIMER_HZ < 1000000ul) && (STIMER_HZ >= 1000000ul) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is confusing that the conditional above is around xtimer width (XTIMER_MASK is set for timers less than 32 bits wide), while this conditional has to do with the frequency of the underlying timers.
Also, using #elif
saves you the second #endif
below.
I agree with the idea and the intention of this PR, reducing the number of clock domain synchronizations should be reduced if possible. However, the implementation in this PR is a bit messy with lots of preprocessor conditionals strewn all over the xtimer code, even more than before, and I haven't really grasped the whole configuration yet with STIMER_HZ vs XTIMER_HZ etc. (docs are missing on the new functions and macros). I have also been working on a concept (just ideas, no code), around combining fast timers and slow timers which may be useful for the discussion here. It is too long to write as a comment here, it would get lost in the thread, but I will write up an email to the developers mailing list later in the week or next week. The basic idea is to add a layer between the xtimer implementation and the periph/timer implementation which acts as a single timer but uses two underlying hardware timers, using the low frequency timer only for targets which are more than a few low frequency ticks into the future. By keeping the shim separate from xtimer it will be easier to review and easier to maintain. xtimer can also be simplified a bit if we could get rid of the 16 bit overflow timer list for timers which are less than 32 bits wide. |
@gebart we are willing to do the power profiling of your implementation if you want the help. All of our work has been driven by power consumption, so at times we have placed elegance second. We can probably get this approach cleaner but if there is a more elegant solution at the same power budget then lets go for that rather. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions. |
This PR is @Hyungsin's work, I just cleaned it up and rebased it for pushing upstream.
The PR attempts to minimize the number of times that xtimer_now is called, as this operation is far more expensive than you would expect.
On SAMR21, as part of their energy saving design, Atmel has the timer counter registers on a different clock domain in the chip, it is necessary to perform a synchronized read to get the current value of the counter (e.g here ). The nature of this clock sync is that it takes a few cycles of the clock driving the counter to synchronize. We are using a 32768 hz clock for xtimer as it operates during sleep (OSCULP) and in practice it takes several hundred microseconds to read the timer counter register. This does not appear to be a bug, it is just the cost of reading a register that is on a different clock domain (likely some metastability fix in the silicon).
What this means is that a program that simply wakes up and immediately sleeps takes roughly 2ms to do so. Initially we attributed this to PLL startup etc, but after digging in we found that is actually all negligible, the real cost is reading the timer register six times (four on wakeup, two on setting a new timer when going to sleep). A simple refactor of the xtimer code allows this to happen only three times (two on wakeup, one to set a new timer when going to sleep). Certain xtimer callbacks like unlocking a mutex can be assumed to take negligable time, so we allow them to be flagged as being trivial, so that now after the callback can be assumed to be within XTIMER_ISR_BACKOFF of now before the callback, removing a register read. Other changes include passing now as a parameter to internal functions instead of re-reading from the timer register.
To illustrate, look at this plot of a program just waking up and immediately going to sleep again using xtimer_usleep in a loop.
The first thing to note is that the overhead of xtimer_now is not affected by the primary CPU clock frequency, because it is on the timer's clock domain (both 8Mhz and 48Mhz take the same amount of time). This also confirms that the delay is not PLL, because the 8Mhz is a direct RC with no startup delay.
Secondly you can see with this PR the time is reduced from ~1.9ms to ~0.8ms. This is very useful for low power applications wanting to spend as little time as possible in high-power modes, and brings RIOT-OS more in line with Contiki and TinyOS in timer overhead.