
Notification/interrupt latency: resetting IRQ notification mask takes hundreds of cycles #517

Closed
timblakely opened this issue Apr 29, 2022 · 16 comments

Comments

@timblakely

Greetings!

I'm on an incredibly misguided misadventure experimenting with using Hubris in a BLDC application. Given the high currents and potential for copious amounts of magic smoke, the, erm... "pucker factor" is pretty high, but so far so good! 😅

I've gotten Hubris's codebase to the point where I can depend on it as if it were an external crate (lots of env variable overriding and explicit path setting, but that's for another bug), and am now working through the intricacies of how Hubris handles interrupts as notifications. It appears that there's a Not Small™️ amount of overhead associated with interrupt/notification handling, at least in terms of a 40KHz control loop. I'm building this on an H723 where I've confirmed I can boost the SYSCLK to 520MHz (!), with an AHBx bus speed of 270MHz, which means I've got just about 13k instructions to do the full commutation + FoC computation, including overhead.

I've been benchmarking the interrupt latency using an Idolatry server via idol_runtime::dispatch_n and the handle_notification section of the NotificationHandler trait. For the benchmark setup I've got a timer going at full tilt, have the OC1 channel drive a pin high, and catch the update interrupt. The moment we enter DefaultHandler from arm_m.rs I drive a GPIO pin high via the BSRR register so I can tell exactly when we enter the interrupt handler. Hubris then detects the task that needed the interrupt, context switches via PendSV, and resumes the task. Since this is an Idol server, the sys_recv call in dispatch_n returns and in turn calls the handle_notification method on the server (a rough sketch of the server shape follows the list below).

Apologies, this is kinda wordy; here's a probe of what's going on here (this is done at opt-level = 3, which is both different from the stock Hubris config and much larger in terms of flash):

[Scope capture: latency]

  1. First, the timer goes off, driving the yellow trace and firing off the interrupt. This is the start of the 40KHz commutation handler.
  2. I hacked in a single instruction in DefaultHandler to drive the cyan trace high, indicating the first moment we get into Hubris.
  3. Inside of handle_notification, I first drive the magenta pin high, do some "work" inside my custom interrupt handler (200x cortex_m::asm::nop() calls just for this example), then drive it low when I'm effectively done with what I need to do in my interrupt handler.
  4. Now that I've handled the notification (née interrupt), I need to tell Hubris that I want to handle it again. This is done via a sys_irq_control call, which I've wrapped with calls to drive the blue trace high and low before/after:
    unsafe { core::ptr::write_volatile(0x58021018 as *mut u32, 1 << 8); }       // GPIOE BSRR: set pin 8 (blue trace high)
    sys_irq_control(TIM1_UPD_MASK, true);                                        // kernel call to re-enable the IRQ
    unsafe { core::ptr::write_volatile(0x58021018 as *mut u32, 1 << (8 + 16)); } // GPIOE BSRR: reset pin 8 (blue trace low)
  5. Finally, when we leave the handle_notification I set the cyan trace low.
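
Putting that together, the notification side of the server is shaped roughly like this (a trimmed sketch, not my exact code: the generated Idol operation dispatch and pin setup are elided, and the mask/pin constants stand in for my app configuration):

```rust
use idol_runtime::NotificationHandler;
use userlib::*;

// Example values standing in for my app configuration.
const TIM1_UPD_MASK: u32 = 1 << 0; // notification bit bound to TIM1_UP in app.toml
const GPIOE_BSRR: *mut u32 = 0x5802_1018 as *mut u32;

struct ServerImpl;

impl NotificationHandler for ServerImpl {
    fn current_notification_mask(&self) -> u32 {
        TIM1_UPD_MASK
    }

    fn handle_notification(&mut self, bits: u32) {
        if bits & TIM1_UPD_MASK != 0 {
            // magenta high, ~200 nops of "work", magenta low (elided)

            // Blue trace brackets the kernel round trip that re-arms the
            // IRQ (step 4 above).
            unsafe { core::ptr::write_volatile(GPIOE_BSRR, 1 << 8) };
            sys_irq_control(TIM1_UPD_MASK, true);
            unsafe { core::ptr::write_volatile(GPIOE_BSRR, 1 << (8 + 16)) };
        }
    }
}

// The main loop is the usual generated-dispatch shape:
//   loop { idol_runtime::dispatch_n(&mut incoming, &mut server); }
```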

Side note: I need to re-confirm this is at 520MHz... I had previously done so via probing MCO, but the 40ns delay between the magenta drop and the blue rise - two consecutive writes to BSRR - admittedly is a bit concerning, considering that translates to ~10 instructions instead of just a single one. Will look at the disassembly after filing this...

If I'm reading this correctly, this takes about 1.3us to get into the handle_notification call, or ~680 instructions (!). I know there's a decent amount of overhead in context switching, finding the right handling task, checking masks, setting up the MPU, etc. It still takes quite a while even on O3, though judging by the use cases I can find in the repo I suspect interrupt latency has not been a target for optimization. That said, the more surprising part is that simply notifying Hubris that I want to reset and listen again for the interrupt (sys_irq_control) takes over 3x as long as the actual interrupt handling itself (!!!), which translates to multiple hundreds of cycles at 520MHz. This seems... inefficient? Looking through the code it seems to need to call into the kernel to do so, and I haven't been able to trace beyond the kernel context switch in sys_irq_control_stub. Any ideas on how to handle notifications like this faster? Am I just missing an optimization flag or three?

As a workaround: how does one go about overriding interrupt handlers to use 'em outside of the Hubris framework? The reference manual suggests they're declared weak, but I tried using cortex-m-rt's #[interrupt] attribute on a fn TIM1_UP and it didn't seem to override Hubris's DefaultHandler. I don't see any instances of overriding the default interrupt handler within the Hubris codebase, though there's a pretty good chance I just plain missed it :)

@mkeeter
Collaborator

mkeeter commented Apr 29, 2022

At least some of this is due to the linear search in both the IRQ handler and the irq_control functions.

I've been looking for an excuse to use perfect hashing, so I'm going to poke around and see if I can make this O(1) instead of O(n). I'll update the issue when there's a branch to test. It's hard to say whether this is the limiting factor, but you've got a good test case for investigation!
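
For the curious, the difference is roughly the following (a toy sketch of the two lookup shapes, not the kernel's actual tables):

```rust
// Toy sketch of the two lookup strategies; not the kernel's real data structures.

struct IrqEntry {
    irq: u32,
    task: usize,
    notification: u32,
}

// Today: an O(n) scan of the per-app interrupt table on every IRQ and every
// irq_control call.
fn lookup_linear(table: &[IrqEntry], irq: u32) -> Option<&IrqEntry> {
    table.iter().find(|e| e.irq == irq)
}

// With a perfect hash generated at build time from the app.toml interrupt
// list, the lookup becomes a fixed hash plus one comparison, independent of
// how many IRQs the app defines.
fn lookup_phash(table: &[IrqEntry], irq: u32) -> Option<&IrqEntry> {
    let slot = phash(irq) % table.len();
    let e = &table[slot];
    if e.irq == irq { Some(e) } else { None }
}

fn phash(irq: u32) -> usize {
    // Placeholder: the real function would be derived at build time from the
    // app's specific IRQ set so that it is collision-free for that set.
    (irq as usize).wrapping_mul(0x9E37_79B9) >> 24
}
```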

(I'm not sure about declaring interrupts outside of the kernel's ownership, but one of the other engineers may know off-hand)

@cbiffle
Collaborator

cbiffle commented Apr 29, 2022

Declaring an #[interrupt] in your main.rs should override Hubris's interrupt binding for that ISR. However, the interaction between the cortex-m-rt weak symbols and the linker is a bit of a house of cards, and I could believe that it might not be working. Where'd you put your ISR out of curiosity?

@cbiffle
Collaborator

cbiffle commented Apr 29, 2022

@timblakely - incidentally - partially inspired by your use case here, I've added a sketch of a kernel mechanism for doing pin profiling on a scope or logic analyzer. Might be useful to you. #518

In this screenshot I've got sys_irq_control on stm32h743 @ 400MHz highlighted (syscall number 7 in the tiny blue decode numbers); it takes about 352 ns, or 140 cycles, with some margin of error (my logic analyzer is only sampling at 250 Msamp/s).

[Logic analyzer capture]

That measurement will be a little imprecise because it's toggling the pin from the kernel, meaning it will miss the register stacking code on kernel entry and exit. That should add only a few tens of cycles though. A sufficiently motivated individual could instrument the SVCall entry/exit sequence assembly in sys/kern/src/arch/arm_m.rs to get more precise numbers. (Edit: but the result would make the assembly code SoC-specific and would probably not be able to merge upstream, to be clear. S'why I've done it the way I have.)

@cbiffle
Collaborator

cbiffle commented Apr 29, 2022

@timblakely - a further thought - you may wish to test whether opt-level=3 is actually helping you vs opt-level="s" or "z". In the Rust port of my Cortex-M graphics demos I wound up using "z" because I got better results.

@timblakely
Author

Thanks for the input! And to be clear, if we're talking O(low-hundreds) of cycles spent in kernel space per loop, that's more than adequate for my purposes; I only brought it up since O(thousands) seemed like either a potential target for optimization or a sign that I was doing something obviously wrong!

Thinking this through I can confirm this is running at 520MHz, otherwise my timer reload and compare calculations wouldn't be accurate enough to give me the 40kHz PWM. So minus one variable :)

@mkeeter Thanks for looking into this! I've got a repro commit tagged to try it out (zero hurry from my end, of course :).

@cbiffle I had it in one of my tasks' main.rs, but I don't think it was actually in my kernel main.rs; I wonder if that could be it...? I also have a bit of a separate (read: wonky; again, longer discussion for another bug) build environment at the moment, so that could also affect it. I'll merge the two separate repos together and try again in the proper Hubris build directory.

And I love the profiling in #518! It's a formalization of my hacky flip-bits-in-arbitrary-GPIOx-registers-at-random-points-in-the-kernel approach, and is one less change in the core Hubris repo I need to keep track of. At the risk of yet another feature, it might make sense to gate it behind a profiling flag so that it doesn't even compile when not enabled, much less spend the cycles on the locks/calls/checks when they're not needed. Then again, if they're no-ops they're probably going to be optimized away regardless. Anyway, easy enough to patch in when needed if it's not intended to ever be fully merged. How do you like the Saleae, by the way? Been eyeing one for a while, and my Rigol's logic analyzer capabilities are rather limited, to put it generously.
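
(On the compile-out idea above, roughly what I mean -- hypothetical feature and hook names:)

```rust
// Hypothetical "profiling" feature gate: when the feature is off, the hook is
// an empty #[inline(always)] function and should disappear entirely.
#[cfg(feature = "profiling")]
#[inline(always)]
pub fn profile_event(pin_mask: u32) {
    // e.g. a raw BSRR write to whichever port is configured for profiling
    unsafe { core::ptr::write_volatile(0x5802_1018 as *mut u32, pin_mask) };
}

#[cfg(not(feature = "profiling"))]
#[inline(always)]
pub fn profile_event(_pin_mask: u32) {}
```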

Regarding opt-level, I did find that 3 was ever so slightly faster than z in terms of absolute performance, though at a pretty staggering flash cost. I'll try again with various opt levels using the more formal profiling approach and report back.

@timblakely
Author

Quick update: looks like there were a few changes to build/xtask recently - welcome ones surrounding configuration AFAICT! - but they've broken my depend-on-Hubris-from-a-remote-crate-via-toothpicks-and-duct-tape kludge, so I'm redoing my H723 changes in a proper Hubris branch. Will try to get some data later this weekend!

@timblakely
Author

timblakely commented May 1, 2022

Okay, have some data! Also, apologies ahead of time; I'm a bit long-winded at times 😅

Initial ISR latency

All experiments done with codegen-units = 1 and lto = true (looking forward to rust-lang/rust#90357 landing; got bit by DWARF errors in my previous G4-based BLDC 🥴 )

With opt-level="z"

From the moment the timer's update event fires (orange arrow at the top), it takes ~42 cycles to hit the isr_enter part of the profiling code from #518 (cyan). From there it's another ~153 cycles until isr_exit, so just under 200 to handle the ISR. There's about 80 cycles' worth of what I assume is PendSV handling (?), then we hit the secondary_syscall part (blue), which lasts about 625 cycles (1.2us). Another ~80 cycles, then we hit my interrupt handler where I nop for a bit (yellow). Important to note that I pull the yellow trace down before calling sys_irq_control(TIM1_UPD_MASK, true);. Finally we've got the syscall_enter and _exit part (magenta), which takes ~360 cycles.

[Rigol scope capture]

(please ignore the repeated final cyan pulse; my timer event is double-triggering for some weird reason that I haven't sorted out yet)

So in sum, approximate "ish" values:

isr: 153 cycles
secondary_syscall: 625 cycles
syscall: 360 cycles
PendSV assembly entrance/exit: ~80 cycles (roundtrip)

With opt-level=3

[Rigol scope capture]

isr: 127 cycles
secondary_syscall: 473 cycles
syscall: 135 cycles
PendSV assembly entrance/exit: ~70 cycles (roundtrip)

With opt-level=2

[Rigol scope capture]

isr: 127 cycles
secondary_syscall: 470 cycles
syscall: 150 cycles
PendSV assembly entrance/exit: ~66 cycles (roundtrip)

With opt-level=1

Gah, linking failed: rust-lld: error: section '.text' will not fit in region 'FLASH': overflowed by 3610 bytes. I think this is just my various flash allocations being too tiny, but the nice, helpful "your flash allocation for $TASK is X but is actually Y bytes" message doesn't show up. Lemme see if I can fix this... Wow, looks like I needed to double - and for some tasks triple (!) - the flash requirements for opt-level=1.

Side note: feature request? Detect which task changed and recompile only that + any kernel deps when updating app.toml instead of just checking if the entire file changed. PAC crates take a loooong time to compile...

... Huh. At opt-level=1 I get a HardFault...?

#[allow(unused_variables)]
#[doc(hidden)]
#[cfg_attr(cortex_m, link_section = ".HardFault.default")]
#[no_mangle]
pub unsafe extern "C" fn HardFault_(ef: &ExceptionFrame) -> ! {
    loop {
        // add some side effect to prevent this from turning into a UDF instruction
        // see rust-lang/rust#28728 for details
        atomic::compiler_fence(Ordering::SeqCst);
    }
}

Stack is borked, so no clue what caused the fault. No data from opt-level=1 for now, sorry :/

With opt-level=0

Agreed with @cbiffle's comment:

[profile.dev]
opt-level = 1 # no optimizations was just too painful in terms of flash size

With opt-level="s"

[Rigol scope capture]

isr: 125 cycles
secondary_syscall: 478 cycles
syscall: 156 cycles
PendSV assembly entrance/exit: ~72 cycles (roundtrip)

Custom ISR

I can confirm that I can override a given interrupt via cortex-m-rt's #[interrupt], but only if the interrupt is defined in the app; it does not work if defined in a task or driver. This is actually pretty heartening, at least for my use case! The core computations in the ISR for FoC are somewhat decoupled from the rest of the system, and any interactions with SPI/FDCAN/UART are effectively async and O(kHz) or less, so a perfect fit for Hubris's task model.
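
For reference, the override that works for me is the standard cortex-m-rt pattern in the app's (kernel) main.rs, roughly like this (a sketch; it assumes the stm32h7 PAC's stm32h735 feature per my H72x/3x port, and the real handler does the actual FoC work):

```rust
// In the app's main.rs (kernel image) -- this does not take effect if placed
// in a task or driver crate, per the above.
use stm32h7::stm32h735::interrupt;

#[interrupt]
fn TIM1_UP() {
    // Latency-critical commutation/FoC work goes here, entirely outside
    // Hubris's notification path. Don't forget to clear the timer's UIF
    // flag before returning, or the ISR will immediately re-enter.
}
```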

@steveklabnik
Contributor

This thread is all very cool and exciting, I don't have much to say other than:

I think this is just my various flash allocations being too tiny, but the nice, helpful "your flash allocation for $TASK is X but is actually Y bytes" message doesn't show up

This is #439.

@steveklabnik
Contributor

Side note: feature request? Detect which task changed and recompile only that + any kernel deps when updating app.toml instead of just checking if the entire file changed. PAC crates take a loooong time to compile...

I would also love to do this; it's just work that hasn't been prioritized yet. We started off aggressively trying to cache, kept running into problems where things that should have triggered a rebuild didn't, and eventually said "you know what, correctness matters more" and ended up doing what we're doing now. It can be better though, you're right, it's just not easy. Someday...

@mkeeter
Collaborator

mkeeter commented May 2, 2022

I have a draft of the perfect hashing up in #519 / the perfect-irq-hashing branch.

You're welcome to rebase onto it and see if it helps!

It will depend on how many interrupts are in your app (i.e. defined in the app.toml and associated with a particular task). I saw a 6% improvement for the IrqControl syscall in an app with 6 IRQs, and a 20% improvement in an app with 12 IRQs.

@cbiffle
Collaborator

cbiffle commented May 2, 2022

@timblakely thanks for doing the parameter sweep of the optimization levels, it seems like 3 really is helping you with latency. That's great. Out of curiosity, is your code in SRAM?

(please ignore the repeated final cyan pulse; my timer event is double-triggering for some weird reason that I haven't sorted out yet)

I feel like I discover this on every application as soon as I turn on external event profiling like this. :-)

the nice, helpful "your flash allocation for $TASK is X but is actually Y bytes" message doesn't show up.

AFAICT that message never shows up. I've proposed hiding the linker messages and printing our own, but at the time folks were concerned about our estimate being unfaithful. I'm no longer concerned about this, fwiw.

Side note: feature request? Detect which task changed and recompile only that + any kernel deps when updating app.toml instead of just checking if the entire file changed. PAC crates take a loooong time to compile...

If we get around to #240 this should go away. The current aggressive cleaning is, well, too aggressive. Ironically, given that the PAC crates are the longest things to compile in most applications, the PAC crates are largely responsible for the need for this -- they like to deposit linker scripts into random places in the build tree without properly generating the rebuild-if-changed data for Cargo to know when it needs to redo work. I think we can fix this in our build system. (I believe we're already ignoring the PAC scripts and using our own, so, should be doable.)

Incidentally, if you make sure to harmonize features across the cortex-m / cortex-m-rt / PAC crates in all your tasks, you can usually get it down to two PAC builds -- one for userland, one for the kernel. (The current minimum is two in practice because all PACs use features to do rt and get a vector table, which we need in the kernel and can't have in userland; if I were benevolent dictator, that'd be a second package instead of conditional compilation.)

@timblakely
Author

timblakely commented May 3, 2022

I would also love to do this, just work that hasn't been prioritized just yet.

To be clear: definitely appreciate the cycles spent on issues like these, but I'm absolutely aware that the priority of Hubris - and Oxide as a whole - is getting products shipped/out the door; feature requests from out of nowhere are super low priority. Again, zero hurry from my end! Just really enjoying Hubris and its ethos so far and want to contribute where/when I can :)

I have a draft of the perfect hashing up... You're welcome to rebase onto it and see if it helps!

Can confirm that it definitely helps. I've got three interrupts defined in my app.toml. For the initial secondary_syscall according to the profiling code above (dark blue trace) I'm seeing ~447 cycles, or about a 5% speedup on the fastest opt-level=2. For the interrupt reset call (sys_irq_control(TIM1_UPD_MASK, true)) I'm seeing an even bigger win, at about 15%. This is great, cheers! FWIW you've got a +1 from me on eventually merging it :D

I feel like I discover this on every application as soon as I turn on external event profiling like this. :-)

Aye, and then it becomes "well crap, now if only I could remember how I fixed this the last half dozen times this happened...?" 😅

AFAICT that message never shows up.

I got it to show up exactly once during my various experiments with Hubris. Unfortunately I'm afraid I can't seem to reproduce the environment that allowed it to happen :( Not an issue at all though, since at least rust-lld gives some kind of a hint as to what was going on with all the section-too-large spam.

Incidentally, if you make sure to harmonize features across the cortex-m / cortex-m-rt / PAC crates in all your tasks, you can usually get it down to two PAC builds

Ah, thanks for the lead. In adding support for the STM32H72x/3x series I've had to add a few additional base and family features. Sounds like I may have missed a Hubris target or three that might implicitly depend on stm32h7/_____ without actually updating them to target stm32h7/stm32h735 (the H735 target enables the entire H72x and H73x series). The actual delta to support the H72x and H73x isn't too bad overall, but I'm happy to maintain it in my own fork to reduce the maintenance surface area Oxide has to support!

Out of curiosity, is your code in SRAM?

I... erm... good point. I assume not...? I'm trying to follow Hubris's safety and memory protection guidelines as closely as possible. The app-h743.toml has execute = false on its outputs.ram with the comment # let's assume XN until proven otherwise, which I figured was a safe bet to follow :) I'm just using the outputs.ram memory space* for the various stacks, which I assume is the default place the linker scripts put the separate kernel/task stacks. While I do have an ITCM core-coupled section available and defined in the app.toml, I now suspect it's not used at all. Last time I checked, the cortex-m framework automagically enabled the ART (pdf), which mitigated most of the differences between executing from flash and from CCM, at least on the G4. Thinking it through, I could put the interrupt handler in fast ITCM, but then calling into the kernel would go back to executing from flash, ya? Or is there a configuration somewhere that can enable the kernel to be loaded into SRAM somewhere and executed from there (which, as I type it, sounds like a Very Bad Idea from a safety standpoint)?

* Is there a Hubris-specific way of annotating "this goes here" in the various generated linker scripts?
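
(For the ITCM experiment above, the only generic, non-Hubris-specific mechanism I know of is a link_section attribute -- a sketch, assuming a matching .itcm output section exists in the linker script and gets copied into ITCM at startup; no idea yet how that interacts with Hubris's generated per-task scripts:)

```rust
// Generic cortex-m approach, not Hubris-specific: pin a hot function into a
// dedicated section. Assumes the linker script defines ".itcm" placed in ITCM
// RAM and that startup code copies the section there.
#[link_section = ".itcm"]
#[inline(never)] // keep it out-of-line so the placement actually matters
fn hot_commutation_step(theta: u32) -> u32 {
    // hot-path math would live here
    theta.wrapping_add(1)
}
```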

Overriding ISRs

Side note: let me know if I should open another bug for this discussion, or if there's a more appropriate place for open-ended design discussions! Just trying to limit the "blast radius" of my completely-not-what-Hubris-was-designed-for ramblings :)

One thing that I'm running into with custom ISR overriding is that AFAICT they have to be defined in app/src/main.rs, as #[interrupt] annotations seem to be ignored/dropped in task and driver targets; makes sense given the distinct linker scripts for tasks and the kernel. That said, it presents a bit of a headache from both organizational and memory-sharing standpoints. Since all overridden ISRs must be defined in the kernel, that means that under normal memory protection assumptions 1) the handler can no longer be refactored into a task, and 2) it's effectively no longer accessible via an Idolatry API.

The former can be hacked around by telling the MPU that there's some arbitrary memory defined in $CHIP.toml that can be listed in the uses = [...] clause in both the main app task (?) and the respective task definitions. Of course this has both the overhead of locking/accessing global mutable statics and the fact that it completely circumvents Hubris's overarching safety guidelines, which seems like A Bad Thing™️.

The latter is a bit more nuanced. It seems like Idolatry servers and APIs are intended to be implemented as tasks, which means that if a custom ISR is configured by an external task - e.g. the PWM duty cycles in a BLDC - at the moment it requires coordination between the Idolatry server and the relevant overridden ISR. I don't have a good mental model of how to coordinate between ISRs that are required to be defined in app/src/main.rs and (task|drv)/src/main.rs tasks, at least within the context of Hubris's memory model.

@timblakely
Author

Thanks again for the discussion and pointers! The PRs already helped quite a bit, and since any additional optimization will probably be implementation-dependent I'm going to close this.

That said, one last question: I know Oxide is hard at work and neck-deep in getting the racks out the door and I definitely don't want to add any unnecessary noise. Is there a better place to put more open-ended discussions on longer-term Hubris ideas/feature requests that are likely outside the immediate Oxide dev scope?

@cbiffle
Collaborator

cbiffle commented May 20, 2022

Is there a better place to put more open-ended discussions on longer-term Hubris ideas/feature requests that are likely outside the immediate Oxide dev scope?

Missed this before. Please feel free to continue filing issues here: (1) we don't really have a better option, and (2) I'm interested to hear user reports even if I can't immediately act on them.

Something else has occurred to me, which you might want to experiment with. There's this big comment at the top of the arm_m support talking about the use of PendSV. One of the things it alludes to, but doesn't flat-out state, is that our use of PendSV here will likely hurt interrupt response time. It's hard to say by how much, off the top of my head -- thanks to tail-chaining on the fancier M-series cores (like the one you're using) it doesn't mean we have an entire extra exception entry-exit sequence. However, both of the ISRs are going to do some save/restore because they're in Rust and save callee-save registers. So, it would be at least a few cycles cheaper to modify DefaultHandler to do the full context save/restore sequence, instead of poking PENDSVSET and returning.

With the timescales you're seeing, a few cycles might not be very interesting, but, I wanted to mention it.
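
For concreteness, "poking PENDSVSET" boils down to this (a sketch, not the literal kernel code; the real DefaultHandler also figures out which IRQ fired and posts the notification first):

```rust
// The IRQ handler defers the actual context switch by setting bit 28
// (PENDSVSET) of SCB->ICSR; PendSV then tail-chains in after the handler
// returns and does the save/restore there.
const SCB_ICSR: *mut u32 = 0xE000_ED04 as *mut u32;
const ICSR_PENDSVSET: u32 = 1 << 28;

#[inline(always)]
unsafe fn pend_context_switch() {
    core::ptr::write_volatile(SCB_ICSR, ICSR_PENDSVSET);
}
```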

@timblakely
Author

timblakely commented May 21, 2022

Interesting. From the code comment it sounds like the only interrupt that isn't very likely to switch contexts is the SysTick one. Combined with the recent updates that make task IRQ lookup O(1), it could very well pay dividends to do that inside of DefaultHandler instead of kicking the can down to PendSV. But as you said, for my specific use case I'm pretty sure I can wire enough of the latency-critical stuff up in hardware that any further optimization would be premature. Appreciate the pointer though; good to know of potential optimization targets if/when we eventually need 'em! 👍

...thanks to tail-chaining on the fancier M-series cores (like the one you're using) it doesn't mean we have an entire extra exception entry-exit sequence.

Ironically I came across that here and there in the various Cortex-M NVIC app notes, but never really gave it much thought until I saw your shock absorber implementation in m4vga-rs. That's a pretty slick little trick, that is!

@cbiffle
Collaborator

cbiffle commented May 26, 2022

Heh, thanks! There's a discussion of it in prose here in case that's useful.

I've merged a collection of kernel cleanups that, among other things, ought to further reduce interrupt/syscall latency. They're really intended for ARMv6-M, but they should show some improvement on your fancier M7. I'm not sure how much of a kernel diff you're carrying, but this might be worth looking at. (Also it turns out there was a subtle bug in the kernel entry asm -- fixed now.)
