IRQ double-fires, but only when handled by Hubris #536

timblakely · 2022-05-09T01:54:52Z

Hey all! I'm seeing some odd behavior when servicing interrupts in Hubris: I'm seeing interrupts triggering twice, but only when handled through hubris's DefaultHandler mechanism. I know this smacks of the usual "You just returned too fast and forgot to allow the IRQ acknowledge to go through the bus", but hear me out 😅

Custom ISR: single-firing

For a baseline test, I override the TIM_CC interrupt (technically TIM1_CC, but the SVD/PAC seems to have a typo) and trigger at the very center of the PWM signal. When in the interrupt, I set a GPIO pin, nop for a bit, acknowledge the pending interrupt in the timer, ensure it's propagated, then drop the GPIO pin.

#[interrupt]
fn TIM_CC() {
    let gpiof = unsafe { &*device::GPIOF::ptr() };
    gpiof.bsrr.write(|w| w.bs11().set_bit());
    for _ in 0..100 {
        cortex_m::asm::nop();
    }
    let timer = unsafe { &*device::TIM1::ptr() };
    timer.sr.write(|w| w.cc4if().clear());
    cortex_m::asm::dmb(); // Overkill; triple-checking to be sure that the IRQ
    cortex_m::asm::dsb(); // acknowledge went through. Reading back the 
    cortex_m::asm::isb(); // register below should be sufficient.
    while timer.sr.read().cc4if().bit_is_set() {}
    gpiof.bsrr.write(|w| w.br11().set_bit());
}

Cyan: PWM signal
Blue: GPIO toggling inside ISR

Interrupt trigger location
GPIO high during ISR

Seems to work as expected.

Hubris `notification`: double-firing

Now if I take the exact same code and put it in an Idolatry handle_notificaiton, I'm seeing the interrupt fire a second time between the end of the syscall_entry and when the sys_irq_control call returns.

const TIM1_CC_MASK: u32 = 1 << 2;

impl NotificationHandler for ServerImpl {
    fn current_notification_mask(&self) -> u32 {
        TIM1_CC_MASK
    }

    fn handle_notification(&mut self, bits: u32) {
        let gpiof = unsafe { &*device::GPIOF::ptr() };
        gpiof.bsrr.write(|w| w.bs11().set_bit());
        for _ in 0..100 {
            cortex_m::asm::nop();
        }
        let timer = unsafe { &*device::TIM1::ptr() };
        timer.sr.write(|w| w.cc4if().clear());
        cortex_m::asm::dmb();
        cortex_m::asm::dsb();
        cortex_m::asm::isb();
        while timer.sr.read().cc4if().bit_is_set() {}
        sys_irq_control(TIM1_CC_MASK, true);  // <--- only new line
        gpiof.bsrr.write(|w| w.br11().set_bit());
    }
}

Cyan: PWM signal
Blue: GPIO toggling inside ISR
Magenta: event_isr_enter from the new kernel profiling code from last week (enter/exit on DefaultHandler)
Yellow: event_syscall_enter from the new kernel profiling code (enter/exit on syscall_entry)

A bit busy of an image, sorry. Here's the sequence of events:

Timer IRQ is triggered
It's handled by Hubris' DefaultHandler (magenta)
The Hubris notification has reached handle_notification, and a GPIO pin is toggled high (blue)
A call to sys_irq_control(TIM1_CC_MASK, true) to handle the irq_control sycall (yellow)
A second timer IRQ is fired. Note that this happens prior to the control path returning from the sys_irq_control call
The sys_irq_control call returns, the GPIO pin is lowered, and the original Hubris notification call to handle_notification is complete.
A second Hubris notification is triggered, driving the GPIO pin high again in handle_notification
The call to sys_irq_control to re-enable the notification handler. Note that this does not cause a third interrupt to fire.
Post-notification syscall (not sure what, but I assume it's the sys_recv being called again within the Idolatry server's handle_n.

Thoughts

I'm running this in an Idolatry server and haven't gotten a chance to test it out via raw syscalls, but I can't see anything special within dispatch_n that would make me think the Idolatry framework would be the cause. I can also confirm that the second interrupt's IPSR is the same IRQ number as the first (26).
I also tried poking the corresponding NVIC[ICPR] to manually acknowledge the IRQ at the NVIC level but that caused a memory fault since handle_notification is in a task and thus unprivileged code (A Good Thing:tm:).

I double-checked that I acknowledged the timer's ISR bit by attempting to panic if it was still set just prior to re-enabling the notification:

      ...
      while timer.sr.read().cc4if().bit_is_set() {}
      cortex_m::asm::dmb();
      cortex_m::asm::dsb();
      cortex_m::asm::isb();
      if timer.sr.read().cc4if().bit_is_set() {
          panic!("It still was set :C");
      }
      sys_irq_control(TIM1_CC_MASK, true);

No panic :/

Both the syscall and the second interrupt occur before sys_irq_control has returned to the original task's code, so not sure what else I could/should be doing to prevent it.
AFAICT the #[interrupt] proc macro from cortex_m_rt doesn't do anything fancy aside from defining an $IRQ_trampoline
Not sure what the difference is between the first and second, but when the second iteration of handle_notification calls sys_irq_control the interrupt is nowhere to be found after the syscall.

There's still a good chance this is a case of PEBKAC, but considering that the same code block in a bare IRQ handler does not fire twice I'm at a loss as to why the Hubris notification would be coming through twice. Any ideas what may be happening here?

The text was updated successfully, but these errors were encountered:

timblakely · 2022-05-09T20:42:31Z

Also of note: if I remove the sys_irq_control(TIM1_CC_MASK, true) call the DefaultHandler is no longer double called (it's called exactly once after the task starts since we don't re-enable it).

timblakely · 2022-05-10T05:55:26Z

Okay, finally tracked this down!

TL;DR: Hubris needs a way to disable hardware-triggered interrupts from within DefaultHandler.

Finding the source of the double-trigger

The smoking gun turned out to be the ISPR register. On the very first time the interrupt fires, Hubris goes directly into DefaultHandler and everything looks good. ISPR is 0 - no pending IRQs, which is fine since we're servicing one - and IPSR[IRQ_ACTIVE] point at 27, which is the correct IRQ we're servicing. Inside of DefaultHandler it checks if any task requires the interrupt, notices one that does, and does a context switch from the current idle task to my Idolatry task that's waiting on that notification. During that context switch via pend_context_switch_from_isr, it sets the PendSV interrupt to immediately be called, which it dutifully does once DefaultHandler returns. But suddenly in PendSV, ISPR shows that the original interrupt is still* pending?

Now here's the trick: since this was a hardware triggered interrupt from TIM1, there are two things that need to happen by the end of the interrupt:

The IRQ needs to be disabled in the NVIC, which DefaultHandler does via disable_irq every time
The timer's interrupt acknowledge bit needs to be flipped to stop the timer from generating interrupts from whatever source was causing it (update, capture/compare, etc)

The former happens regardless of the source of the interrupt. However, the second one can only happen from within task code, which is only switched to outside of the originating interrupt. This is important: it's critical to clear the timer's IRQ flag inside of the interrupt, or the timer's interrupt immediately tries to be serviced again.

That's where the * above comes in: the initial interrupt was never "pending" in the first place since it was happily being serviced in DefaultHandler. But when the request to switch to PendSV was triggered, the timer hardware immediately tried to jump back into interrupt, but was forced to wait while the PendSV call was made. Instead the timer interrupt interrupt went "pending" due to being lower priority than the kernel, waiting for the chance to be serviced. After a chain of a few Kernel>Task branches, we eventually hit the handle_notification in the appropriate task, which dutifully proceeds to shut off the timer's interrupt via timer.sr.write(|w| w.cc4if().clear());. But the interrupt was already pending and the task lacks the privileges to poke the unpend bits of NVIC, so the interrupt fires a second time.

Workaround

Ironically #518 provided a rather clean workaround:

fn isr_enter() {
    let timer = unsafe { &*device::TIM1::ptr() };
    timer.sr.write(|w| w.cc4if().clear());
}

Since crate::profiling::event_isr_enter(); is quite literally the first line of DefaultHandler, that allows me to acknowledge the timer's interrupt inside of the original interrupt context. Once it's acknowledged, DefaultHandler continues and returns, no longer causing the original interrupt to immediately go pending.

Of course this happens every time an interrupt fires, but at least for my use case that's acceptable (it's the only interrupt and everything else is coordinated in hardware).

Performance implications

This also explains the double-firing I was seeing when discussing #517. It turns out that the 300+ cycle sys_irq_control call was actually a ~150 cycle irq_control syscall and a second timer interrupt being fired, handled in the remaining cycle overhead. Once that second "phantom" IRQ trigger is gone the overhead is <200 cycles overall, which is pretty negligible.

Permanent solution?

While the workaround above is a surprisingly easy solution in my use case, I don't have a sense of what the "right" solution might be from a security/safety standpoint. The profiling code is strictly designed for debugging; calling into user code directly from the kernel is not only not safe but also allows entering user code in a privileged context, which obviously can wreck all sorts of havoc.

cbiffle · 2022-05-10T15:38:03Z

Hi! I'm currently in the field bringing up some hardware, but, wanted to leave some thoughts with the available part of my brain.

This is a broader problem with peripherals that generate level-triggered internal interrupt signals. The way the ARM NVIC works, peripherals don't get an explicit interrupt acknowledge signal, so every interrupt source effectively has a different way of acknowledging it.

IIRC the NVIC does not set the Pending bit in response to an IRQ while that IRQ's handler is active, but will set it as soon as the handler is not active if the IRQ condition has not been cleared. Because we disable the interrupt in it (default) handler, the Pending state will not immediately percolate into Active state until we re-enable the interrupt.

I've thought about three alternative approaches to this -- there may be others.

When re-enabling an interrupt in sys_irq_control, also clear its pending flag. I've done this in systems in the past. I have a vague concern that this could cause you to miss interrupts.
Allow custom ISRs, like the overridden one you wrote for TIM1, to call into the default ISR behavior before/after performing some action. So you could poke the timer to clear the interrupt condition and then cause the interrupt to be handled by the default handler. This option is the most general, but would also require the most boilerplate in an application.
Provide a clear-pending syscall (possibly as an argument to sys_irq_control). If you're confident that you want to turn an IRQ back on and it's OK to have potentially missed intervening IRQs -- in this case, because you can't distinguish them from the IRQ being continuously asserted -- you could use this to prevent surprise double interrupts. What you'd probably do in your case is to clear CC4IF and then use this clear-pending operation. My main concern about this is that, unlike code written in a (properly prioritized) ISR, this is task code that could run much later than you'd expect and miss actual interrupt sources. But if tasks are properly prioritized, this is probably not a risk.

timblakely · 2022-05-11T05:43:52Z

Thanks! Again, absolutely zero hurry from my end. Congrats on the hardware delivery! :D

IIRC the NVIC does not set the Pending bit in response to an IRQ while that IRQ's handler is active, but will set it as soon as the handler is not active if the IRQ condition has not been cleared. Because we disable the interrupt in it (default) handler, the Pending state will not immediately percolate into Active state until we re-enable the interrupt.

Yup, that exactly jives with what I'm seeing. The interrupt would immediately try to trigger again, but since Hubris sets all IRQs at a lower priority than PendSV and SVCall, it goes pending when DefaultHandler sets PENDSVSET.

I've thought about three alternative approaches to this -- there may be others.

All these sound reasonable approaches, and all would work for my use case, though it always makes me itchy when the kernel calls into "arbitrary user code", especially in a privileged context like an ISR.

Could there be some way to define how to acknowledge the additional peripheral IRQ from app.toml somehow, considering that's where they're defined...?

timblakely · 2022-05-17T21:55:43Z

Regarding 2 above, luckily since DefaultHandler is pub it appears this can be as simple as:

#[interrupt]
unsafe fn TIM_CC() {
    // Clear the hardware timer directly.
    let timer = &*device::TIM1::ptr();
    timer.sr.write(|w| w.cc4if().clear());
    // Call into Hubris' default IRQ handler and let it take care of the rest.
    kern::arch::DefaultHandler();
}

While this is inherently unsafe and it would be nice if there were a low(no? :)-overhead way of getting Hubris to do this automagically, this doesn't seem to add too much boilerplate and only a single small unsafe fn. Of course this may be architecture-dependent given we're reaching into kern::arch, so not sure how much this hurts portability...

cbiffle · 2022-05-20T22:15:31Z

...huh. I didn't realize DefaultHandler is pub. So, you can do 2 today. That's nice.

so not sure how much this hurts portability...

I have yet to see a portable ISR. :-) I don't think this hurts your portability in the least -- if you wanted to make the jump to (say) a RISC-V core, or an ARMv7-A core, you'd be dealing with different timers and a different interrupt handler model regardless.

timblakely · 2022-05-21T06:01:03Z

Hah, that's... a really good point. With that in mind, seems like calling kern::arch::DefaultHandler "manually" like this actually allows for a pretty flexible-but-low-overhead way of flipping these hardware bits. It leaves it up to the user whether the bit is acknowledged before or after the DefaultHandler processing, and since it's done in the originally triggered interrupt AFAICT there's no chance of missing an interrupt like the other potential workarounds.

The only remaining quirk is that it requires these #[interrupt]s to be defined in the kernel. I think this comes down to a combination of 1) the way that cortex-m-rt sets up the VECTOR_TABLE in its linker script, and 2) the fact that compilation is completely separate per-task basis so any interrupt handlers defined in tasks don't get fed to the kernel linking process. Not a deal breaker really and actually somewhat of A Good Thing ™️ in that it keeps privileged contexts/code limited to the kernel crate. Of course it does mean that you can't have a fully-self-contained/update-the-app.toml-and-go Task crate that uses any level-triggered hardware since the ISR to clear it would have to be defined in the kernel. Would be nice to be able to define them in the various task crates, but that would require some seemingly invasive surgery to build/xtask/src/dist.rs; not sure if it's worth the effort for now.

All that is to say calling into kern::arch::DefaultHandler() actually seems like it'd work flawlessly for my needs, and seems flexible enough to work for quite a few other use cases as well! :D Feel free to close if you'd want to track this in a bug that's more narrow/focused in scope.

cbiffle added this to the Unscheduled milestone Apr 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IRQ double-fires, but only when handled by Hubris #536

IRQ double-fires, but only when handled by Hubris #536

timblakely commented May 9, 2022 •

edited

Loading

timblakely commented May 9, 2022

timblakely commented May 10, 2022 •

edited

Loading

cbiffle commented May 10, 2022

timblakely commented May 11, 2022 •

edited

Loading

timblakely commented May 17, 2022 •

edited

Loading

cbiffle commented May 20, 2022

timblakely commented May 21, 2022

IRQ double-fires, but only when handled by Hubris #536

IRQ double-fires, but only when handled by Hubris #536

Comments

timblakely commented May 9, 2022 • edited Loading

Custom ISR: single-firing

Hubris notification: double-firing

Thoughts

timblakely commented May 9, 2022

timblakely commented May 10, 2022 • edited Loading

Finding the source of the double-trigger

Workaround

Performance implications

Permanent solution?

cbiffle commented May 10, 2022

timblakely commented May 11, 2022 • edited Loading

timblakely commented May 17, 2022 • edited Loading

cbiffle commented May 20, 2022

timblakely commented May 21, 2022

timblakely commented May 9, 2022 •

edited

Loading

Hubris `notification`: double-firing

timblakely commented May 10, 2022 •

edited

Loading

timblakely commented May 11, 2022 •

edited

Loading

timblakely commented May 17, 2022 •

edited

Loading