Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IRQ double-fires, but only when handled by Hubris #536

Open
timblakely opened this issue May 9, 2022 · 7 comments
Open

IRQ double-fires, but only when handled by Hubris #536

timblakely opened this issue May 9, 2022 · 7 comments
Milestone

Comments

@timblakely
Copy link

timblakely commented May 9, 2022

Hey all! I'm seeing some odd behavior when servicing interrupts in Hubris: I'm seeing interrupts triggering twice, but only when handled through hubris's DefaultHandler mechanism. I know this smacks of the usual "You just returned too fast and forgot to allow the IRQ acknowledge to go through the bus", but hear me out 😅

Custom ISR: single-firing

For a baseline test, I override the TIM_CC interrupt (technically TIM1_CC, but the SVD/PAC seems to have a typo) and trigger at the very center of the PWM signal. When in the interrupt, I set a GPIO pin, nop for a bit, acknowledge the pending interrupt in the timer, ensure it's propagated, then drop the GPIO pin.

#[interrupt]
fn TIM_CC() {
    let gpiof = unsafe { &*device::GPIOF::ptr() };
    gpiof.bsrr.write(|w| w.bs11().set_bit());
    for _ in 0..100 {
        cortex_m::asm::nop();
    }
    let timer = unsafe { &*device::TIM1::ptr() };
    timer.sr.write(|w| w.cc4if().clear());
    cortex_m::asm::dmb(); // Overkill; triple-checking to be sure that the IRQ
    cortex_m::asm::dsb(); // acknowledge went through. Reading back the 
    cortex_m::asm::isb(); // register below should be sufficient.
    while timer.sr.read().cc4if().bit_is_set() {}
    gpiof.bsrr.write(|w| w.br11().set_bit());
}

image

Cyan: PWM signal
Blue: GPIO toggling inside ISR

  1. Interrupt trigger location
  2. GPIO high during ISR

Seems to work as expected.

Hubris notification: double-firing

Now if I take the exact same code and put it in an Idolatry handle_notificaiton, I'm seeing the interrupt fire a second time between the end of the syscall_entry and when the sys_irq_control call returns.

const TIM1_CC_MASK: u32 = 1 << 2;

impl NotificationHandler for ServerImpl {
    fn current_notification_mask(&self) -> u32 {
        TIM1_CC_MASK
    }

    fn handle_notification(&mut self, bits: u32) {
        let gpiof = unsafe { &*device::GPIOF::ptr() };
        gpiof.bsrr.write(|w| w.bs11().set_bit());
        for _ in 0..100 {
            cortex_m::asm::nop();
        }
        let timer = unsafe { &*device::TIM1::ptr() };
        timer.sr.write(|w| w.cc4if().clear());
        cortex_m::asm::dmb();
        cortex_m::asm::dsb();
        cortex_m::asm::isb();
        while timer.sr.read().cc4if().bit_is_set() {}
        sys_irq_control(TIM1_CC_MASK, true);  // <--- only new line
        gpiof.bsrr.write(|w| w.br11().set_bit());
    }
}

image

Cyan: PWM signal
Blue: GPIO toggling inside ISR
Magenta: event_isr_enter from the new kernel profiling code from last week (enter/exit on DefaultHandler)
Yellow: event_syscall_enter from the new kernel profiling code (enter/exit on syscall_entry)

A bit busy of an image, sorry. Here's the sequence of events:

  1. Timer IRQ is triggered
  2. It's handled by Hubris' DefaultHandler (magenta)
  3. The Hubris notification has reached handle_notification, and a GPIO pin is toggled high (blue)
  4. A call to sys_irq_control(TIM1_CC_MASK, true) to handle the irq_control sycall (yellow)
  5. A second timer IRQ is fired. Note that this happens prior to the control path returning from the sys_irq_control call
  6. The sys_irq_control call returns, the GPIO pin is lowered, and the original Hubris notification call to handle_notification is complete.
  7. A second Hubris notification is triggered, driving the GPIO pin high again in handle_notification
  8. The call to sys_irq_control to re-enable the notification handler. Note that this does not cause a third interrupt to fire.
  9. Post-notification syscall (not sure what, but I assume it's the sys_recv being called again within the Idolatry server's handle_n.

Thoughts

  • I'm running this in an Idolatry server and haven't gotten a chance to test it out via raw syscalls, but I can't see anything special within dispatch_n that would make me think the Idolatry framework would be the cause. I can also confirm that the second interrupt's IPSR is the same IRQ number as the first (26).

  • I also tried poking the corresponding NVIC[ICPR] to manually acknowledge the IRQ at the NVIC level but that caused a memory fault since handle_notification is in a task and thus unprivileged code (A Good Thing:tm:).

  • I double-checked that I acknowledged the timer's ISR bit by attempting to panic if it was still set just prior to re-enabling the notification:

          ...
          while timer.sr.read().cc4if().bit_is_set() {}
          cortex_m::asm::dmb();
          cortex_m::asm::dsb();
          cortex_m::asm::isb();
          if timer.sr.read().cc4if().bit_is_set() {
              panic!("It still was set :C");
          }
          sys_irq_control(TIM1_CC_MASK, true);

    No panic :/

  • Both the syscall and the second interrupt occur before sys_irq_control has returned to the original task's code, so not sure what else I could/should be doing to prevent it.

  • AFAICT the #[interrupt] proc macro from cortex_m_rt doesn't do anything fancy aside from defining an $IRQ_trampoline

  • Not sure what the difference is between the first and second, but when the second iteration of handle_notification calls sys_irq_control the interrupt is nowhere to be found after the syscall.

There's still a good chance this is a case of PEBKAC, but considering that the same code block in a bare IRQ handler does not fire twice I'm at a loss as to why the Hubris notification would be coming through twice. Any ideas what may be happening here?

@timblakely
Copy link
Author

Also of note: if I remove the sys_irq_control(TIM1_CC_MASK, true) call the DefaultHandler is no longer double called (it's called exactly once after the task starts since we don't re-enable it).

@timblakely
Copy link
Author

timblakely commented May 10, 2022

Okay, finally tracked this down!

TL;DR: Hubris needs a way to disable hardware-triggered interrupts from within DefaultHandler.

Finding the source of the double-trigger

The smoking gun turned out to be the ISPR register. On the very first time the interrupt fires, Hubris goes directly into DefaultHandler and everything looks good. ISPR is 0 - no pending IRQs, which is fine since we're servicing one - and IPSR[IRQ_ACTIVE] point at 27, which is the correct IRQ we're servicing. Inside of DefaultHandler it checks if any task requires the interrupt, notices one that does, and does a context switch from the current idle task to my Idolatry task that's waiting on that notification. During that context switch via pend_context_switch_from_isr, it sets the PendSV interrupt to immediately be called, which it dutifully does once DefaultHandler returns. But suddenly in PendSV, ISPR shows that the original interrupt is still* pending?

Now here's the trick: since this was a hardware triggered interrupt from TIM1, there are two things that need to happen by the end of the interrupt:

  1. The IRQ needs to be disabled in the NVIC, which DefaultHandler does via disable_irq every time
  2. The timer's interrupt acknowledge bit needs to be flipped to stop the timer from generating interrupts from whatever source was causing it (update, capture/compare, etc)

The former happens regardless of the source of the interrupt. However, the second one can only happen from within task code, which is only switched to outside of the originating interrupt. This is important: it's critical to clear the timer's IRQ flag inside of the interrupt, or the timer's interrupt immediately tries to be serviced again.

That's where the * above comes in: the initial interrupt was never "pending" in the first place since it was happily being serviced in DefaultHandler. But when the request to switch to PendSV was triggered, the timer hardware immediately tried to jump back into interrupt, but was forced to wait while the PendSV call was made. Instead the timer interrupt interrupt went "pending" due to being lower priority than the kernel, waiting for the chance to be serviced. After a chain of a few Kernel>Task branches, we eventually hit the handle_notification in the appropriate task, which dutifully proceeds to shut off the timer's interrupt via timer.sr.write(|w| w.cc4if().clear());. But the interrupt was already pending and the task lacks the privileges to poke the unpend bits of NVIC, so the interrupt fires a second time.

Workaround

Ironically #518 provided a rather clean workaround:

fn isr_enter() {
    let timer = unsafe { &*device::TIM1::ptr() };
    timer.sr.write(|w| w.cc4if().clear());
}

Since crate::profiling::event_isr_enter(); is quite literally the first line of DefaultHandler, that allows me to acknowledge the timer's interrupt inside of the original interrupt context. Once it's acknowledged, DefaultHandler continues and returns, no longer causing the original interrupt to immediately go pending.

Of course this happens every time an interrupt fires, but at least for my use case that's acceptable (it's the only interrupt and everything else is coordinated in hardware).

Performance implications

This also explains the double-firing I was seeing when discussing #517. It turns out that the 300+ cycle sys_irq_control call was actually a ~150 cycle irq_control syscall and a second timer interrupt being fired, handled in the remaining cycle overhead. Once that second "phantom" IRQ trigger is gone the overhead is <200 cycles overall, which is pretty negligible.

Permanent solution?

While the workaround above is a surprisingly easy solution in my use case, I don't have a sense of what the "right" solution might be from a security/safety standpoint. The profiling code is strictly designed for debugging; calling into user code directly from the kernel is not only not safe but also allows entering user code in a privileged context, which obviously can wreck all sorts of havoc.

@cbiffle
Copy link
Collaborator

cbiffle commented May 10, 2022

Hi! I'm currently in the field bringing up some hardware, but, wanted to leave some thoughts with the available part of my brain.

This is a broader problem with peripherals that generate level-triggered internal interrupt signals. The way the ARM NVIC works, peripherals don't get an explicit interrupt acknowledge signal, so every interrupt source effectively has a different way of acknowledging it.

IIRC the NVIC does not set the Pending bit in response to an IRQ while that IRQ's handler is active, but will set it as soon as the handler is not active if the IRQ condition has not been cleared. Because we disable the interrupt in it (default) handler, the Pending state will not immediately percolate into Active state until we re-enable the interrupt.

I've thought about three alternative approaches to this -- there may be others.

  1. When re-enabling an interrupt in sys_irq_control, also clear its pending flag. I've done this in systems in the past. I have a vague concern that this could cause you to miss interrupts.
  2. Allow custom ISRs, like the overridden one you wrote for TIM1, to call into the default ISR behavior before/after performing some action. So you could poke the timer to clear the interrupt condition and then cause the interrupt to be handled by the default handler. This option is the most general, but would also require the most boilerplate in an application.
  3. Provide a clear-pending syscall (possibly as an argument to sys_irq_control). If you're confident that you want to turn an IRQ back on and it's OK to have potentially missed intervening IRQs -- in this case, because you can't distinguish them from the IRQ being continuously asserted -- you could use this to prevent surprise double interrupts. What you'd probably do in your case is to clear CC4IF and then use this clear-pending operation. My main concern about this is that, unlike code written in a (properly prioritized) ISR, this is task code that could run much later than you'd expect and miss actual interrupt sources. But if tasks are properly prioritized, this is probably not a risk.

@timblakely
Copy link
Author

timblakely commented May 11, 2022

Thanks! Again, absolutely zero hurry from my end. Congrats on the hardware delivery! :D

IIRC the NVIC does not set the Pending bit in response to an IRQ while that IRQ's handler is active, but will set it as soon as the handler is not active if the IRQ condition has not been cleared. Because we disable the interrupt in it (default) handler, the Pending state will not immediately percolate into Active state until we re-enable the interrupt.

Yup, that exactly jives with what I'm seeing. The interrupt would immediately try to trigger again, but since Hubris sets all IRQs at a lower priority than PendSV and SVCall, it goes pending when DefaultHandler sets PENDSVSET.

I've thought about three alternative approaches to this -- there may be others.

All these sound reasonable approaches, and all would work for my use case, though it always makes me itchy when the kernel calls into "arbitrary user code", especially in a privileged context like an ISR.

Could there be some way to define how to acknowledge the additional peripheral IRQ from app.toml somehow, considering that's where they're defined...?

@timblakely
Copy link
Author

timblakely commented May 17, 2022

Regarding 2 above, luckily since DefaultHandler is pub it appears this can be as simple as:

#[interrupt]
unsafe fn TIM_CC() {
    // Clear the hardware timer directly.
    let timer = &*device::TIM1::ptr();
    timer.sr.write(|w| w.cc4if().clear());
    // Call into Hubris' default IRQ handler and let it take care of the rest.
    kern::arch::DefaultHandler();
}

While this is inherently unsafe and it would be nice if there were a low(no? :)-overhead way of getting Hubris to do this automagically, this doesn't seem to add too much boilerplate and only a single small unsafe fn. Of course this may be architecture-dependent given we're reaching into kern::arch, so not sure how much this hurts portability...

@cbiffle
Copy link
Collaborator

cbiffle commented May 20, 2022

...huh. I didn't realize DefaultHandler is pub. So, you can do 2 today. That's nice.

so not sure how much this hurts portability...

I have yet to see a portable ISR. :-) I don't think this hurts your portability in the least -- if you wanted to make the jump to (say) a RISC-V core, or an ARMv7-A core, you'd be dealing with different timers and a different interrupt handler model regardless.

@timblakely
Copy link
Author

Hah, that's... a really good point. With that in mind, seems like calling kern::arch::DefaultHandler "manually" like this actually allows for a pretty flexible-but-low-overhead way of flipping these hardware bits. It leaves it up to the user whether the bit is acknowledged before or after the DefaultHandler processing, and since it's done in the originally triggered interrupt AFAICT there's no chance of missing an interrupt like the other potential workarounds.

The only remaining quirk is that it requires these #[interrupt]s to be defined in the kernel. I think this comes down to a combination of 1) the way that cortex-m-rt sets up the VECTOR_TABLE in its linker script, and 2) the fact that compilation is completely separate per-task basis so any interrupt handlers defined in tasks don't get fed to the kernel linking process. Not a deal breaker really and actually somewhat of A Good Thing ™️ in that it keeps privileged contexts/code limited to the kernel crate. Of course it does mean that you can't have a fully-self-contained/update-the-app.toml-and-go Task crate that uses any level-triggered hardware since the ISR to clear it would have to be defined in the kernel. Would be nice to be able to define them in the various task crates, but that would require some seemingly invasive surgery to build/xtask/src/dist.rs; not sure if it's worth the effort for now.

All that is to say calling into kern::arch::DefaultHandler() actually seems like it'd work flawlessly for my needs, and seems flexible enough to work for quite a few other use cases as well! :D Feel free to close if you'd want to track this in a bug that's more narrow/focused in scope.

@cbiffle cbiffle added this to the Unscheduled milestone Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants