(cortex-m) unexpected kernel panic after thread exit #20812

Sanderhuisman · 2024-08-14T06:26:37Z

Description

I've noticed an unexpected kernel panic after a thread exited. I've traced it down to sched_switch in core/sched.c retrieving an invalid active thread. Inside sched_task_exit, the sched_active_thread pointer is being set to NULL and sched_switch does not check if the retrieved thread pointer points to a valid thread.

Steps to reproduce the issue

I have created a small application that triggers the problem (tested on an STM32 NUCLEO-F401RE, problem initially seen on an EFR32). It uses the shell module as the problem is triggered when the scheduler is invoked. After starting the application, enter a character in the console/terminal to invoke the scheduler.

In core/sched.c I've added an assertion to make the problem reproducible (without it, things sometimes happen to work).

void sched_switch(uint16_t other_prio)
{
    thread_t *active_thread;
    uint16_t current_prio;
    int on_runqueue;

    active_thread = thread_get_active();
    assert(active_thread != NULL);

    current_prio = active_thread->priority;
    on_runqueue = (active_thread->status >= STATUS_ON_RUNQUEUE);

    DEBUG("sched_switch: active pid=%" PRIkernel_pid " prio=%" PRIu16 " on_runqueue=%i "
          ", other_prio=%" PRIu16 "\n",
          active_thread->pid, current_prio, on_runqueue,
          other_prio);

main.c

#include <stdint.h>
#include <stdio.h>

#include "shell.h"
#include "thread.h"

char second_thread_stack[THREAD_STACKSIZE_MAIN];

static const shell_command_t shell_commands[] = {
  {NULL, NULL, NULL},
};

void *second_thread(void *arg)
{
    (void) arg;

    puts("2nd: starting");

    puts("2nd: exiting");
    puts("Any character entered in the shell should now trigger the panic.");

    return NULL;
}

int main(void)
{
    int result = 0;

    puts("main: starting");

    kernel_pid_t second_pid = thread_create(
      second_thread_stack,
      sizeof(second_thread_stack),
      THREAD_PRIORITY_MAIN - 1,
      THREAD_CREATE_WOUT_YIELD,
      second_thread,
      NULL,
      "nr2");
    /* thread_create() reports failure with a negative value */
    if (second_pid < 0)
    {
        puts("main: Error creating 2nd thread.");
        result = -1;
    }

    if (result == 0)
    {
        char line_buf[SHELL_DEFAULT_BUFSIZE];
        shell_run(shell_commands, line_buf, SHELL_DEFAULT_BUFSIZE);
    }

    return result;
}

Expected results

After accessing the console, I would expect the system to stay alive ;)

Actual results

After pressing Enter in the console, I get the following panic and stack trace.

> 2nd: starting
2nd: exiting
core/sched.c:288 => *** RIOT kernel panic:
FAILED ASSERTION.


ISR stack overflowed
Stack pointer corrupted, reset to top of stack
active thread: 2
FSR/FAR:
 CFSR: 0x00008200
 HFSR: 0x40000000
 DFSR: 0x00000008
 AFSR: 0x00000000
 BFAR: 0xffffffff
Misc
EXC_RET: 0xfffffff1
Inside isr -13

Potential Fix

I've changed sched_switch to include a check for active thread being valid to deal with threads having exited.

void sched_switch(uint16_t other_prio)
{
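    thread_t *active_thread = thread_get_active();
    /* Read the thread's fields only if there is an active thread. */
    uint16_t current_prio = (active_thread != NULL) ? active_thread->priority : 0;
    int on_runqueue = (active_thread != NULL) &&
                      (active_thread->status >= STATUS_ON_RUNQUEUE);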

    DEBUG("sched_switch: active pid=%" PRIkernel_pid " prio=%" PRIu16 " on_runqueue=%i "
        ", other_prio=%" PRIu16 "\n",
        active_thread != NULL ? active_thread->pid : KERNEL_PID_UNDEF,
        current_prio,
        on_runqueue,
        other_prio);

    if ((active_thread == NULL) || !on_runqueue || (current_prio > other_prio)) {
        if (irq_is_in()) {

I don't know whether sched_switch must be able to handle this case or whether sched_task_exit shouldn't set sched_active_thread to NULL. The documentation comment on thread_get_active suggests the former. In that case, we need to check whether other functions also fail to handle a NULL active thread, and potentially add assertions to help find such cases in the future.

/**
 * @brief   Returns a pointer to the Thread Control Block of the currently
 *          running thread
 *
 * @return  Pointer to the TCB of the currently running thread, or `NULL` if
 *          no thread is running
 */
static inline thread_t *thread_get_active(void)
....
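
As a host-side illustration of the check discussed above, here is a minimal sketch of the switch decision using mock types. `mock_thread_t`, `MOCK_STATUS_ON_RUNQUEUE`, and `mock_should_switch()` are hypothetical names made up for this example and are not part of RIOT. The sketch assumes that a lower numeric priority value means a higher priority, as the `current_prio > other_prio` comparison suggests. It touches the thread's fields only after the NULL test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for RIOT's STATUS_ON_RUNQUEUE; the value is arbitrary. */
#define MOCK_STATUS_ON_RUNQUEUE 9

/* Hypothetical stand-in for the two thread_t fields the check needs. */
typedef struct {
    uint8_t status;
    uint8_t priority;
} mock_thread_t;

/*
 * Returns true when a context switch should be requested for a thread of
 * priority `other_prio`. A NULL `active` models "no thread is running".
 */
static bool mock_should_switch(const mock_thread_t *active, uint16_t other_prio)
{
    if (active == NULL) {
        /* Nothing is running, so any runnable thread should be switched in. */
        return true;
    }

    /* Safe to dereference from here on. */
    bool on_runqueue = (active->status >= MOCK_STATUS_ON_RUNQUEUE);

    return !on_runqueue || (active->priority > other_prio);
}
```

With a running thread of priority 7, `mock_should_switch()` returns `true` for `other_prio = 6` and `false` for `other_prio = 7`. With `active == NULL` it returns `true` without reading any field.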

Versions

RIOT version: master (5267300)

Operating System Environment
----------------------------
         Operating System: "Ubuntu" "22.04.4 LTS (Jammy Jellyfish)"
                   Kernel: Linux 6.8.0-39-generic x86_64 x86_64
             System shell: /usr/bin/dash (probably dash)
             make's shell: /usr/bin/dash (probably dash)

Installed compiler toolchains
-----------------------------
               native gcc: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
        arm-none-eabi-gcc: arm-none-eabi-gcc (Arm GNU Toolchain 13.3.Rel1 (Build arm-13.24)) 13.3.1 20240614

Installed compiler libs
-----------------------
     arm-none-eabi-newlib: "4.4.0"

Installed development tools
---------------------------
                    cmake: cmake version 3.22.1
                  doxygen: 1.9.1
                      git: git version 2.39.2
                     make: GNU Make 4.3
                  openocd: Open On-Chip Debugger 0.12.0+dev-00682-gefe902219 (2024-08-13-14:06)
                  python3: Python 3.10.12


dylad · 2024-08-28T08:29:40Z

I could not reproduce this issue on p-nucleo-wb55 and nrf5340dk-app.
Could you try using a toolchain from ARM?

Maybe there is a specific issue on your toolchain or newlib package from Ubuntu.

Sanderhuisman · 2024-08-28T09:41:13Z

I've used ARM GCC toolchain 13.3 (downloaded from the website, placed in /opt/tools, and added to my PATH variable). Which toolchain did you use? I believe this problem only occurs when the scheduler is triggered from an interrupt. If another busy task starts running after the one thread exits, it won't occur (if I remember correctly 😉). Tomorrow I can run some tests with different toolchains myself.

It could indeed be an issue with my newlib/toolchain, but this one function assumes that a pointer isn't NULL, whereas the surrounding documentation comment says it can be (and at least one function does set it to NULL).

dylad · 2024-08-28T11:12:05Z

> I've used ARM GCC toolchain 13.3

Ok, I've misread the reported toolchain.
That's quite strange. I have no problem with 13.3 from ARM, and I've also tried 12.2 from ARM. So far, I haven't been able to reproduce this even once. Am I missing something?

maribu · 2024-08-28T15:42:18Z

I can reproduce the issue. It occurs only when the feature no_idle_thread is used, so it affects Cortex-M boards. You can work around this by disabling that feature using:

FEATURES_BLACKLIST=no_idle_thread make flash

I'll try to PR a proper fix soonish.
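
One plausible reading of why this feature matters, stated as an assumption rather than confirmed kernel behavior: with an idle thread there is always some thread to schedule, so the active-thread pointer stays valid, whereas without one the scheduler has nothing to select once the last runnable thread exits or blocks. The sketch below models only that difference on a host machine. `mock_thread_t`, `mock_idle`, and `mock_pick_next()` are hypothetical names and do not exist in RIOT.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a thread control block. */
typedef struct {
    const char *name;
} mock_thread_t;

/* The idle thread of this model; only selected when has_idle_thread is true. */
static mock_thread_t mock_idle = { "idle" };

/*
 * Picks the next thread to run. `ready` is the highest-priority runnable
 * thread, or NULL when every thread has exited or is blocked.
 */
static mock_thread_t *mock_pick_next(mock_thread_t *ready, bool has_idle_thread)
{
    if (ready != NULL) {
        return ready;
    }

    /* No runnable thread: fall back to the idle thread if one exists. */
    return has_idle_thread ? &mock_idle : NULL;
}
```

In this model, `mock_pick_next(NULL, true)` yields the idle thread, while `mock_pick_next(NULL, false)` yields `NULL`. That would match the reproducer, where the main thread waits for shell input after the second thread has exited, leaving nothing runnable until the next interrupt.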

Teufelchen1 added the "Type: bug" and "Area: core" labels on Aug 16, 2024

maribu self-assigned this Aug 19, 2024
