(cortex-m) unexpected kernel panic after thread exit #20812

Open
Sanderhuisman opened this issue Aug 14, 2024 · 4 comments
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

@Sanderhuisman

Sanderhuisman commented Aug 14, 2024

Description

I've noticed an unexpected kernel panic after a thread exited. I've traced it down to sched_switch in core/sched.c retrieving an invalid active thread. Inside sched_task_exit, the sched_active_thread pointer is being set to NULL and sched_switch does not check if the retrieved thread pointer points to a valid thread.

Steps to reproduce the issue

I have created a small application that triggers the problem (tested on an STM32 NUCLEO-F401RE, problem initially seen on an EFR32). It uses the shell module as the problem is triggered when the scheduler is invoked. After starting the application, enter a character in the console/terminal to invoke the scheduler.

Inside core/sched.c I've added an assertion to make the problem show up reliably (without it, things sometimes happen to work).

void sched_switch(uint16_t other_prio)
{
    thread_t *active_thread;
    uint16_t current_prio;
    int on_runqueue;

    active_thread = thread_get_active();
    assert(active_thread != NULL);

    current_prio = active_thread->priority;
    on_runqueue = (active_thread->status >= STATUS_ON_RUNQUEUE);

    DEBUG("sched_switch: active pid=%" PRIkernel_pid " prio=%" PRIu16 " on_runqueue=%i "
          ", other_prio=%" PRIu16 "\n",
          active_thread->pid, current_prio, on_runqueue,
          other_prio);

main.c

#include <stdint.h>
#include <stdio.h>

#include "shell.h"
#include "thread.h"

char second_thread_stack[THREAD_STACKSIZE_MAIN];

static const shell_command_t shell_commands[] = {
  {NULL, NULL, NULL},
};

void *second_thread(void *arg)
{
    (void) arg;

    puts("2nd: starting");

    puts("2nd: exiting");
    puts("Any character entered in the shell should now trigger the panic.");

    return NULL;
}

int main(void)
{
    int result = 0;

    puts("main: starting");

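    /* thread_create() returns the new thread's PID, or a negative errno
     * value on failure, so test for < 0 rather than == -1. */
    kernel_pid_t second_pid = thread_create(
      second_thread_stack,
      sizeof(second_thread_stack),
      THREAD_PRIORITY_MAIN - 1,
      THREAD_CREATE_WOUT_YIELD,
      second_thread,
      NULL,
      "nr2");
    if (second_pid < 0)
    {
        puts("main: Error creating 2nd thread.");
        result = -1;
    }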

    if (result == 0)
    {
        char line_buf[SHELL_DEFAULT_BUFSIZE];
        shell_run(shell_commands, line_buf, SHELL_DEFAULT_BUFSIZE);
    }

    return result;
}

Expected results

After accessing the console, I would expect the system to stay alive ;)

Actual results

After pressing Enter in the console, I get the following panic and stack trace.

> 2nd: starting
2nd: exiting
core/sched.c:288 => *** RIOT kernel panic:
FAILED ASSERTION.


ISR stack overflowed
Stack pointer corrupted, reset to top of stack
active thread: 2
FSR/FAR:
 CFSR: 0x00008200
 HFSR: 0x40000000
 DFSR: 0x00000008
 AFSR: 0x00000000
 BFAR: 0xffffffff
Misc
EXC_RET: 0xfffffff1
Inside isr -13

Potential Fix

I've changed sched_switch to include a check for active thread being valid to deal with threads having exited.

void sched_switch(uint16_t other_prio)
{
    thread_t *active_thread = thread_get_active();
    uint16_t current_prio = active_thread->priority;
    int on_runqueue = (active_thread->status >= STATUS_ON_RUNQUEUE);

    DEBUG("sched_switch: active pid=%" PRIkernel_pid " prio=%" PRIu16 " on_runqueue=%i "
        ", other_prio=%" PRIu16 "\n",
        active_thread != NULL ? active_thread->pid : KERNEL_PID_UNDEF,
        current_prio,
        on_runqueue,
        other_prio);

    if ((active_thread == NULL) || !on_runqueue || (current_prio > other_prio)) {
        if (irq_is_in()) {

I don't know whether sched_switch must be able to deal with this case or whether sched_task_exit shouldn't set sched_active_thread to NULL. The comment on thread_get_active suggests the former. In that case we need to check whether other functions also cannot deal with a NULL active thread, and potentially add assertions to help find those cases in the future.

/**
 * @brief   Returns a pointer to the Thread Control Block of the currently
 *          running thread
 *
 * @return  Pointer to the TCB of the currently running thread, or `NULL` if
 *          no thread is running
 */
static inline thread_t *thread_get_active(void)
....
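
Note that the snippet under "Potential Fix" still reads active_thread->priority and active_thread->status before the NULL check is reached. Below is a minimal sketch of a decision that checks the pointer first, plus an asserting accessor for callers that must never see NULL. It is a standalone model, not RIOT code: every `model_` / `MODEL_` name, the struct layout and the status value are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT RIOT's
 * real definitions of thread_t or STATUS_ON_RUNQUEUE. */
#define MODEL_STATUS_ON_RUNQUEUE 9

typedef struct {
    uint8_t priority;   /* lower value means higher priority */
    int status;         /* >= MODEL_STATUS_ON_RUNQUEUE means runnable */
} model_thread_t;

/* Decide whether a switch should be requested for a thread of priority
 * other_prio. `active` may be NULL, e.g. after the running thread exited. */
static bool model_should_switch(const model_thread_t *active, uint16_t other_prio)
{
    if (active == NULL) {
        /* Nothing is running, so there is nothing to preempt: let the
         * scheduler pick the next thread. */
        return true;
    }
    bool on_runqueue = (active->status >= MODEL_STATUS_ON_RUNQUEUE);
    return !on_runqueue || (active->priority > other_prio);
}

/* Alternative for callers that must never see NULL: fail loudly instead
 * of reading through a NULL pointer. */
static uint8_t model_active_priority(const model_thread_t *active)
{
    assert(active != NULL);
    return active->priority;
}
```

With this shape, the fields are only read after the pointer has been checked, and the asserting accessor marks the places that rely on a thread being active.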

Versions

RIOT version: master (5267300)

Operating System Environment
----------------------------
         Operating System: "Ubuntu" "22.04.4 LTS (Jammy Jellyfish)"
                   Kernel: Linux 6.8.0-39-generic x86_64 x86_64
             System shell: /usr/bin/dash (probably dash)
             make's shell: /usr/bin/dash (probably dash)

Installed compiler toolchains
-----------------------------
               native gcc: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
        arm-none-eabi-gcc: arm-none-eabi-gcc (Arm GNU Toolchain 13.3.Rel1 (Build arm-13.24)) 13.3.1 20240614

Installed compiler libs
-----------------------
     arm-none-eabi-newlib: "4.4.0"

Installed development tools
---------------------------
                    cmake: cmake version 3.22.1
                  doxygen: 1.9.1
                      git: git version 2.39.2
                     make: GNU Make 4.3
                  openocd: Open On-Chip Debugger 0.12.0+dev-00682-gefe902219 (2024-08-13-14:06)
                  python3: Python 3.10.12
@Teufelchen1 Teufelchen1 added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: core Area: RIOT kernel. Handle PRs marked with this with care! labels Aug 16, 2024
@maribu maribu self-assigned this Aug 19, 2024
@dylad
Member

dylad commented Aug 28, 2024

I could not reproduce this issue on p-nucleo-wb55 and nrf5340dk-app.
Could you try using a toolchain from ARM?

Maybe there is a specific issue on your toolchain or newlib package from Ubuntu.

@Sanderhuisman
Author

I used the ARM GCC toolchain 13.3 (downloaded from the website, placed in /opt/tools, and added to my PATH). Which toolchain did you use? I believe this problem only occurs when the scheduler is triggered from an interrupt. If another busy task starts running after the one thread has exited, it does not occur (if I remember correctly 😉). Tomorrow I can run some tests with different toolchains myself.

It could indeed be an issue with my newlib/toolchain, but sched_switch assumes that the pointer is never NULL, whereas the documentation comment on thread_get_active says it can be (and at least one function does set it to NULL).

@dylad
Member

dylad commented Aug 28, 2024

> I've used ARM GCC toolchain 13.3

Ok, I've misread the reported toolchain.
That's quite strange: I have no problem with 13.3 from ARM, and I've also tried 12.2 from ARM. So far I haven't been able to reproduce this even once. Am I missing something?

@maribu
Member

maribu commented Aug 28, 2024

I can reproduce the issue. It occurs only when the feature no_idle_thread is used, so it affects Cortex-M boards. You can work around this by disabling that feature using:

FEATURES_BLACKLIST=no_idle_thread make flash

I'll try to PR a proper fix soonish.
