
scheduler: priority inversion problem #7365

Closed
geith opened this issue Jul 14, 2017 · 19 comments
Labels: Area: core (RIOT kernel. Handle PRs marked with this with care!), Type: bug (The issue reports a bug / The PR fixes a bug, including spelling errors)

@geith (Contributor) commented Jul 14, 2017

The RIOT scheduler seems to have no protection against the priority inversion problem, where a task of medium priority can prevent the execution of a higher-priority task while the latter waits for a resource allocated by a low-priority task. The code example below demonstrates this behavior.

A common solution for this is a priority inheritance mechanism, where the owning task temporarily inherits the priority of the waiting task. But there might be other solutions too.
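
To illustrate the idea only: below is a minimal sketch of priority inheritance, not RIOT's actual mutex implementation. All types, fields, and the sched_change_priority() helper are assumptions made up for this sketch.

/* all names hypothetical, simplified for illustration */
typedef struct {
  uint8_t prio;             /* current priority; lower value = higher */
} pi_thread_t;

typedef struct {
  pi_thread_t *owner;       /* thread currently holding the lock */
  uint8_t owner_orig_prio;  /* owner's priority when it took the lock */
} pi_mutex_t;

/* assumed helper: set a thread's priority and update the run queue */
void sched_change_priority(pi_thread_t *t, uint8_t prio);

void pi_mutex_lock(pi_mutex_t *m, pi_thread_t *me)
{
  if (m->owner && (me->prio < m->owner->prio)) {
    /* boost the owner so a medium-priority thread can no longer
     * preempt it while we wait */
    sched_change_priority(m->owner, me->prio);
  }
  /* ... block until the mutex is released, then take it ... */
  m->owner = me;
  m->owner_orig_prio = me->prio;
}

void pi_mutex_unlock(pi_mutex_t *m)
{
  /* restore the owner's original priority before handing the lock on */
  sched_change_priority(m->owner, m->owner_orig_prio);
  m->owner = NULL;
  /* ... wake the highest-priority waiter ... */
}

Note that this naive restore-on-unlock scheme runs into trouble once mutexes are nested, as discussed further down in this thread.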

Even though this problem can mostly be avoided in the application design, such a feature could prevent critical complications and simplify the design of complex applications.
A famous example of this problem is the Mars Pathfinder mission, which got stuck in a reboot loop (triggered by the watchdog timer) and exceeded its energy budget. In that case, the problem could be fixed by remotely enabling the priority inheritance feature.

In the example below, a shared resource, represented by the mutex res_mtx, is used by a low- and a high-priority thread. A medium-priority thread that starts working after 3 s prevents the low-priority thread from freeing the mutex and thus the high-priority thread from being scheduled.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#include "thread.h"
#include "mutex.h"
#include "xtimer.h"

mutex_t res_mtx;

char stack_high[THREAD_STACKSIZE_MAIN];
char stack_mid[THREAD_STACKSIZE_MAIN];
char stack_low[THREAD_STACKSIZE_MAIN];

void *t_low_handler(void *arg)
{
  /* starting working loop immediately */
  while(1){
    printf("t_low: allocating resource...\n");
    mutex_lock(&res_mtx);
    printf("t_low: got resource.\n");
    xtimer_sleep(1);

    printf("t_low: freeing resource...\n");
    mutex_unlock(&res_mtx);
    printf("t_low: freed resource.\n");
    xtimer_sleep(1);
  }
  return NULL;
}

void *t_mid_handler(void *arg)
{
  /* starting working loop after 3s */
  xtimer_sleep(3);

  printf("t_mid: doing some stupid stuff...\n");
  while(1){
    thread_yield_higher();
  }
  return NULL;
}

void *t_high_handler(void *arg)
{
  /* starting working loop after 500 ms */
  xtimer_usleep(500 * 1000);
  while(1){
    printf("t_high: allocating resource...\n");
    mutex_lock(&res_mtx);
    printf("t_high: got resource.\n");
    xtimer_sleep(1);

    printf("t_high: freeing resource...\n");
    mutex_unlock(&res_mtx);
    printf("t_high: freed resource.\n");
  }
  return NULL;
}

kernel_pid_t pid_low;
kernel_pid_t pid_mid;
kernel_pid_t pid_high;

int main(void)
{
  xtimer_init();
  mutex_init(&res_mtx);
  puts("This is a scheduling test for Priority Inversion");

  pid_low = thread_create(stack_low, sizeof(stack_low),
      THREAD_PRIORITY_MAIN - 1,
      THREAD_CREATE_STACKTEST,
      t_low_handler, NULL,
      "t_low");

  pid_mid = thread_create(stack_mid, sizeof(stack_mid),
      THREAD_PRIORITY_MAIN - 2,
      THREAD_CREATE_STACKTEST,
      t_mid_handler, NULL,
      "t_mid");

  pid_high = thread_create(stack_high, sizeof(stack_high),
      THREAD_PRIORITY_MAIN - 3,
      THREAD_CREATE_STACKTEST,
      t_high_handler, NULL,
      "t_high");

  thread_sleep();
  return 0;
}
@jnohlgard (Member)

I understand the statement and I agree that this is a problem.
@geith would you like to create a pull request with your example code in a test application (for example, a new tests/thread_priority_inversion based on tests/thread_basic), to make it easier for others to check out and see the problem for themselves? I guess it makes sense to keep this program in the repo as a demonstration of the priority inversion problem, and to verify that it works once we have created a solution.

@haukepetersen (Contributor)

Thanks for pointing this out, this is indeed a known problem. As you also stated, RIOT has so far shifted this problem onto application developers, offering no means of protection against priority inversion inside the core/scheduler.

I agree with @gebart here: how about we add the code you provided as a test application (maybe with some documentation noting that this test will always fail with RIOT's default kernel configuration)? Then, based on that, we could look into mechanisms we can put into the kernel (as an optional feature) to enable e.g. priority inheritance, priority ceiling, or similar.
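
For comparison, a priority ceiling variant raises the locking thread to a pre-configured ceiling immediately on lock, instead of waiting for contention. A rough sketch, reusing the hypothetical pi_thread_t and sched_change_priority() names from the sketch above (not an actual RIOT API):

typedef struct {
  uint8_t ceiling;     /* highest priority of any thread that may lock this */
  uint8_t saved_prio;  /* locker's priority before the raise */
} pc_mutex_t;

void pc_mutex_lock(pc_mutex_t *m, pi_thread_t *me)
{
  /* ... acquire the underlying lock ... */
  m->saved_prio = me->prio;
  if (m->ceiling < me->prio) {  /* lower value = higher priority */
    sched_change_priority(me, m->ceiling);
  }
}

void pc_mutex_unlock(pc_mutex_t *m, pi_thread_t *me)
{
  sched_change_priority(me, m->saved_prio);
  /* ... release the underlying lock ... */
}

The trade-off: the ceiling must be configured per mutex ahead of time, but no waiter bookkeeping is needed at run time.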

@OlegHahm added the "Area: core" label Jul 31, 2017
@haukepetersen (Contributor)

I gave this some more thought and started to play around a little bit with mutexes, using the example above as a 'benchmark' -> see https://github.com/haukepetersen/RIOT/tree/add_core_prioinheritance

I thought there might be an easy way to integrate priority inheritance into the mutex code, but as I found out, it is not that simple. The problem I am currently looking at is caused by mutexes being nested (e.g., a thread locking more than one at the same time). In the example above this happens implicitly through xtimer_sleep, which uses a second mutex internally. In that case, my first attempt first raises the priority of t_low to the priority of t_high, but this is reset by the first call to mutex_unlock, done inside the xtimer. So it seems we need more precise tracking of which mutex caused a thread's priority change, and I am looking for efficient ways to do this...
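
One conceivable way to do that tracking per mutex rather than per thread: record in each mutex whether, and from which priority, it boosted its owner, and undo only that boost on unlock. Again only a sketch with made-up names (reusing pi_thread_t and sched_change_priority() from above); it handles the nested xtimer case described here, though not every possible interleaving:

#include <stdbool.h>

typedef struct {
  pi_thread_t *owner;
  uint8_t prio_before_boost;  /* only valid while 'boosted' is set */
  bool boosted;               /* did *this* mutex boost its owner? */
} pi_mutex_t;

void pi_mutex_lock(pi_mutex_t *m, pi_thread_t *me)
{
  if (m->owner && (me->prio < m->owner->prio)) {
    if (!m->boosted) {
      m->prio_before_boost = m->owner->prio;
      m->boosted = true;
    }
    sched_change_priority(m->owner, me->prio);
  }
  /* ... block until released, then take ownership ... */
  m->owner = me;
  m->boosted = false;
}

void pi_mutex_unlock(pi_mutex_t *m)
{
  if (m->boosted) {
    /* undo only the boost granted through this mutex; unlocking
     * xtimer's internal mutex (never contended by t_high) is a no-op */
    sched_change_priority(m->owner, m->prio_before_boost);
    m->boosted = false;
  }
  m->owner = NULL;
  /* ... wake the highest-priority waiter ... */
}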

@haukepetersen (Contributor)

@geith: While I was looking into this, I opened a PR with an adapted version of your test application: #7444

@bergzand (Member) commented Aug 4, 2017

Does a thread have a list of all acquired locks? If so, the priority could be reset to the highest lending priority among the still-locked mutexes. Only the original thread priority would have to be stored with the thread.
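
The idea in sketch form, assuming each thread kept such a list and each mutex could report its highest-priority waiter (hypothetical structures, not RIOT's actual ones):

typedef struct held_mutex {
  struct held_mutex *next;  /* linked list of mutexes this thread holds */
  uint8_t top_waiter_prio;  /* best priority among threads blocked on it */
} held_mutex_t;

typedef struct {
  held_mutex_t *held;       /* head of the list of held mutexes */
  uint8_t base_prio;        /* the thread's own, un-boosted priority */
} pi_thread_t;

/* recompute the effective priority on every lock/unlock/wait event:
 * the best of the thread's own priority and all lending waiters */
static uint8_t effective_prio(const pi_thread_t *t)
{
  uint8_t p = t->base_prio;
  for (const held_mutex_t *m = t->held; m; m = m->next) {
    if (m->top_waiter_prio < p) {  /* lower value = higher priority */
      p = m->top_waiter_prio;
    }
  }
  return p;
}

On unlock, the mutex would be removed from the list and the thread's priority reset to effective_prio(), so losing one lender never cancels a boost still justified by another.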

@haukepetersen (Contributor)

Nope, that information is not available. But I think I found another way, a PR will follow in the next few hours...

@kaspar030 (Contributor)

A famous example of this problem is the Mars Pathfinder mission, which got stuck in a reboot loop

Do we have any non-theoretical, real-world problem here? Pathfinder was launched twenty years ago.

IMO this only gets "solved" (e.g., by priority inheritance) because it is actually doable, compared to preventing all deadlocks. Why do we solve this (programming error) but not "thread A locks mutex 1, thread B locks mutex 2, then A locks mutex 2, then B locks mutex 1 -> classic deadlock"?
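
That scenario spelled out with the same RIOT primitives as the test program above (thread setup and includes omitted; the sleeps just force the fatal interleaving):

mutex_t m1, m2;

void *thread_a(void *arg)
{
  (void)arg;
  mutex_lock(&m1);    /* A holds m1 */
  xtimer_sleep(1);    /* give B time to grab m2 */
  mutex_lock(&m2);    /* blocks forever: m2 is held by B ... */
  return NULL;        /* never reached */
}

void *thread_b(void *arg)
{
  (void)arg;
  mutex_lock(&m2);    /* B holds m2 */
  xtimer_sleep(1);
  mutex_lock(&m1);    /* ... and B blocks on m1, held by A */
  return NULL;        /* never reached */
}

Neither priority inheritance nor priority ceiling can resolve this cycle; both threads stay blocked no matter how priorities are shuffled.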
Any solution will not come for free in terms of code size and cycles.

@haukepetersen (Contributor)

Pathfinder was launched twenty years ago.

But that doesn't mean that programming errors like that are not made anymore... And why not give developers something to make their applications potentially more error-proof?!

Any solution will not come for free in terms of code size and cycles

That's for sure! But adding some priority inversion prevention as an optional module would not hurt, right?!

@kaspar030 (Contributor)

But adding some priority inversion prevention as an optional module would not hurt, right?!

Having it not only does not hurt, it would be nice. I'm just asking myself if we're solving a problem that no one actually encountered in the wild. So why invest the time? (unless it's fun! :) )

@haukepetersen (Contributor)

no one actually encountered in the wild

Actually, I believe this is encountered quite a bit in the wild and is not something that was a problem only 100 years ago. I think the reason one does not hear about this very often anymore is that all major real-time OSes do have priority inversion prevention integrated by default or as a module... (-> see FreeRTOS, Zephyr, Mynewt, ...).

@bergzand (Member) commented Aug 4, 2017

I've always been taught priority inheritance as a way to minimize the blocking time of higher-priority tasks for better real-time guarantees.

@kaspar030 (Contributor)

Actually, I believe

Well, I don't. :) Any references to this being a real problem?

I've always been taught priority inheritance as a way to minimize the blocking time of higher-priority tasks for better real-time guarantees.

True that.

Alright, I didn't want to discourage fixing this, carry on guys!

@haukepetersen (Contributor)

Fixed it, just preparing the PR...

@haukepetersen (Contributor)

done #7445

@stale (bot) commented Aug 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale bot added the "State: stale" label Aug 10, 2019
@stale bot closed this as completed Sep 10, 2019
@aabadie added the "Type: bug" label Sep 21, 2019
@aabadie reopened this Sep 21, 2019
@stale bot removed the "State: stale" label Sep 21, 2019
@kaspar030 removed the "Type: bug" label Sep 21, 2019
@kaspar030 (Contributor)

This is not a bug, but a design choice.

@stale (bot) commented Mar 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale bot added the "State: stale" label Mar 24, 2020
@jia200x (Member) commented Apr 3, 2020

Here's a real world case where priority inversion is indeed a problem: #13573 (comment)

@stale bot removed the "State: stale" label Apr 3, 2020
@miri64 added the "Type: bug" label Jul 1, 2020
@miri64 added this to the Release 2020.07 milestone Jul 1, 2020
@MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@maribu (Member) commented Sep 16, 2022

Fixed by #17040

@maribu closed this as completed Sep 16, 2022