
Possible problem in scheduler #5769

Closed
melshuber opened this issue Aug 23, 2016 · 11 comments
Assignees: kaspar030
Labels: "Area: core" (RIOT kernel. Handle PRs marked with this with care!), "Type: bug" (the issue reports a bug / the PR fixes a bug, including spelling errors)

Comments

@melshuber

I am not totally sure, but I think there might be a problem in sched_set_status:
https://github.com/RIOT-OS/RIOT/blob/master/core/sched.c#L130-L150
runqueue_bitcache is not accessed atomically; I think it should be.

e.g.: runqueue_bitcache |= 1 << process->priority

I found that in our project, one of our threads sometimes hung. I checked for deadlocks but did not find any, so I looked at the stacks of the threads and saw that the thread had stopped somewhere in msg_receive. The thread state was pending, but the thread was not scheduled anymore.

Finally I found that:

  • the thread was queued in the runqueue of the respective priority, and
  • the corresponding bit in runqueue_bitcache was not set.

I used gdb to set the bit in runqueue_bitcache manually, and the system got kick-started again.

So something introduced an inconsistency between the runqueues and the bitcache.
I think the reason is that runqueue_bitcache is not accessed atomically.
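
To spell out the suspected lost-update race: the |= is a plain read-modify-write, so if an interrupt handler also touches runqueue_bitcache between the load and the store, the handler's update is silently overwritten. A minimal sketch of this (illustrative only; the variable and function names here are stand-ins, not the actual core/sched.c code):

```c
/* Illustrative sketch of the suspected race (not RIOT code). On a
 * Cortex-M3, `runqueue_bitcache |= 1 << prio` compiles to a plain
 * load / or / store sequence, which an interrupt can split. */
#include <stdint.h>

static uint32_t runqueue_bitcache;   /* stand-in for the scheduler's copy */

void set_bit_racy(unsigned priority)
{
    uint32_t tmp = runqueue_bitcache;  /* 1: load                           */
                                       /* <-- ISR fires here and sets a     */
                                       /*     different bit in the cache    */
    tmp |= 1UL << priority;            /* 2: modify                         */
    runqueue_bitcache = tmp;           /* 3: store -- overwrites the ISR's  */
                                       /*    update, losing its bit         */
}
```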

I have already prepared a fix that you can review:

@kaspar030 kaspar030 added the Area: core Area: RIOT kernel. Handle PRs marked with this with care! label Aug 23, 2016
@kaspar030 kaspar030 self-assigned this Aug 23, 2016
@kaspar030
Contributor

> I think the reason is that runqueue_bitcache is not accessed atomically

The access to 'runqueue_bitcache' should be protected by disabling interrupts.
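
A minimal sketch of that guard pattern (illustrative, not a verbatim copy of core/sched.c; the function name is a stand-in). The irq_disable/irq_restore names are the current RIOT API; the 2015.09 release the reporter uses calls them disableIRQ/restoreIRQ:

```c
/* With interrupts disabled around the read-modify-write, no ISR can
 * interleave between the load and the store, so no update is lost. */
#include <stdint.h>
#include "irq.h"                      /* RIOT: irq_disable(), irq_restore() */

static uint32_t runqueue_bitcache;    /* stand-in for the scheduler's copy */

static void queue_thread(unsigned priority)
{
    unsigned state = irq_disable();         /* enter critical section  */
    runqueue_bitcache |= 1UL << priority;   /* safe read-modify-write  */
    irq_restore(state);                     /* leave critical section  */
}
```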

> saw that the thread stopped somewhere in msg_receive

Can you pinpoint where exactly? Is the thread "STATUS_RECEIVE_BLOCKED"?

What platform are you working on?

@melshuber
Author

Thanks for the info.

It is true: atomic access alone is not enough.
I checked again, and all calls to sched_set_status seem to be protected by disableIRQ.

The hung thread (a network interface) was in STATUS_PENDING - at least that is what ps told me.
I stepped through sched_run multiple times and confirmed that the thread was queued in the correct run-queue. Other threads were still active.

The thread's msg_queue was full (no surprise), and another thread (ipv6) was blocked in msg_send_receive because it wanted to communicate with the hung network interface thread.

I am using an stm32f1 variant with a Cortex-M3 on a custom PCB.

Note that I am using release 2015.09, and I currently cannot rebase onto a newer RIOT release.
Do you know if anything relevant has changed since?

When this problem occurs again, I will try to pinpoint more exactly where the thread hangs.

Best,
Martin

@miri64
Member

miri64 commented Oct 18, 2016

@kaspar030 ping?

@kaspar030
Contributor

@melshuber Did you encounter this again?

@melshuber
Author

@kaspar030: not yet, but we are currently running on a patch that replaces the bit operations on runqueue_bitcache with atomic_[set|clr]_bit.

I know that's not the correct solution, but I have not yet found the time to dig into this.
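
For reference, a minimal C11-style sketch of what such helpers could look like (hypothetical; the patch's actual atomic_[set|clr]_bit implementation is not shown in this thread). Making the read-modify-write itself atomic prevents lost updates, although, as noted above, atomicity alone may not fix the underlying inconsistency:

```c
/* Hypothetical C11 equivalent of the workaround described above:
 * the fetch-or / fetch-and make each bit update a single atomic
 * read-modify-write, so an interleaving ISR cannot lose an update. */
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t runqueue_bitcache;

static inline void atomic_set_bit(unsigned bit)
{
    atomic_fetch_or(&runqueue_bitcache, UINT32_C(1) << bit);
}

static inline void atomic_clr_bit(unsigned bit)
{
    atomic_fetch_and(&runqueue_bitcache, ~(UINT32_C(1) << bit));
}
```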

@smlng
Member

smlng commented Jul 6, 2017

any news here?

@melshuber
Author

Not yet, but I am currently only working part-time on this project.

@stale

stale bot commented Aug 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Aug 10, 2019
@stale stale bot closed this as completed Sep 10, 2019
@aabadie aabadie added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label Sep 21, 2019
@aabadie aabadie reopened this Sep 21, 2019
@stale stale bot removed the State: stale State: The issue / PR has no activity for >185 days label Sep 21, 2019
@miri64
Member

miri64 commented Jun 30, 2020

Any progress on this?

@miri64 miri64 added this to the Release 2020.07 milestone Jun 30, 2020
@melshuber
Author

Sorry, I am no longer working on this.

@miri64
Member

miri64 commented Jul 3, 2020

Then I would close this: you yourself weren't sure about the problem from the start, and as far as I can interpret the discussion above, it wasn't reproducible. Shout if you disagree.

@miri64 miri64 closed this as completed Jul 3, 2020