
Enforce usage of K_THREAD_STACK_SIZEOF macro in k_thread_create() #14269

Closed
ioannisg opened this issue Mar 11, 2019 · 21 comments · Fixed by #24714
Labels
area: Architectures · area: Kernel · area: Memory Protection · bug (The issue is a bug, or the PR is fixing a bug) · priority: low (Low impact/importance bug)

Comments

@ioannisg
Member

Describe the bug
There is some inconsistency in how the macros related to thread stack definition are actually used in Zephyr applications/samples/tests/subsystems. Ideally, we want to

  • define the thread stack object with K_THREAD_STACK_DEFINE(sym, nominal size)
  • call k_thread_create() with stack_size = K_THREAD_STACK_SIZEOF(sym)

In this way we guarantee that k_thread_create() is given the amount of stack area (in bytes) that can actually be used by the thread, excluding, for instance, a privileged stack guard (in ARM builds).

However, this is not always the case: in several places (including even the _main thread) k_thread_create() is passed nominal_size instead of K_THREAD_STACK_SIZEOF(sym). If privileged stack guards are used on ARM builds that require power-of-two start/size alignment of stack objects, this is a bug, because k_thread_create() is supplied with a size value larger than what the thread can actually use: in that configuration no extra area is allocated for the guard.

In ARM builds with CONFIG_MPU_REQUIRES_POWER_OF_TWO_ALIGNMENT && CONFIG_USERSPACE we have a workaround for this, but it effectively eats up 32 bytes from each stack object.

Expected behavior

  • Always call k_thread_create() with stack_size = K_THREAD_STACK_SIZEOF(stack object), where the stack object was defined with K_THREAD_STACK_DEFINE(sym, nominal size)
  • Document this properly in the documentation
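
For illustration, a minimal sketch of the intended pattern (the stack, thread, and entry names below are placeholders, not code from the tree):

    #include <zephyr.h>

    #define MY_STACK_SIZE 1024
    #define MY_PRIORITY   5

    K_THREAD_STACK_DEFINE(my_stack_area, MY_STACK_SIZE);
    static struct k_thread my_thread_data;

    static void my_entry(void *p1, void *p2, void *p3)
    {
            /* thread body */
    }

    void start_worker(void)
    {
            /* Pass the usable size of the stack object, not the nominal
             * MY_STACK_SIZE, so any reserved guard area is excluded. */
            k_thread_create(&my_thread_data, my_stack_area,
                            K_THREAD_STACK_SIZEOF(my_stack_area),
                            my_entry, NULL, NULL, NULL,
                            MY_PRIORITY, 0, K_NO_WAIT);
    }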
@ioannisg ioannisg added the bug, priority: medium, area: Kernel, and area: Architectures labels Mar 11, 2019
@ioannisg ioannisg added this to the v1.15.0 milestone Mar 11, 2019
@andrewboie
Contributor

I agree there's a problem here.
I need to think about this some more before I can agree that the solution is to always pass K_THREAD_STACK_SIZEOF(nnn) to k_thread_create().

@ioannisg
Member Author

ioannisg commented Mar 21, 2019

So, some examples for the ARMv7-M MPU case, @andrewboie, to clarify this and show one more implication:

Assume guards are used and no workaround for power-of-two alignment.

Example 1

  • STACKSIZE = 1024
  • K_THREAD_STACK_DEFINE(sym, STACKSIZE) --> sym[1024]
  • K_THREAD_STACK_SIZEOF(sym) is used as the stack_size argument to k_thread_create()
  • stack_size = 1024 - 32
  • stack_info.start = sym + 32, stack_info.size = 1024 - 32
  • MPU programs thread stack as [sym, sym + stack_info.size + stack_info.start - stack_obj] = [sym, sym + 1024 - 32 + 32] => [sym, sym+1024]

Example 2 (same as above but developer does not use K_THREAD_STACK_SIZEOF):

  • STACKSIZE = 1024
  • K_THREAD_STACK_DEFINE(sym, STACKSIZE) --> sym[1024]
  • STACKSIZE is used to pass stack_size argument to k_thread_create()
  • stack_size = 1024
  • stack_info.start = sym + 32, stack_info.size = 1024 WRONG
  • MPU programs thread stack as [sym, sym + stack_info.size + stack_info.start - stack_obj] = [sym, sym + 1024 + 32] = [sym, sym + 1056], rounded up to the next power of two => [sym, sym + 2048] WRONG

The above example shows that we need a workaround if we are not sure what is passed to k_thread_create().


Assume now that the workaround for size adjustment is used. We repeat the examples:

Example 1-updt

  • STACKSIZE = 1024
  • K_THREAD_STACK_DEFINE(sym, STACKSIZE) --> sym[1024]
  • K_THREAD_STACK_SIZEOF(sym) is used as the stack_size argument to k_thread_create()
  • stack_size = 1024 - 32 - 32 (work-around)
  • stack_info.start = sym + 32, stack_info.size = 1024 - 64
  • MPU programs thread stack as [sym, sym + stack_info.size + stack_info.start - stack_obj] = [sym, sym + 1024 - 64 + 32] => [sym, sym+1024-32]

This works if the MPU rounds the region up to the next power of two.
However, this will not work for a thread stack with STACKSIZE = 64, since the two 32-byte adjustments (guard carve-out plus workaround) would consume the entire buffer.
So, our workaround is only suitable for thread stacks of at least 65 bytes.

Example 2-updt (same as 1-updt but developer does not use K_THREAD_STACK_SIZEOF):

  • STACKSIZE = 1024
  • K_THREAD_STACK_DEFINE(sym, STACKSIZE) --> sym[1024]
  • STACKSIZE is used to pass stack_size argument to k_thread_create()
  • stack_size = 1024 - 32 (work-around)
  • stack_info.start = sym + 32, stack_info.size = 1024-32 CORRECT
  • MPU programs thread stack as [sym, sym + stack_info.size + stack_info.start - stack_obj] = [sym, sym + 1024 -32 + 32] => [sym, sym + 1024] CORRECT

In other words, since the workaround assumes the worst case and this is the worst case, the result is fine.


Conclusion: we need a formal and deterministic way of passing the available size to k_thread_create(), so that we do not need workarounds.
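
A small standalone sketch of the arithmetic in the examples above (plain C for illustration, not Zephyr code; the 32-byte guard and the power-of-two rounding are the ARMv7-M assumptions stated earlier):

    #include <stdio.h>
    #include <stddef.h>

    #define GUARD_SIZE 32u   /* ARMv7-M MPU guard, as in the examples above */

    /* Region covers [sym, stack_info.start + stack_info.size), where
     * stack_info.start = sym + GUARD_SIZE and stack_info.size = passed size. */
    static size_t region_span(size_t passed_stack_size)
    {
            return GUARD_SIZE + passed_stack_size;
    }

    /* ARMv7-M MPU regions must be power-of-two sized (min 32 bytes) */
    static size_t pow2_ceil(size_t x)
    {
            size_t p = 32;

            while (p < x) {
                    p <<= 1;
            }
            return p;
    }

    int main(void)
    {
            /* Example 1: caller passes K_THREAD_STACK_SIZEOF -> 1024 - 32 */
            printf("%zu\n", pow2_ceil(region_span(1024 - 32))); /* 1024: correct */
            /* Example 2: caller passes the nominal 1024 */
            printf("%zu\n", pow2_ceil(region_span(1024)));      /* 2048: wrong   */
            return 0;
    }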

@ioannisg
Member Author

One other option, @andrewboie, is to detect in k_thread_create() the difference between the total size and the size derived from the object metadata (z_object_find()) and add this difference to what we pass into ARM's z_new_thread().
I.e. not only perform checks in kernel/thread.c but also use the value in the arch-specific code.

@andrewboie
Contributor

andrewboie commented Mar 21, 2019

Yes, I favor the latter approach. I'd like to see if we can, for all arches, establish an equivalence, where if we have:

   #define STACKSIZE  <size>
   K_THREAD_STACK_DEFINE(my_stack, STACKSIZE)

For any value of <size> the following calls have an identical effect:

k_thread_create(&my_thread, &my_stack, STACKSIZE, ...)

k_thread_create(&my_thread, &my_stack, K_THREAD_STACK_SIZEOF(my_stack), ...)

Either of these calls should produce the same values in my_thread.stack_info when the thread is started.

If we have this setup correctly, either the common k_thread_create() code, or the arch z_new_thread() would do the necessary math to ensure that these work identically.

I'm sketching out the particulars and will post a follow-up once I think I have the details worked out.

@andrewboie
Contributor

Based on what we have seen porting userspace to 3 arches, I think we can establish some invariant properties of thread stack objects. And I think we can make the layouts the same for all arches.

The following sizes are important:

  • REGION_GRANULARITY - For non power of two arches, this is the granularity of MPU/MMU region sizes. For example on x86 this is an MMU page size of 4K.

  • REGION_MIN - Minimum size of an MMU/MPU region. 4K for x86, 32 bytes for ARMv7 and ARC MPU version 3, 2K for ARC MPU v2, etc.

  • K_THREAD_STACK_GUARD_SIZE, aka STACK_GUARD_SIZE / MPU_GUARD_ALIGN_AND_SIZE - A fixed-size region of memory that precedes any memory that is used for a stack when the CPU is in supervisor mode. Memory used for a stack when a thread is in user mode does not need a guard. The guard itself must be appropriately aligned so that an MMU/MPU region can be programmed for its boundaries, and should be of size REGION_MIN. If guards are not used, this must be of size 0. This value is #defined to different names for ARC and ARM and doesn't have its own definition for x86. We should call it K_THREAD_STACK_GUARD_SIZE to be consistent.

  • CONFIG_PRIVILEGED_STACK_SIZE - Size of stack area for performing system calls. Privileged stacks must be in memory only accessible to the kernel, but do not require MMU/MPU regions for them. The size is arbitrary, but fixed for all threads. Must be immediately preceded by a region of K_THREAD_STACK_GUARD_SIZE for its stack overflow guard. If CONFIG_USERSPACE=n, this will have a value of 0.

  • K_THREAD_STACK_RESERVED - The total amount of fixed-sized memory that is reserved inside any given stack object for the privileged stack, and the stack guard. Once we move ARM's privileged stack back into the stack object, this is equivalent to CONFIG_PRIVILEGED_STACK_SIZE + K_THREAD_STACK_GUARD_SIZE.

  • provided_stack_size - The stack buffer size provided to K_THREAD_STACK_DEFINE(). This value can be anything and is supplied by the user when the stack object is defined.

  • K_THREAD_STACK_SIZE_ADJUST() - This macro is not defined yet, but will represent the rounding done to provided_stack_size to meet the constraints such that an MPU/MMU region can be programmed to perfectly enclose its boundaries, and that a guard region may be placed after it. This must be idempotent.
    -- For power-of-two arches:
    #define K_THREAD_STACK_SIZE_ADJUST(size) MAX(POW2_CEIL(size), REGION_MIN)
    -- For non-power-of-two arches:
    #define K_THREAD_STACK_SIZE_ADJUST(size) ROUND_UP(size, REGION_GRANULARITY)

  • thread.stack_info.size - The actual stack buffer size. A sub-region of the stack object. This is reported at runtime; there will be no magic macros for it. If the thread is running in user mode, it must be sized/aligned such that an MPU/MMU region can be programmed to perfectly enclose its boundaries. If the thread is running in supervisor mode, the otherwise unused privileged stack area is added to it. Its value should be a function of provided_stack_size and whether the thread is in user mode or not:
    thread.stack_info.size = K_THREAD_STACK_SIZE_ADJUST(provided_stack_size) + (is_user_mode() ? 0 : CONFIG_PRIVILEGED_STACK_SIZE)

  • stack_object_size - What sizeof(my_stack) returns. The total size of all memory for the stack object.
    -- sizeof(my_stack) = K_THREAD_STACK_SIZE_ADJUST(provided_stack_size) + K_THREAD_STACK_RESERVED

Now let's show the memory layout for threads in supervisor and in user mode. In this new layout (currently only followed by ARC), the location of the guard area moves depending on what mode the thread is in:

Supervisor mode:

+-----------------------------------------------------------------------+
| guard area  | stack buffer                                            |
+-----------------------------------------------------------------------+

User mode:

+-----------------------------------------------------------------------+
| stack buffer                          | guard area | privileged stack |
+-----------------------------------------------------------------------+

This layout has the following properties:

  • There are no carve-outs of any kind: we will not subtract any value from the requested stack size to accommodate guard memory.
  • Because the offset of the stack buffer moves depending on what privilege level the thread is in, K_THREAD_STACK_BUFFER() can't work as a macro. I simply want to delete it as an API and compel users to look at thread.stack_info for this information instead; see "K_THREAD_STACK_BUFFER() is broken" #14766
  • thread.stack_info gets reprogrammed when a supervisor thread drops to user mode since the buffer and guard locations change.
  • The user mode stack buffer is always at the beginning of the object, making aligning it easy for any arch regardless of whether we are power-of-two or not.

Now, how do we establish equivalence, such that things work the same way if we have:

   #define STACKSIZE  <size>
   K_THREAD_STACK_DEFINE(my_stack, STACKSIZE)

And call either:

k_thread_create(&my_thread, &my_stack, STACKSIZE, ...)
or:
k_thread_create(&my_thread, &my_stack, K_THREAD_STACK_SIZEOF(my_stack), ...)

We can simply do:

#define K_THREAD_STACK_SIZEOF(stack) (sizeof(stack) - K_THREAD_STACK_RESERVED)

And in k_thread_create, apply stack_size = K_THREAD_STACK_SIZE_ADJUST(stack_size) before passing it to z_new_thread(). This will result in the same value being passed to z_new_thread() for either scenario.

Incidentally, this setup should also allow us to completely define a lot of these stack macros directly in kernel.h without having to define Z_ARCH_THREAD_STACK_.... macros in each arch header.
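
A self-contained sketch of the proposed equivalence (plain C; REGION_MIN, the 512-byte privileged stack, and the 32-byte guard are illustrative values for a power-of-two arch, and adjust() stands in for the proposed K_THREAD_STACK_SIZE_ADJUST()):

    #include <stdio.h>
    #include <stddef.h>

    #define REGION_MIN  32u
    #define PRIV_STACK  512u                 /* CONFIG_PRIVILEGED_STACK_SIZE */
    #define GUARD       32u                  /* K_THREAD_STACK_GUARD_SIZE    */
    #define RESERVED    (PRIV_STACK + GUARD) /* K_THREAD_STACK_RESERVED      */

    static size_t pow2_ceil(size_t x)
    {
            size_t p = 1;

            while (p < x) {
                    p <<= 1;
            }
            return p;
    }

    /* Stand-in for the proposed K_THREAD_STACK_SIZE_ADJUST(); idempotent */
    static size_t adjust(size_t size)
    {
            size_t a = pow2_ceil(size);

            return a < REGION_MIN ? REGION_MIN : a;
    }

    int main(void)
    {
            size_t provided = 1000; /* value given to K_THREAD_STACK_DEFINE() */
            size_t object_size = adjust(provided) + RESERVED; /* sizeof(my_stack) */

            /* k_thread_create() applies adjust() to whatever the caller passed */
            size_t via_nominal = adjust(provided);               /* caller passed STACKSIZE             */
            size_t via_sizeof  = adjust(object_size - RESERVED); /* caller passed K_THREAD_STACK_SIZEOF */

            printf("%zu == %zu\n", via_nominal, via_sizeof); /* identical: 1024 == 1024 */
            return 0;
    }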

@ioannisg
Member Author

@andrewboie I think this is a nice analysis; we should see if it is possible to implement this for the next Zephyr release.

I have some high-level comments:

  • I agree completely that there is no real need to have the Privilege Stack area in a separate place (we do this for ARM)
  • The macro infrastructure for defining and allocating thread stacks is good. I am stressing (also here) that we really need a separate macro framework for threads that are supervisor-only. Those will have more relaxed alignment requirements.
  • Regarding the guards: I believe we need to completely rework the concept of the guard region. We have seen, in the case of building with FP support, that the guard size needs to be configurable (ARM: MPU-based HW thread stack protection not working properly when building with CONFIG_FLOAT #14828). For ARM with FP we need guards of 128 bytes, which wastes a lot of SRAM. Therefore, I think we need to redesign how we guard against stack overflow of supervisor threads. See the attached proposal (ARM_MPU_minor_rework_proposal.pdf).

@andrewboie
Contributor

andrewboie commented Mar 22, 2019

I agree completely that there is no real need of having the Privilege Stack area in separate place (we do this for ARM)

Actually, I am not so sure now. The problem is base address alignment of stacks on power-of-two systems.

With the new proposed layout, consider a stack declared with size 2K. This generates a stack object of 2K + 32 bytes (guard) + 512 bytes (privileged stack), for a total of 2592 bytes.

Now consider several of these stacks defined. Each one has to have its lowest memory address aligned to 2K. I suspect the linker is going to leave 1504 bytes of wasted space between each instance, since the next 2K-aligned address after 2592 bytes is at offset 4096! I have not yet verified this experimentally, but need to do so before we pursue this route further. I do not believe the linker is smart enough to fill the gaps with other, smaller objects, but I need to test to see what happens on ARC (which is already doing this for MPU v2 systems).

With the privileged stacks in separate memory, the 2K stacks can be packed in memory very efficiently by the linker.

@andrewboie
Contributor

ARM_MPU_minor_rework_proposal.pdf

For supervisor threads, I like the idea of just making a guard MPU region out of whatever memory precedes the thread stack, without a specific guard region. Threads should not be sharing pointers with each other based on locations within stack frames. The only trick is what to do with the very first stack in that region, since it will not be preceded by other stacks but by some different memory.

However, for privilege elevation, in a design where the priv stack is immediately after the user stack, turning the user-mode stack into a guard won't work: it's not unreasonable to expect system calls to perform memory operations on memory areas defined in the caller's stack frame.

But as I posted earlier, having the priv and user stacks be regions in the same stack object may waste tons of RAM. In a scheme where they are in separate areas of memory, we could make this work, since privilege stacks will be adjacent only to other privilege stacks, with no expectation of data sharing between them.

@andrewboie
Contributor

I am stressing (also here) that we really need separate macro framework for thread that are supervisor-only. Those ones will have more relaxed alignment requirements.

Agreed. I haven't yet come up with a good scheme to ensure that a k_thread_create() call which creates a user thread with one of these supervisor-only stacks fails, but I'm thinking about it.

@andrewboie
Contributor

andrewboie commented Mar 22, 2019

How about this design, which I think has the best aspects of what we have been thinking about:

  1. All arches use the privilege stack generation mechanism.
  2. Privilege stacks are stored together in a designated memory region, essentially forming an array since they are all the same size. Pbegin .... Pend
  3. Thread stacks are also stored together in a designated memory region. Sbegin ... Send
  4. Assume no sharing of pointers between different threads that exist in stack frames... we already enforce this for user threads. I do recall some BT code which puts a k_sem on the stack and then waits on another thread or ISR to give that k_sem; such code will need refactoring. See bt_hci_cmd_send_sync(). I emailed @jhedberg about it.
  5. Guards are created by setting MPU regions for memory immediately before the stack buffer being used in supervisor mode. For a privilege stack, this will be memory for the preceding privilege stack. For a supervisor thread, this will be some memory in another thread's stack buffer. We will need a way to gracefully handle the first stack, we might need a guard-sized area of dead memory, but just one.

This could work pretty well:

  • K_THREAD_STACK_BUFFER(stack) is just the stack object pointer. There's no offset. I won't need to get rid of this API like I had been considering for K_THREAD_STACK_BUFFER() is broken #14766
  • K_THREAD_STACK_SIZEOF(stack) is just sizeof(stack).
  • No dead memory carved out for guard pages.
  • Privilege stack generation code will note which stacks have been declared supervisor-only, and not generate a privilege stack for them. We could set linker symbols before and after the memory containing these stacks, and the script could just check if the stack buffer falls between them. At runtime, this will generate a fault if a user thread is started with one of these stacks, since the privilege stack lookup will fail.
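
As a rough illustration of that last point (the symbol and function names here, e.g. z_super_stacks_start/end, are made up for the sketch and are not existing Zephyr symbols): the supervisor-only stacks could be grouped between two linker symbols, and the generation script or the kernel could then tell them apart with a simple range check.

    #include <stdbool.h>

    /* Hypothetical linker symbols bounding the supervisor-only stack section */
    extern char z_super_stacks_start[];
    extern char z_super_stacks_end[];

    /* True if the given stack object lives in the supervisor-only section,
     * i.e. no privilege stack was generated for it. */
    static bool stack_is_supervisor_only(const void *stack_obj)
    {
            const char *p = stack_obj;

            return (p >= z_super_stacks_start) && (p < z_super_stacks_end);
    }

Starting a user thread on such a stack would then fault, matching the "privilege stack lookup will fail" behavior described above.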

@andrewboie
Contributor

andrewboie commented Mar 22, 2019

I spoke to @jhedberg and @andyross and I think forbidding supervisor-to-supervisor stack access needs more contemplation before we commit to it.

Here's a truth table of what is accepted for access when Context A tries to access memory on Context B's stack:

Context A  | Context B  | Current | Proposed
-----------+------------+---------+---------
User       | User       |    N    |    N
Supervisor | User       |    Y    |    Y
ISR        | User       |    Y    |    Y
User       | Supervisor |    N    |    N
Supervisor | Supervisor |    Y    |    N
ISR        | Supervisor |    Y    |    Y

The change is to forbid Supervisor -> Supervisor thread stack access. We can position the ISR/stack guard in such a way to still allow ISRs to write to stacks.

@ioannisg
Member Author

Privilege stacks are stored together in a designated memory region, essentially forming an array since they are all the same size. Pbegin .... Pend

This is fine. I believe we do this already.

@ioannisg
Member Author

Thread stacks are also stored together in a designated memory region. Sbegin ... Send

That's fine too.

@ioannisg
Member Author

Guards are created by setting MPU regions for memory immediately before the stack buffer being used in supervisor mode. For a privilege stack, this will be memory for the preceding privilege stack. For a supervisor thread, this will be some memory in another thread's stack buffer. We will need a way to gracefully handle the first stack, we might need a guard-sized area of dead memory, but just one.

I agree with the idea. The first stack is not the only problem: we need to make sure the guards are of sufficient size and respect the alignment requirements of the MPU architecture.

@ioannisg
Member Author

ioannisg commented Mar 22, 2019

The change is to forbid Supervisor -> Supervisor thread stack access. We can position the ISR/stack guard in such a way to still allow ISRs to write to stacks

I agree with refactoring the code to forbid Supervisor->Supervisor thread stack access.
But I cannot immediately see how we could allow ISRs to write to the "area" we use as a privileged guard, i.e. right below the current privileged stack. Any ideas on this, @andrewboie?

@ioannisg
Member Author

Privilege stack generation code will note which stacks have been declared supervisor-only, and not generate a privilege stack for them.

This is, of course, OK, but it is not the only thing we need to do; we also need to not enforce the special alignment requirements for these threads.

@andrewboie
Contributor

But I cannot immediately see how we could allow an ISRs to write to the "area" we use as a privileged guard, i.e. right below the current privileged stack. Any ideas on this, @andrewboie ?

ISRs run on the interrupt stack and not the thread stack. Separate memory area.

@ioannisg
Member Author

ISRs run on the interrupt stack and not the thread stack. Separate memory area.

I am confused here. ISRs run on the interrupt stack, yes, but they might get pointers to objects defined in the stacks of threads, no? If ISRs only access global kernel data, then, I agree, we are good. But otherwise, I don't see how we can put supervisor stacks as read only...

@andrewboie
Contributor

I am confused here. ISRs run on the interrupt stack, yes, but they might get pointers to objects defined in the stacks of threads, no? If ISRs only access global kernel data, then, I agree, we are good. But otherwise, I don't see how we can put supervisor stacks as read only...

Oh, I see. What I was thinking was, if we can get wider agreement that it's OK to ban supervisor->supervisor thread access, to arrange the IRQ stack in memory so that its guard area would not intersect any thread stack. Treat it as a special case.

andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Apr 5, 2019
This macro is slated for complete removal, as it's not possible
on arches with an MPU stack guard to know the true buffer bounds
without also knowing the runtime state of its associated thread.

As removing this completely would be invasive to where we are
in the 1.14 release, demote to a private kernel Z_ API instead.
The current way that the macro is being used internally will
not cause any undue harm, we just don't want any external code
depending on it.

The final work to remove this (and overhaul stack specification in
general) will take place in 1.15 in the context of zephyrproject-rtos#14269

Fixes: zephyrproject-rtos#14766

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
nashif pushed a commit that referenced this issue Apr 5, 2019
@nashif nashif removed this from the future milestone May 21, 2019
@ioannisg
Member Author

Downgrading this to low priority. Basically, this only results in some RAM waste on ARM, with no serious impact otherwise. It leads to some inconsistencies, yes, but they are not critical.
