sys_heap: a new/simpler/faster memory allocator #17628
Conversation
The existing mem_pool implementation has been an endless source of frustration. It's had alignment bugs, it's had racy behavior. It's never been particularly fast. It's outrageously complicated to configure statically. And while its fragmentation resistance and overhead on small blocks is good, its space efficiency has always been very poor due to the four-way buddy scheme.

This patch introduces sys_heap. It's a more or less conventional segregated fit allocator with power-of-two buckets. It doesn't expose its level structure to the user at all, simply taking an arbitrarily aligned pointer to memory. It stores all metadata inside the heap region. It allocates and frees by simple pointer and not block ID. Static initialization is trivial, and runtime initialization is only a few cycles to format and add one block to a list header.

It has excellent space efficiency. Chunks can be split arbitrarily in 8 byte units. Overhead is only four bytes per allocated chunk (eight bytes for heaps >256kb or on 64 bit systems), plus a log2-sized array of 2-word bucket headers. No coarse alignment restrictions on blocks; they can be split and merged (in units of 8 bytes) arbitrarily.

It has good fragmentation resistance. Freed blocks are always immediately merged with adjacent free blocks. Allocations are attempted from a sample of the smallest bucket that might fit, falling back rapidly to the smallest block guaranteed to fit. Split memory remaining in the chunk is always returned immediately to the heap for other allocation.

It has excellent performance with firmly bounded runtime. All operations are constant time (though there is a search of the smallest bucket that has a compile-time-configurable upper bound; setting this to extreme values results in an effectively linear search of the list), objectively fast (about a hundred instructions) and amenable to locked operation. No more need for fragile lock relaxation trickery.

It also contains an extensive validation and stress test framework, something that was sorely lacking in the previous implementation.

Note that sys_heap is not a compatible API with sys_mem_pool and k_mem_pool. Wrappers for those (now-)legacy APIs appear later in this patch series.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
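For readers skimming the design, here is a small standalone sketch (illustrative only, not the patch's code) of the two ideas the log describes: 8-byte chunk units with a small per-chunk header, and power-of-two buckets indexed by the log2 of the chunk count:

```
#include <stdint.h>
#include <stddef.h>

#define CHUNK_UNIT 8	/* chunks are split and merged in 8-byte units */

/* Bytes to chunk count, including the 4-byte header used on small
 * (32-bit, <256kb) heaps.
 */
static inline size_t bytes_to_chunks(size_t bytes)
{
	return (bytes + 4 + CHUNK_UNIT - 1) / CHUNK_UNIT;
}

/* Bucket i holds free chunks whose size in chunk units lies in
 * [2^i, 2^(i+1)), so the bucket index is just the log2 of the size.
 * (chunks must be >= 1 here.)
 */
static inline int bucket_index(size_t chunks)
{
	return 31 - __builtin_clz((uint32_t)chunks);
}
```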
Use the white box validation and test rig added as part of the sys_heap work. Add a layer that puts hashed cookies into the blocks to detect corruption, check the validity state after every operation, and enumerate a few different usage patterns:

+ Small heap, "real world" allocation where the heap is about half full and most allocations succeed.
+ Small heap, "fragmentation runaway" scenario where most allocations start failing, but the heap must remain consistent.
+ Big heap. We can't test this with the same exhaustive coverage (many re/allocations for every byte of storage) for performance reasons, but we do what we can.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
/* Note: the init_mem/bytes fields are for the static initializer to
 * have somewhere to put the arguments. The actual heap metadata at
Hm... forgot to remove this bit. Ignore this comment and those fields. I yanked the static initializer from this version entirely because the semantics of SYS_MEM_POOL_DEFINE are a little too weird to wrap anyway.
Found the following issues, please fix and resubmit: checkpatch issues
OK. I have a truckload of comments but overall I really like it.
Let's see how I can convey that to github.
Review part 1.
lib/os/heap.h:
```
struct z_heap {
	u64_t *buf;
	struct z_heap_bucket *buckets;
	u32_t len;
	u32_t size_mask;
	u32_t chunk0;
	u32_t avail_buckets;
};
```
The buf field is always equivalent to the address of this structure
instance and is redundant. You could do:
```
static inline u64_t *heap_buf(struct z_heap *h) { return (u64_t *)h; }
```
and spare one pointer.
Also `struct z_heap_bucket *buckets` could be moved last and turned into
a zero-length array which would save another pointer. There is no need
to chunk-align it.
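A minimal sketch of that layout (types as in the quoted header; whether `buckets` can really become a trailing flexible array depends on how the metadata is placed in the buffer):

```
struct z_heap {
	u32_t len;
	u32_t chunk0;
	u32_t avail_buckets;
	struct z_heap_bucket buckets[];	/* sized at init time, follows the struct */
};

static inline u64_t *heap_buf(struct z_heap *h)
{
	/* the metadata lives at the start of the heap buffer itself */
	return (u64_t *)h;
}
```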
You may also get rid of size_mask. I suggest that you do:
```
static inline chunkid_t used(struct z_heap *h, chunkid_t c)
{
	return (chunk_field(h, c, SIZE_AND_USED) & 1) != 0;
}

static ALWAYS_INLINE chunkid_t size(struct z_heap *h, chunkid_t c)
{
	return chunk_field(h, c, SIZE_AND_USED) >> 1;
}

static inline void chunk_set_used(struct z_heap *h, chunkid_t c,
				  bool used)
{
	chunk_set(h, c, SIZE_AND_USED,
		  (size(h, c) << 1) | (used ? 1 : 0));
}
```
This will dispense with size_mask and generate simpler code. The
`ALWAYS_INLINE` might then become unnecessary.
Then, to avoid mistakes, the setting of the size should use an accessor
too.
Also, `used()` should return a bool.
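For the size setter, something like this might do (a sketch reusing the accessors above, not the patch's actual API):

```
static inline void chunk_set_size(struct z_heap *h, chunkid_t c, size_t sz)
{
	/* keep the used flag in bit 0 and store the size above it */
	chunk_set(h, c, SIZE_AND_USED, (sz << 1) | (used(h, c) ? 1 : 0));
}
```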
```
static inline bool big_heap(struct z_heap *h)
{
return sizeof(size_t) > 4 || h->len > 0x7fff;
}
```
Here `sizeof(size_t) > 4` is wrong. That should be `sizeof(void *) > 4`
as there is no guarantee that size_t is always 64-bit wide on 64-bit
targets. Would also be a good idea to comment that this is used to
enforce 64-bit alignment on returned memory.
As an optimization, it might be a good idea to add a Kconfig option for
allowing heaps larger than 0x7fff on 32-bit targets (with a default to
y). When unset, `big_heap()` would always return false and the
conditional code in `chunk_field()` and `chunk_set()` would get
optimized away at compile time. No need to make such Kconfig prompt
visible on 64-bit as `big_heap()` is always true in that case.
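A hedged sketch of what the gated version could look like (the Kconfig symbol name is invented for illustration):

```
static inline bool big_heap(struct z_heap *h)
{
	if (!IS_ENABLED(CONFIG_SYS_HEAP_BIG_HEAPS)) {	/* hypothetical option */
		return sizeof(void *) > 4;
	}
	return sizeof(void *) > 4 || h->len > 0x7fff;
}
```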
I'd suggest renaming `chunk_field()` to `chunk_get()` so as to be analogous
to `chunk_set()`.
Maybe `size()` and `left_size()` should return `size_t` not `chunkid_t`.
In `chunksz()` I'd suggest using `ROUND_UP(bytes, CHUNK_UNIT)` or even
use that directly in the code in place of `chunksz()` to make the code
more obvious.
About `bucket_idx()`: there is a `u32_count_leading_zero()` provided by
math_extras_impl.h that probably ought to be used instead of
`__builtin_clz()`. Same deal for the `__builtin_ctz()` used elsewhere.
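For instance, `bucket_idx()` might become something like the following. I'm assuming the helper is spelled `u32_count_leading_zeros()` and is exposed via `sys/math_extras.h`, so check the actual header before copying:

```
#include <sys/math_extras.h>

static inline int bucket_idx(struct z_heap *h, size_t sz)
{
	/* log2 of the chunk size: index of the highest set bit */
	return 31 - u32_count_leading_zeros((u32_t)sz);
}
```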
That's it for now. Part 2 to come later.
Yeah, there's some space for optimization for sure. The "buf" and "buckets" pointers are in the struct still because it started out as a separate struct and moved into the heap memory late in the process.

With the precomputed size_mask, that was definitely a win when I looked at it originally, though that may have changed. I think the bigger problem there is that the bit is on the wrong field; the LEFT_SIZE field is used much more infrequently and should be the one polluted with extra ops. (The reason is that they both had high bits at one point in an abandoned misdesign.)

Most of the rest of the notes make sense. Will address when I get back to this. Though the first priority here is to get the mem_pool wrapper work (in some form -- it's looking more complicated than I thought to make it switchable between backends) up.

(And honestly, the timer work in #17155 has a much better chance of landing for 2.0 and I really need to be doing that first. It was only the fear of endlessly codereviewing mem_pool bugs that scared me into getting this into code so soon...)
Review part 2.
lib/os/heap.c:
```
static void *chunk_mem(struct z_heap *h, chunkid_t c)
{
	u8_t *ret = ((u8_t *)&h->buf[c]) + chunk_header_bytes(h);
	CHECK(!(((size_t)ret) & (big_heap(h) ? 7 : 3)));
	return ret;
}
```
That `size_t` should be `uintptr_t`.
```
static void free_list_remove(struct z_heap *h, int bidx,
			     chunkid_t c)
{
	struct z_heap_bucket *b = &h->buckets[bidx];

	CHECK(!used(h, c));
	CHECK(b->next != 0);
	CHECK(b->list_size > 0);
	CHECK((((h->avail_buckets & (1 << bidx)) == 0)
	       == (h->buckets[bidx].next == 0)));
```
The 4th check looks wrong. The `h->buckets[bidx].next` is the same as
`b->next` which was just tested not to be 0. So that should probably be
just `CHECK(h->avail_buckets & (1 << bidx));`.
In `free_list_add()` the last check is:
```
CHECK(h->avail_buckets & (1 << bucket_idx(h, size(h, c))));
```
The `bucket_idx()` is quite costly on CPUs that lack a CLZ instruction.
Would be better to simply use `b` here not to make validation checks
more costly than necessary.
Also, `b` is used as a `struct z_heap_bucket *` in other functions, so
that would make the code clearer if `b` was renamed to `bi` or `bidx` here.
```
static ALWAYS_INLINE bool last_chunk(struct z_heap *h, chunkid_t c)
{
return (c + size(h, c)) == h->len;
}
```
I'd suggest adding a `CHECK(c + size(h, c) <= h->len)` here.
In `split_alloc()` there is:
```
if (rem >= (big_heap(h) ? 2 : 1)) {
```
That's the second time those magic 2 and 1 values are used, this time
with no justification. Would be best to abstract that into some
`min_chunksize()` or the like.
The open coded `c != h->chunk0` could be abstracted into some
`!first_chunk(h, c)` for symmetry.
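Sketched out, those helpers might look like this (names as proposed above, bodies inferred from the quoted snippets):

```
/* Smallest legal chunk, in CHUNK_UNIT units: big heaps need two header
 * words, small heaps one.
 */
static inline size_t min_chunksize(struct z_heap *h)
{
	return big_heap(h) ? 2 : 1;
}

static inline bool first_chunk(struct z_heap *h, chunkid_t c)
{
	return c == h->chunk0;
}
```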
In `sys_heap_alloc()`:
```
	for (int i = 0; i < loops; i++) {
		CHECK(b->next != 0);
		if (size(h, b->next) >= sz) {
			return split_alloc(h, bi, sz);
		} else {
			b->next = free_next(h, b->next);
		}
	}
```
It might be a good idea to remember where the loop started and break
early when there are fewer chunks than the loop count. And perhaps
remember the best fit as well given the loop count is quite small.
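One possible shape for that (purely a sketch; `split_alloc_chunk()` is a hypothetical variant of `split_alloc()` that takes the chosen chunk explicitly):

```
	chunkid_t start = b->next, best = 0;
	bool found = false;

	for (int i = 0; i < loops; i++) {
		chunkid_t c = b->next;

		/* remember the tightest fit seen so far */
		if (size(h, c) >= sz && (!found || size(h, c) < size(h, best))) {
			best = c;
			found = true;
		}
		b->next = free_next(h, c);
		if (b->next == start) {
			break;	/* fewer chunks in the list than the loop bound */
		}
	}
	if (found) {
		return split_alloc_chunk(h, bi, best, sz);
	}
```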
```
void sys_heap_init(struct sys_heap *heap, void *mem, size_t bytes)
{
	/* Must fit in a 32 bit count of u64's */
#if __SIZEOF_SIZE_T__ > 4
	CHECK(bytes < 0x800000000ULL);
#endif
```
Might be best to use `if (sizeof(size_t) > 4)` here to be more
"standard" code. The compiler will optimize it away on 32-bit targets
anyway, but it'll validate the syntax in all cases.
The rest of the function has potential for simplifications following
comments in my part 1.
On Thu, 18 Jul 2019, Andy Ross wrote:
With the precomputed size_mask, that was definitely a win when I
looked at it originally, though that may have changed. I think the
bigger problem there is that the bit is on the wrong field, the
LEFT_SIZE field is used much more infrequently and should be the one
polluted with extra ops. (The reason is that they both had high bits
at one point in an abandoned misdesign).
Agreed on using LEFT_SIZE. However it is more efficient if bits are to
be found at a constant offset which is achieved by moving the flag bit
down.
Most of the rest of the notes make sense. Will address when I get
back to this. Though the first priority here is to get the mem_pool
wrapper work (in some form -- it's looking more complicated than I
thought to make it switchable between backends) up.
What about something like:
```
#define K_MEM_HEAP_DEFINE(name, sz) \
char __aligned(CHUNK_UNIT) _heap_buf_##name[ROUND_UP(sz, CHUNK_UNIT)]; \
struct z_heap name ALIAS_OF(_heap_buf_##name); \
Z_STRUCT_SECTION_ITERABLE(k_heap_init, _heap_init_##name) = { \
.mem = _heap_buf_##name, \
.size = ROUND_UP(sz, CHUNK_UNIT), \
}
#define K_MEM_POOL_DEFINE(name, minsz, maxsz, nmax, align) \
__K_MEM_POOL_HEAP_DEFINE(name, ROUND_UP(maxsz, CHUNK_UNIT) * (nmax))
#define __K_MEM_POOL_HEAP_DEFINE(name, size) \
K_MEM_HEAP_DEFINE(name, \
(size) + sizeof(struct z_heap) + HEAP_BUCKET_SIZE(size))
#define HEAP_BUCKET_SIZE(size) \
(const_ilog2((8 + (size)) / CHUNK_UNIT) * sizeof(struct z_heap_bucket))
#define const_ilog2(x) \
( \
(x) < 2 ? 0 : \
(x) < 4 ? 1 : \
(x) < 8 ? 2 : \
[...]
```
Then you only need to initialize the heap at run time.
(And honestly, the timer work in #17155 has a much better chance of
landing for 2.0 and I really need to be doing that first. It was only
the fear of endlessly codereviewing mem_pool bugs that scared me into
getting this into code so soon...)
Hey, I reviewed and identified all the remaining bugs in that code for
you. So you shouldn't worry that much. ;-)
I apologize, but at the moment I have no chance to do detailed code review. Instead, I would like to point to some high-level aspects:
- We are starting to see SoCs with data caches in Zephyr. Since we are doing a new allocator, we should include this aspect in the design. The minimum requirement here is to allocate chunks which are aligned to the cache line size (usually 16, 32, or 64 bytes) both in address and size. Ideally such configuration should be done per heap, as this is necessary only for memory which will be later accessed through DMA.
- IMHO we should create some debugging infrastructure which allows users to detect common errors related to memory allocation. I suggest the following:
  - We should check if the pointer being freed is the pointer returned by sys_heap_alloc() and belongs to the same heap (this should be easy to do using pointer arithmetic, checking the alignment and the data structure describing the given chunk).
  - We should detect double-free scenarios.
  - We should add head and tail guard areas to check for access beyond the allocated region; see the sketch after this list. [ For those not familiar with the concept: the memory below and above an allocated chunk is filled with a distinctive pattern (like 0xDEADC0DE) which is checked periodically and/or during free. If the pattern is altered, you wrote out of bounds. If the pattern propagated somewhere and caused a crash, you know it originates from a read beyond the allocated area. ]
  - We should support data trashing to detect use-after-free scenarios. [ For those not familiar with the concept: the freed memory is filled with a distinctive pattern (different from the one used in guards). If you see the pattern somewhere, you will know its origin. The pattern is also verified periodically and/or during allocation, so a write to the freed region can be detected. ]
  - Ideally we should be able to trace the allocation point of each chunk (file, line, function). Such information, together with a CLI interface for heap state inspection, would be a great tool in fighting memory leaks. Moreover, the CLI interface dumping the state of the heap could also be helpful during development.
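As a rough illustration of the guard/poison bullets above (patterns, sizes, and function names are all made up, not a proposed API):

```
#define GUARD_PATTERN  0xDEADC0DEu
#define FREED_PATTERN  0xFEEDFACEu
#define GUARD_WORDS    2

/* Fill the words just below and above the user region at allocation time. */
static void guard_fill(u32_t *lo_guard, u32_t *hi_guard)
{
	for (int i = 0; i < GUARD_WORDS; i++) {
		lo_guard[i] = GUARD_PATTERN;
		hi_guard[i] = GUARD_PATTERN;
	}
}

/* Verify the guards at free time (or periodically); false means an overrun. */
static bool guard_check(const u32_t *lo_guard, const u32_t *hi_guard)
{
	for (int i = 0; i < GUARD_WORDS; i++) {
		if (lo_guard[i] != GUARD_PATTERN || hi_guard[i] != GUARD_PATTERN) {
			return false;
		}
	}
	return true;
}

/* Trash freed memory so use-after-free reads and writes become detectable. */
static void poison_freed(void *mem, size_t bytes)
{
	u32_t *p = mem;

	for (size_t i = 0; i < bytes / sizeof(u32_t); i++) {
		p[i] = FREED_PATTERN;
	}
}
```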
 *
 * @param h Heap from which to allocate
 * @param bytes Number of bytes requested
 * @return Pointer to memory the caller can now use
IMHO it would be good to add information about alignment of allocated memory.
normal malloc returns a pointer that is "suitably aligned for any kind of variable". I think we should too, otherwise there will be problems
 * @param free Callback to perform a free of a pointer returned from
 *             @a alloc. Passes back the @a arg parameter as a
 *             context handle.
 * @param arg Context handle to pass back to the callbacks
I would change `arg` to `context`.
On Fri, 19 Jul 2019, Piotr Zięcik wrote:
- We are starting to see SoCs with data caches in Zephyr. Since we are doing a new allocator, we should include this aspect in the design. The minimum requirement here is to allocate chunks which are aligned to the cache line size (usually 16, 32, or 64 bytes) both in address and size. Ideally such configuration should be done per heap, as this is necessary only for memory which will be later accessed through DMA.
That'll be very easy to do. Right now this is hardcoded to CHUNK_UNIT
but that can be made into a variable later.
The debugging infrastructure you suggest can also be added on top of
this design without major issues. Guard patterns can be factored in by
`chunk_header_bytes()` etc. But as there is no such facility in the
existing allocator at the moment anyway, I'd suggest this be done
separately.
	bool emptybit = (h->avail_buckets & (1 << bidx)) == 0;
	bool emptylist = b->next == 0;
	bool emptycount = b->list_size == 0;
	bool empties_match = emptybit == emptylist && emptybit == emptycount;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parentheses are not needed here, but they are very good to have. They state your true intentions rather than leaving them to be guessed in the future.
First comments. The biggest problem may be with locking. The former scheme assumed the pool maintains the lock, so it can unlock IRQs while in loops. Now we would have to lock the entire operation at the k_malloc level.
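For illustration, locking the whole operation at the k_malloc level could be as simple as the following sketch (the wrapper name is hypothetical, and the real k_malloc also has to keep its own header bookkeeping):

```
static struct k_spinlock heap_lock;

void *k_malloc_on_heap(struct sys_heap *h, size_t bytes)
{
	k_spinlock_key_t key = k_spin_lock(&heap_lock);
	void *mem = sys_heap_alloc(h, bytes);

	k_spin_unlock(&heap_lock, key);
	return mem;
}
```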
 * sys_heap_alloc such that it can be used for other purposes. The
 * caller must not use the memory region after entry to this function.
 *
 * @note The sys_heap implementation is not internally synchronized.
That would mean the lock should be taken by k_malloc.
 * running one and corrupting it. YMMV.
 */
static size_t max_chunkid(struct z_heap *h) |
`const struct z_heap *h` when `h` is read-only?
also in other places
	return ret;
}
static void free_list_remove(struct z_heap *h, int bidx, |
`size_t bidx` (or `unsigned int`) as this cannot be negative.
	b->list_size--;
	if (b->list_size == 0) {
		h->avail_buckets &= ~(1 << bidx);
`BIT(bidx)` instead of `(1 << bidx)`, or better, `WRITE_BIT(h->avail_buckets, bidx, 0);`
also in other places
{
	int b = bucket_idx(h, size(h, c));
if (h->buckets[b].list_size++ == 0) { |
It works but mixing increment with other operators in an expression is very bug prone.
{
	/* Must fit in a 32 bit count of u64's */
#if __SIZEOF_SIZE_T__ > 4
	CHECK(bytes < 0x800000000ULL);
`bytes <= UINT32_MAX`, `bytes <= INT32_MAX`?
why `0x800000000ULL`?
 * Note: this cannot catch every possible error, but if it returns
 * true then the heap is in a consistent state and can correctly
 * handle any sys_heap_alloc() request and free any live pointer
 * returned from a previou allocation.
previou
 * failures and a very fragmented heap.
 * @param result Struct into which to store test results.
 */
void sys_heap_stress(void *(*alloc)(void *arg, size_t bytes),
I wonder if this is a good place to have that API; after all, it is an API for testing (same with `sys_heap_validate`). What about having `sys_heap_validate.h` or something similar and not spoiling this header?
#define CHUNK_UNIT 8

enum chunk_fields { SIZE_AND_USED, LEFT_SIZE, FREE_PREV, FREE_NEXT };
some explanation here would be good.
	return sizeof(size_t) > 4 || h->len > 0x7fff;
}

static inline size_t chunk_field(struct z_heap *h, chunkid_t c,
`chunk_field_get`?
	}
}

static inline void chunk_set(struct z_heap *h, chunkid_t c,
`chunk_field_set`?
	}
}

static void free_list_add(struct z_heap *h, chunkid_t c)
Function explanation is missing. From what it is doing I would rather call it `add_to_bucket`.
	return ret;
}

static void free_list_remove(struct z_heap *h, int bidx,
Comment missing; also, I would call it `remove_from_bucket`.
	return (chunk_field(h, c, SIZE_AND_USED) & ~h->size_mask) != 0;
}

static ALWAYS_INLINE chunkid_t size(struct z_heap *h, chunkid_t c)
I would like to see setters and getters for all those fields. Right now it is inconsistent: there are getters, but fields are set using `chunk_set`, which is a bit confusing in the heap.c code.
	free_list_add(h, c);
}

void *sys_heap_alloc(struct sys_heap *heap, size_t bytes)
I wonder if it would make sense to have `loops` as an argument instead of a Kconfig option. It would allow the use of wrappers like `alloc_from_isr` with a lower value and `alloc` which favors lower fragmentation over performance.
 */
	int loops = MIN(b->list_size, CONFIG_SYS_HEAP_ALLOC_LOOPS);

	for (int i = 0; i < loops; i++) {
Actually, I don't fully understand why this loop is needed. Since you have the `avail_buckets` mask, you only need to pick the first bucket which has elements available and a chunk size equal to or bigger than the requested size. It seems that the code below this for loop is doing exactly that.
OK, now I see it: a bucket contains chunks with sizes between 2^n and 2^(n+1), so there is a chance to find a proper candidate. In that case maybe there should be an option to tell the function to check all chunks in the smaller bucket before going to the bigger bucket.
@andyross can we resurrect this PR and get it in for 2.1?
I have some cycles to spare at the moment. So if @andyross agrees I'd be
happy to work on this and bring it forward.
We discussed this in a recent meeting and the idea was to try and move it forward as soon as possible, so +1 from me.
On Mon, 30 Sep 2019, Carles Cufí wrote:
We discussed this in a recent meeting and the idea was to try and move it forward as soon as possible, so +1 from me.
I went ahead with it. I'll submit patches this week.
I should be able to get back on this within a week or two, but no complaints if @npitre wants to play with it. I have some early work on an alloc_aligned implementation that doesn't require changes to the heap internals (just finds a bigger block and fragments it). Also we need to add some latency tests to track ISR-friendly performance. And the portability side is the hardest. I see three paths:
On Mon, 30 Sep 2019, Andy Ross wrote:
I have some early work on an an alloc_aligned implementation that
doesn't require changes to the heap internals (just finds a bigger
block and fragments it).
Good. That's also what I had in mind but didn't get around to
implementing it.
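For reference, the naive way to get aligned allocations on top of an unmodified allocator is to over-allocate and stash the original pointer; the chunk-splitting approach described above avoids wasting the slack, but needs access to the heap internals. A sketch of the naive version (function names are illustrative, and `align` must be a power of two):

```
void *heap_alloc_aligned(struct sys_heap *h, size_t align, size_t bytes)
{
	/* over-allocate: room for the alignment slack plus one stashed pointer */
	void *base = sys_heap_alloc(h, bytes + align + sizeof(void *));

	if (base == NULL) {
		return NULL;
	}

	uintptr_t addr = (uintptr_t)base + sizeof(void *);

	addr = (addr + align - 1) & ~(align - 1);
	((void **)addr)[-1] = base;	/* remember the real allocation */
	return (void *)addr;
}

void heap_free_aligned(struct sys_heap *h, void *mem)
{
	if (mem != NULL) {
		sys_heap_free(h, ((void **)mem)[-1]);
	}
}
```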
And the portability side is the hardest. I see three paths:
1. Just use this as a new API and keep the mempool one around for
existing users until we decide to deprecate it.
Long (or maybe not even so long) term this is certainly the way to go if
the new allocator proves to be superior.
2. Port only the IPC side of the APIs (the struct k_mem_block
abstraction) to sit on top of this for the benefit of our (somewhat
oddball) APIs that use it.
3. Provide a full implementation (or as full as possible) of
sys_mem_pool and k_mem_pool using this as a backend. Not 100%
doable because the old code has some tested promises on alignment
and packing behavior that aren't true for generic heaps.
I implemented 2 and 3 at this point, although I still have to test it.
Alignment is simply ignored for now. I still have to look if some user
actually cares. When alignment is important then the slab
allocator is usually more suitable.
IIRC there are tests that do things like check that sequential allocations from an empty heap are indeed split the way it expects from the original implementation. I don't personally think those are behaviors worth preserving, but we need to find them and call them out explicitly when removing.
This got too stale. Resubmitted in #23941
This is the new heap implementation and test rig I've been talking about. See commit logs and comments for design notes.
Specifically note that the heap code itself is dirt simple (mostly, once you get your head around the field accessor API), and that more than half the added code here is in the validation and test rigs. @andrewboie will be happy to see that coverage on the new files is effectively 100% (there's some cleanup to do with some of the validation lines, mostly "return false;", which obviously don't fail and don't get hit).
It passes everywhere right now except for riscv32 and nios2 (no idea yet, haven't had a chance to look).
This is just the new code. I still need to finish up wrapper APIs for sys/k_mem_pool before this can go in. I was hoping to have that done today, but it might stretch another day. And even then we'll likely leave both implementations in place and make them switchable via kconfig for a version.
On top of those, I should also turn the existing stress rig into a proper benchmark and add that so we can track performance with a real usage simulator.