Pool tracker vecs #5414
Conversation
Having these as global variables seems pretty bad and also very unnecessary here. It's unsafe and the pools can never be deallocated. It would already be a lot better to have these on the device or instance.
The implementation of those two pools themselves is likely not sound: transmuting a vec as done here is both scary and unnecessary; the pool could just have been generic over a given type.
Have you tried how performance plays out when you install a custom allocator in your application, e.g. https://docs.rs/mimalloc/latest/mimalloc/
Given that you modified only very few locations to use these pools, we should first explore other ways of doing local pooling, or of avoiding recreating these vecs altogether. Also, a breakdown of the effect of each of those sites would be helpful, to better understand which optimization is worth how much.
But thanks for trying to address this problem. I wonder whether, in this state, this shouldn't rather be an extensive issue.
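For illustration, the kind of type-generic, device-owned pool suggested above can be as small as a Mutex-guarded free list of vecs. This is just a sketch with made-up names, not wgpu's actual code:

```rust
use std::sync::Mutex;

/// Minimal, type-generic vec pool (illustrative sketch only).
/// Meant to live on the Device/Instance rather than in a global,
/// so it is dropped together with it.
pub struct VecPool<T> {
    free: Mutex<Vec<Vec<T>>>,
}

impl<T> VecPool<T> {
    pub fn new() -> Self {
        Self { free: Mutex::new(Vec::new()) }
    }

    /// Reuse a previously returned Vec (keeping its capacity) or allocate a fresh one.
    pub fn acquire(&self) -> Vec<T> {
        self.free.lock().unwrap().pop().unwrap_or_default()
    }

    /// Clear a Vec and hand it back so its allocation can be reused later.
    pub fn release(&self, mut v: Vec<T>) {
        v.clear();
        self.free.lock().unwrap().push(v);
    }
}
```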
That makes sense, I put them on the device. I had to thread some lifetimes around to make it work, but it is cleaner. (Just to note: I had to do the transmuting before because the HalApi was part of the type, so I couldn't create a static pool that included the type.) If you're happy with the approach I can do the same for render bundles as well.
I haven't. I'm on Windows, and I remember there being a general impression that non-standard allocators were always worse, but that could well be out of date now. I can run tests, but not immediately (my project is now on a fork of 19.3, and backporting this is not completely trivial).
thanks for following up with detailed traces <3
Wow 😮! I knew mimalloc is faster, but it's crazy that it makes that much of a difference. We should strongly consider putting that in the readme then. Despite the savings being small with mimalloc, I'm inclined to still put this in now that this iteration already looks so much better. The fact that putting elements in and out of the usage scope pool requires taking a lock is also irking me a lot, but it's held so briefly that it's unlikely to contend (🤞). Not sure if we already have some lockless queue/stack in place here somewhere that we could use instead 🤔. Let's not regard that as blocking this PR.
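For reference, if this does end up in the readme: the mimalloc crate's documented setup is just a global-allocator declaration (shown here only as a sketch, assuming `mimalloc` has been added to Cargo.toml):

```rust
use mimalloc::MiMalloc;

// Install mimalloc as the process-wide allocator; this is the standard usage
// from the mimalloc crate's documentation.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```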
Ah, too bad. Compelling case for the lifetime variant I reckon! Agreed, putting it on the device makes sense. One more thing I'd like to ask you to try is to not reference the entire device but just the pool itself.
yes, go ahead (:
done
It looks like this would require an Arc.
* suspect all the future suspects
* changelog
Looking really good now. Did a more in-depth review; a couple of small things to clean up, but then we can land.
RenderBundles seem to be long-lived repeatable executions, so allocating on creation is probably fine.
Ah yes, if a RenderBundle isn't reused for several frames we can well regard that as a user error, so I think we can leave it as is, imho.
otoh CommandBuffer uses Tracker which also allocates buffer/texture-sized vecs on instantiation, and would also need an Arc to pool
In any case I'd say we take that in a separate iteration. Not having even more Arc'ing would be nice 🤷, but yes, there shouldn't be all that many CommandBuffers per frame. Again, let's keep this to a separate iteration :)
thanks! 🚢
Description
When a large number of buffers/textures have ever been allocated, the cost of allocating vecs in `UsageScopes` becomes significant. With 30-40k peak buffers, I see a ~75% reduction (200us -> 50us) in the time for `CommandEncoder::begin_render_pass` from using pools to keep vecs allocated instead of reallocating each time.

This implementation is bad. In particular, it is unsound if multiple `HalApi`s are used in a single execution. There is probably a cleaner way to do this, maybe with a custom allocator? It also might be worthwhile to have a threshold below which we don't bother pooling, to preserve micro-benchmarks/low-usage apps (this change is likely to be negative for a small peak number of buffers/textures).
I'm happy to make changes to clean it up, or for someone else to take the observation and redo it from scratch.
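As a sketch of the threshold idea mentioned above (illustrative only, with a hypothetical cutoff and hypothetical names, assuming a Mutex-guarded free list like the pool itself): only recycle a vec once its capacity is large enough that reallocating it every frame would actually hurt.

```rust
use std::sync::Mutex;

/// Hypothetical cutoff: below this capacity, pooling isn't worth the lock traffic.
const POOLING_THRESHOLD: usize = 1024;

/// Return a vec to the free list only if keeping its allocation is likely to pay off;
/// small vecs are simply dropped, so low-usage apps and micro-benchmarks are unaffected.
fn release_if_large<T>(free: &Mutex<Vec<Vec<T>>>, mut v: Vec<T>) {
    if v.capacity() >= POOLING_THRESHOLD {
        v.clear();
        free.lock().unwrap().push(v);
    }
}
```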
Some xtask test failures currently, also failing on trunk (although that doesn't have the 1 slow one):
Testing
I've been running this on a large app. A small test case should generate a large number of buffers, discard them all, and then measure the cost of `CommandEncoder::begin_render_pass`.
Checklist
- Run `cargo fmt`.
- Run `cargo clippy`. If applicable, add:
  - `--target wasm32-unknown-unknown`
  - `--target wasm32-unknown-emscripten`
- Run `cargo xtask test` to run tests.
- Add change to `CHANGELOG.md`. See simple instructions inside file.