reuse allocation for always_storage_live_locals bitset #107927
Conversation
(rustbot has picked a reviewer for you, use r? to override)
Some changes occurred to MIR optimizations cc @rust-lang/wg-mir-opt
Some changes occurred to the CTFE / Miri engine cc @rust-lang/miri
// reuse allocation for bit set
always_live_locals_cache: GrowableBitSet<mir::Local>,
I don't know a good place for this; any suggestions?
This is probably the most global state you can add it to.
However, note that this InterpCx is re-created per constant, so if there are many constants to evaluate, each will still do its own thing. I am also concerned that we might keep large buffers around for longer than we have to.
@@ -1,11 +1,12 @@
use rustc_index::bit_set::BitSet;
use rustc_index::bit_set::GrowableBitSet;
use rustc_middle::mir::{self, Local};

/// The set of locals in a MIR body that do not have `StorageLive`/`StorageDead` annotations.
///
/// These locals have fixed storage for the duration of the body.
The comment should explain the unconventional signature for this function, and what the caller is supposed to pass for the `locals` argument.
Note that this is a very artificial stress-test benchmark. We shouldn't optimize for it blindly. Is there evidence that this helps for regular code as well? Out of all the call sites for this function, it seems like only the CTFE interpreter actually benefits from the optimization.

@bors try @rust-timer queue
⌛ Trying commit 9d71a6f with merge 8842f892720a42837501b09631b12efcaf14c904...

Cc @nnethercote for some compiler perf expertise

☀️ Try build successful - checks-actions
@@ -705,13 +710,19 @@ impl<'mir, 'tcx: 'mir, M: Machine<'mir, 'tcx>> InterpCx<'mir, 'tcx, M> {
let mut locals = IndexVec::from_elem(dummy, &body.local_decls);
This line accounts for ~60% of total allocated memory for ctfe-stress-5.
At first, I tried to cache it too, but the vector is stored into the frame later at `self.frame_mut().locals = locals;`, so I don't see an easy way to do that.
> This line is ~60% total allocated memory for ctfe-stress-5.
Yeah that test involves a lot of CTFE stack variables.
But I don't think this is worth optimizing for. This is totally unrealistic for real code. We shouldn't make rustc harder to maintain just for the sake of meaningless benchmarks -- that is not the purpose of these stress tests.
Finished benchmarking commit (8842f892720a42837501b09631b12efcaf14c904): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Looks like we are not getting a perf benefit from this even on the stress test. So... it looks like it doesn't actually help? Basically, what you did is write a small allocator for these IndexSets (with a simple one-element cache), and sure, that means the underlying jemalloc gets called less often -- but that doesn't actually seem to translate into any benefit; the cache isn't any faster than jemalloc. Or am I missing something?
No, the results look boring.
All right. Thanks for giving it a shot!

@RalfJung: is there a better benchmark we could be using for CTFE perf evaluation?

I am not sure what would be a benchmark that reflects real-world CTFE use. (The stress test is good to ensure that CTFE changes do not regress performance, as it will strongly overemphasize CTFE perf changes.) Maybe someone else in @rust-lang/wg-const-eval has an idea.
This reduces the allocated block count by about 30% for ctfe-stress-5.