-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move the WorkerLocal type from the rustc-rayon fork into rustc_data_structures #107782
Conversation
r? @jackh726 (rustbot has picked a reviewer for you, use r? to override) |
} | ||
|
||
// Create a dummy registry to allow `WorkerLocal` construction. | ||
// We use `OnceCell` so we only register one dummy registry per thread. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of hacky due to #101313 now using WorkerLocal
outside the Rayon thread pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Atomic adds are very cheap, reverting it may even be a performance improvement, so I think that's the best option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually changing the PR
to use AttrIdGenerator(AtomicU32)
is nicer, as it avoids the global which is kind of incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. But I think there will be some regression under single thread.
|
||
/// Gets the registry associated with the current thread. Panics if there's no such registry. | ||
pub fn current() -> Self { | ||
REGISTRY.with(|registry| registry.get().cloned().expect("No assocated registry")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use registry.get_or_init(Registry::new(1))
here? If users don't explicitly call Registry::register()
, it means they wouldn't like to use functions related to thread_index
(in other word thread_index
is always 0), and it is reasonable to limit thread_limit to 1 at this time.
In this case we can get rid of the hack code below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's still quite hacky. I'd prefer threads to explicitly opt in to using WorkerLocal
. You could easily and up mixing up registries if there is multiple in use and causing panics.
Hmm... do you expect there to be more avenues for perf improvements here that cannot be done upstream? If we can avoid having more code to maintain that seems preferrable over a 0.2% improvement. |
|
@cuviper Is this something you'd want in While |
Just for reference, the performance is orthogonal to the code's location. |
The advantage to integration would just be avoiding the shadow-registry, right?
Yeah, it doesn't seem like a great fit that way, but I also can't think of how it could be any different. |
Yeah. |
/// registry. | ||
/// | ||
/// Note that there's a race possible where the identifer in `THREAD_DATA` could be reused | ||
/// so this can succeed from a different registry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the consequences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function doesn't panic. The public WorkerLocal
type does prevents that race though.
fn deref(&self) -> &T { | ||
// This is safe because `verify` will only return values less than | ||
// `self.registry.thread_limit` which is the size of the `self.locals` array. | ||
unsafe { &self.locals.get_unchecked(self.registry.id().verify()).0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is get_unckecked
really needed? verify
will access TLS in any case, so will dominate the perf effect, won't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reduces the panic paths from 2 to 1. TLS accesses are typically cheaper than panics. This is also not an additional proof obligation, as not only must the index be inbounds, it must refer to the correct thread.
☔ The latest upstream changes (presumably #107989) made this pull request unmergeable. Please resolve the merge conflicts. |
Factor query arena allocation out from query caches This moves the logic for arena allocation out from the query caches into conditional code in the query system. The specialized arena caches are removed. A new `QuerySystem` type is added in `rustc_middle` which contains the arenas, providers and query caches. Performance seems to be slightly regressed: <table><tr><td rowspan="2">Benchmark</td><td colspan="1"><b>Before</b></th><td colspan="2"><b>After</b></th></tr><tr><td align="right">Time</td><td align="right">Time</td><td align="right">%</th></tr><tr><td>🟣 <b>clap</b>:check</td><td align="right">1.8053s</td><td align="right">1.8109s</td><td align="right"> 0.31%</td></tr><tr><td>🟣 <b>hyper</b>:check</td><td align="right">0.2600s</td><td align="right">0.2597s</td><td align="right"> -0.10%</td></tr><tr><td>🟣 <b>regex</b>:check</td><td align="right">0.9973s</td><td align="right">1.0006s</td><td align="right"> 0.34%</td></tr><tr><td>🟣 <b>syn</b>:check</td><td align="right">1.6048s</td><td align="right">1.6051s</td><td align="right"> 0.02%</td></tr><tr><td>🟣 <b>syntex_syntax</b>:check</td><td align="right">6.2992s</td><td align="right">6.3159s</td><td align="right"> 0.26%</td></tr><tr><td>Total</td><td align="right">10.9664s</td><td align="right">10.9922s</td><td align="right"> 0.23%</td></tr><tr><td>Summary</td><td align="right">1.0000s</td><td align="right">1.0017s</td><td align="right"> 0.17%</td></tr></table> Incremental performance is a bit worse: <table><tr><td rowspan="2">Benchmark</td><td colspan="1"><b>Before</b></th><td colspan="2"><b>After</b></th></tr><tr><td align="right">Time</td><td align="right">Time</td><td align="right">%</th></tr><tr><td>🟣 <b>clap</b>:check:initial</td><td align="right">2.2103s</td><td align="right">2.2247s</td><td align="right"> 0.65%</td></tr><tr><td>🟣 <b>hyper</b>:check:initial</td><td align="right">0.3335s</td><td align="right">0.3349s</td><td align="right"> 0.41%</td></tr><tr><td>🟣 <b>regex</b>:check:initial</td><td align="right">1.2597s</td><td align="right">1.2650s</td><td align="right"> 0.42%</td></tr><tr><td>🟣 <b>syn</b>:check:initial</td><td align="right">2.0521s</td><td align="right">2.0613s</td><td align="right"> 0.45%</td></tr><tr><td>🟣 <b>syntex_syntax</b>:check:initial</td><td align="right">7.8275s</td><td align="right">7.8583s</td><td align="right"> 0.39%</td></tr><tr><td>Total</td><td align="right">13.6832s</td><td align="right">13.7442s</td><td align="right"> 0.45%</td></tr><tr><td>Summary</td><td align="right">1.0000s</td><td align="right">1.0046s</td><td align="right"> 0.46%</td></tr></table> It does seem like LLVM optimizers struggle a bit with the current state of the query system. Based on top of rust-lang#107782 and rust-lang#107802. r? `@cjgillot`
Sorry for the slow review @Zoxc. Is that PR enough to get rid of the rustc-rayon fork, or is there more waiting there? |
I have no plans to upstream anything in the |
☔ The latest upstream changes (presumably #109791) made this pull request unmergeable. Please resolve the merge conflicts. |
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit efe7cf4 with merge b407b81070e680b8097a5568108933cdc4f1331a... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (b407b81070e680b8097a5568108933cdc4f1331a): comparison URL. Overall result: no relevant changes - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis benchmark run did not return any relevant results for this metric. Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|
Discussed in the T-compiler triage meeting The members present were weakly in favor of this moving forward. (It was "weakly" in favor because in an ideal world we wouldn't have a fork of rayon, and likewise in an ideal world we would identify abstractions that the rayon-core is willing to adopt upstream, but since we are not in an ideal world, we will accept compromises.) |
@bors r+ |
☀️ Test successful - checks-actions |
1 similar comment
☀️ Test successful - checks-actions |
Finished benchmarking commit (c14882f): comparison URL. Overall result: ✅ improvements - no action needed@rustbot label: -perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesThis benchmark run did not return any relevant results for this metric. |
This PR moves the definition of the
WorkerLocal
type fromrustc-rayon
intorustc_data_structures
. This is enabled by the introduction of theRegistry
type which allows you to group up threads to be used byWorkerLocal
which is basically just an array with an per thread index. TheRegistry
type mirrors the one in Rayon and each Rayon worker thread is also registered with the newRegistry
. Safety forWorkerLocal
is ensured by having it keep a reference to the registry and checking on each access that we're still on the group of threads associated with the registry used to construct it.Accessing a
WorkerLocal
is micro-optimized due to it being hot since it's used for most arena allocations.Performance is slightly improved for the parallel compiler:
cc @SparrowLii