Move the WorkerLocal type from the rustc-rayon fork into rustc_data_structures #107782

Merged
merged 2 commits into from
Apr 27, 2023

@Zoxc Zoxc commented Feb 8, 2023

This PR moves the definition of the WorkerLocal type from rustc-rayon into rustc_data_structures. This is enabled by the introduction of the Registry type, which lets you group threads for use by WorkerLocal; WorkerLocal itself is essentially an array indexed by a per-thread index. The Registry type mirrors the one in Rayon, and each Rayon worker thread is also registered with the new Registry. Safety for WorkerLocal is ensured by having it keep a reference to the registry and checking on each access that we're still on one of the threads associated with the registry used to construct it.

Accessing a WorkerLocal is micro-optimized due to it being hot since it's used for most arena allocations.
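
As a rough illustration of the design described above, here is a minimal sketch in safe Rust. It is not the code from this PR: the names `RegistryData`, `THREAD_DATA`, `id`, and `into_inner`, the explicit `registry` argument to `WorkerLocal::new`, and the use of the allocation address as the registry's identity are all assumptions made for the example. The sketch also skips the micro-optimizations and the `unsafe` code, so it can only be shared across threads when `T` is already `Sync`.

```rust
use std::cell::Cell;
use std::ops::Deref;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the PR's `Registry`: a group of at most
/// `thread_limit` threads, each given a dense index when it registers.
#[derive(Clone)]
pub struct Registry(Arc<RegistryData>);

struct RegistryData {
    thread_limit: usize,
    registered: Mutex<usize>,
}

thread_local! {
    // (registry identity, thread index) for the registry this thread joined.
    static THREAD_DATA: Cell<Option<(usize, usize)>> = Cell::new(None);
}

impl Registry {
    pub fn new(thread_limit: usize) -> Self {
        Registry(Arc::new(RegistryData { thread_limit, registered: Mutex::new(0) }))
    }

    // Assumption: the allocation address serves as the registry's identity.
    fn id(&self) -> usize {
        Arc::as_ptr(&self.0) as usize
    }

    /// Adds the current thread to the group and returns its index.
    pub fn register(&self) -> usize {
        let mut registered = self.0.registered.lock().unwrap();
        let index = *registered;
        assert!(index < self.0.thread_limit, "thread limit reached");
        THREAD_DATA.with(|data| {
            assert!(data.get().is_none(), "thread already has a registry");
            data.set(Some((self.id(), index)));
        });
        *registered += 1;
        index
    }

    /// Returns the current thread's index, or panics if the thread is not
    /// registered with this particular registry.
    fn verify(&self) -> usize {
        match THREAD_DATA.with(|data| data.get()) {
            Some((id, index)) if id == self.id() => index,
            _ => panic!("thread is not registered with this registry"),
        }
    }
}

/// One value per registered thread, stored in a plain array.
pub struct WorkerLocal<T> {
    locals: Box<[T]>,
    // Keeping the registry alive means its identity cannot be reused
    // while this `WorkerLocal` exists.
    registry: Registry,
}

impl<T> WorkerLocal<T> {
    pub fn new(registry: &Registry, init: impl FnMut(usize) -> T) -> Self {
        let locals = (0..registry.0.thread_limit).map(init).collect();
        WorkerLocal { locals, registry: registry.clone() }
    }

    /// Consumes the container and returns every thread's value.
    pub fn into_inner(self) -> Vec<T> {
        self.locals.into_vec()
    }
}

impl<T> Deref for WorkerLocal<T> {
    type Target = T;

    fn deref(&self) -> &T {
        // Checked on every access: are we on a thread of the right registry?
        &self.locals[self.registry.verify()]
    }
}
```

In this sketch, a thread calls `registry.register()` once and can then dereference any `WorkerLocal` built from that registry; dereferencing from an unregistered thread panics in `verify`.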

Performance is slightly improved for the parallel compiler:

| Benchmark | Before (time) | After (time) | Change (%) |
|---|---:|---:|---:|
| 🟣 clap:check | 1.9992s | 1.9949s | -0.21% |
| 🟣 hyper:check | 0.2977s | 0.2970s | -0.22% |
| 🟣 regex:check | 1.1335s | 1.1315s | -0.18% |
| 🟣 syn:check | 1.8235s | 1.8171s | -0.35% |
| 🟣 syntex_syntax:check | 6.9047s | 6.8930s | -0.17% |
| Total | 12.1586s | 12.1336s | -0.21% |
| Summary | 1.0000s | 0.9977s | -0.23% |

cc @SparrowLii

@rustbot
Collaborator

rustbot commented Feb 8, 2023

r? @jackh726

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 8, 2023
}

// Create a dummy registry to allow `WorkerLocal` construction.
// We use `OnceCell` so we only register one dummy registry per thread.

Contributor Author

This is kind of hacky due to #101313 now using WorkerLocal outside the Rayon thread pool.

Member

In this case #101313 shouldn't use WorkerLocal, since WorkerLocal::new() will reset the AttrIdAllocator. It can use Cell<u32> in single-thread mode and AtomicU32 in parallel mode, but this requires the DynSendSync I mentioned in #107586.

Contributor Author

Atomic adds are very cheap, reverting it may even be a performance improvement, so I think that's the best option.

Contributor Author

Actually, changing the PR to use AttrIdGenerator(AtomicU32) is nicer, as it avoids the global, which is kind of incorrect.
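
For readers unfamiliar with the pattern, a generator wrapping an `AtomicU32` might look roughly like the sketch below. Only the shape `AttrIdGenerator(AtomicU32)` comes from the comment above; the method name `mk_attr_id`, the plain `u32` return type, and the overflow check are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of an ID generator backed by a single atomic counter.
/// It is an ordinary value owned by whoever needs IDs, not a global.
pub struct AttrIdGenerator(AtomicU32);

impl AttrIdGenerator {
    pub fn new() -> Self {
        AttrIdGenerator(AtomicU32::new(0))
    }

    /// Hands out the next ID (hypothetical method name).
    /// `fetch_add` returns the previous value, so concurrent callers
    /// each receive a distinct number.
    pub fn mk_attr_id(&self) -> u32 {
        let id = self.0.fetch_add(1, Ordering::Relaxed);
        // Assumed guard: fail loudly before the counter wraps around.
        assert!(id != u32::MAX, "attribute ID space exhausted");
        id
    }
}
```

Because the counter is shared rather than per-thread, no thread-index lookup is needed, at the cost of one atomic add per ID.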

Member

Makes sense. But I think there will be some regression in single-threaded mode.

compiler/rustc_data_structures/src/sync/worker_local.rs (outdated, resolved)

/// Gets the registry associated with the current thread. Panics if there's no such registry.
pub fn current() -> Self {
    REGISTRY.with(|registry| registry.get().cloned().expect("No assocated registry"))

Member

@SparrowLii SparrowLii Feb 8, 2023

Could we use registry.get_or_init(Registry::new(1)) here? If users don't explicitly call Registry::register(), it means they don't intend to use functions related to thread_index (in other words, thread_index is always 0), so it is reasonable to limit thread_limit to 1 in that case.

In this case we can get rid of the hacky code below.
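
To make the suggestion concrete, here is a small sketch of the two behaviours using `std::cell::OnceCell`, whose `get_or_init` takes a closure. The `Registry` below is a hypothetical stand-in that models only a thread limit, and the function names `current_strict` and `current_or_default` are invented for this example; this is not the code from the PR.

```rust
use std::cell::OnceCell;
use std::sync::Arc;

/// Hypothetical stand-in for `Registry`; only the thread limit is modelled.
#[derive(Clone)]
pub struct Registry(Arc<usize>);

impl Registry {
    pub fn new(thread_limit: usize) -> Self {
        Registry(Arc::new(thread_limit))
    }

    pub fn thread_limit(&self) -> usize {
        *self.0
    }

    /// True if both handles refer to the same underlying registry.
    pub fn same_as(&self, other: &Registry) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

thread_local! {
    // The registry this thread belongs to, if any.
    static REGISTRY: OnceCell<Registry> = OnceCell::new();
}

/// Strict lookup, in the spirit of the quoted diff: panics when the thread
/// has no registry.
pub fn current_strict() -> Registry {
    REGISTRY.with(|registry| registry.get().cloned().expect("no associated registry"))
}

/// Lenient lookup, in the spirit of the suggestion: a thread without a
/// registry silently gets its own registry with a limit of one thread.
pub fn current_or_default() -> Registry {
    REGISTRY.with(|registry| registry.get_or_init(|| Registry::new(1)).clone())
}
```

One consequence of the lenient variant in this sketch is that every thread that falls back ends up with a different one-thread registry, so values tied to one thread's registry would not verify on another thread.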

Contributor Author

That's still quite hacky. I'd prefer threads to explicitly opt in to using WorkerLocal. You could easily end up mixing up registries if there are multiple in use, causing panics.

@oli-obk
Contributor

oli-obk commented Feb 8, 2023

Hmm... do you expect there to be more avenues for perf improvements here that cannot be done upstream? If we can avoid having more code to maintain, that seems preferable over a 0.2% improvement.

@Zoxc
Contributor Author

Zoxc commented Feb 8, 2023

WorkerLocal is not available in upstream Rayon. This reduces the amount of changes we have to maintain in the rustc-rayon fork.

@Zoxc
Contributor Author

Zoxc commented Feb 8, 2023

@cuviper Is this something you'd want in rayon-core?

While WorkerLocal is very useful, it is a bit awkward that it can only be used on the thread pool itself, compared to rayon's API.

@Zoxc
Contributor Author

Zoxc commented Feb 8, 2023

Just for reference, the performance is orthogonal to the code's location.

@cjgillot cjgillot self-assigned this Feb 8, 2023
@Zoxc
Contributor Author

Zoxc commented Feb 8, 2023

@cuviper
Member

cuviper commented Feb 8, 2023

> @cuviper Is this something you'd want in rayon-core?

The advantage to integration would just be avoiding the shadow-registry, right?

> While WorkerLocal is very useful, it is a bit awkward that it can only be used on the thread pool itself, compared to rayon's API.

Yeah, it doesn't seem like a great fit that way, but I also can't think of how it could be any different.

@Zoxc
Contributor Author

Zoxc commented Feb 8, 2023

> The advantage to integration would just be avoiding the shadow-registry, right?

Yeah.

/// registry.
///
/// Note that there's a race possible where the identifier in `THREAD_DATA` could be reused
/// so this can succeed from a different registry.

Contributor

What would be the consequences?

Contributor Author

The function doesn't panic. The public WorkerLocal type does prevent that race, though.

compiler/rustc_data_structures/src/sync/worker_local.rs (three outdated, resolved review threads)

fn deref(&self) -> &T {
    // This is safe because `verify` will only return values less than
    // `self.registry.thread_limit` which is the size of the `self.locals` array.
    unsafe { &self.locals.get_unchecked(self.registry.id().verify()).0 }

Contributor

Is get_unchecked really needed? verify will access TLS in any case, so that will dominate the perf effect, won't it?

Contributor Author

It reduces the panic paths from 2 to 1. TLS accesses are typically cheaper than panics. This is also not an additional proof obligation: not only must the index be in bounds, it must also refer to the correct thread.
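
The "2 to 1" point can be illustrated with a small sketch. The `Slots` type and its methods are hypothetical and exist only for this example; unlike the real `verify`, which reads the thread index from TLS, the one here takes the index as an argument so the example stays self-contained.

```rust
/// Hypothetical container: `thread_limit` is stored separately from the
/// slice, as in the discussion above.
pub struct Slots<T> {
    // Invariant: `thread_limit == locals.len()`.
    thread_limit: usize,
    locals: Box<[T]>,
}

impl<T> Slots<T> {
    pub fn new(values: Vec<T>) -> Self {
        Slots { thread_limit: values.len(), locals: values.into_boxed_slice() }
    }

    /// Stand-in for `verify`: panics unless the index belongs to this group.
    fn verify(&self, index: usize) -> usize {
        assert!(index < self.thread_limit, "index is outside this registry");
        index
    }

    /// Two places that can panic: `verify`, then the slice bounds check.
    pub fn get_checked(&self, index: usize) -> &T {
        &self.locals[self.verify(index)]
    }

    /// One place that can panic: only `verify`.
    pub fn get_verified(&self, index: usize) -> &T {
        let i = self.verify(index);
        // SAFETY: `verify` returned, so `i < thread_limit`, and by the
        // struct invariant `thread_limit == locals.len()`.
        unsafe { self.locals.get_unchecked(i) }
    }
}
```

Both accessors return the same element for a valid index; the difference is only in how many failure branches the generated code may need to carry.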

compiler/rustc_interface/src/interface.rs (outdated, resolved)

@bors
Contributor

bors commented Feb 13, 2023

☔ The latest upstream changes (presumably #107989) made this pull request unmergeable. Please resolve the merge conflicts.

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 16, 2023
Factor query arena allocation out from query caches

This moves the logic for arena allocation out from the query caches into conditional code in the query system. The specialized arena caches are removed. A new `QuerySystem` type is added in `rustc_middle` which contains the arenas, providers and query caches.

Performance seems to be slightly regressed:
<table><tr><td rowspan="2">Benchmark</td><td colspan="1"><b>Before</b></td><td colspan="2"><b>After</b></td></tr><tr><td align="right">Time</td><td align="right">Time</td><td align="right">%</td></tr><tr><td>🟣 <b>clap</b>:check</td><td align="right">1.8053s</td><td align="right">1.8109s</td><td align="right"> 0.31%</td></tr><tr><td>🟣 <b>hyper</b>:check</td><td align="right">0.2600s</td><td align="right">0.2597s</td><td align="right"> -0.10%</td></tr><tr><td>🟣 <b>regex</b>:check</td><td align="right">0.9973s</td><td align="right">1.0006s</td><td align="right"> 0.34%</td></tr><tr><td>🟣 <b>syn</b>:check</td><td align="right">1.6048s</td><td align="right">1.6051s</td><td align="right"> 0.02%</td></tr><tr><td>🟣 <b>syntex_syntax</b>:check</td><td align="right">6.2992s</td><td align="right">6.3159s</td><td align="right"> 0.26%</td></tr><tr><td>Total</td><td align="right">10.9664s</td><td align="right">10.9922s</td><td align="right"> 0.23%</td></tr><tr><td>Summary</td><td align="right">1.0000s</td><td align="right">1.0017s</td><td align="right"> 0.17%</td></tr></table>

Incremental performance is a bit worse:
<table><tr><td rowspan="2">Benchmark</td><td colspan="1"><b>Before</b></td><td colspan="2"><b>After</b></td></tr><tr><td align="right">Time</td><td align="right">Time</td><td align="right">%</td></tr><tr><td>🟣 <b>clap</b>:check:initial</td><td align="right">2.2103s</td><td align="right">2.2247s</td><td align="right"> 0.65%</td></tr><tr><td>🟣 <b>hyper</b>:check:initial</td><td align="right">0.3335s</td><td align="right">0.3349s</td><td align="right"> 0.41%</td></tr><tr><td>🟣 <b>regex</b>:check:initial</td><td align="right">1.2597s</td><td align="right">1.2650s</td><td align="right"> 0.42%</td></tr><tr><td>🟣 <b>syn</b>:check:initial</td><td align="right">2.0521s</td><td align="right">2.0613s</td><td align="right"> 0.45%</td></tr><tr><td>🟣 <b>syntex_syntax</b>:check:initial</td><td align="right">7.8275s</td><td align="right">7.8583s</td><td align="right"> 0.39%</td></tr><tr><td>Total</td><td align="right">13.6832s</td><td align="right">13.7442s</td><td align="right"> 0.45%</td></tr><tr><td>Summary</td><td align="right">1.0000s</td><td align="right">1.0046s</td><td align="right"> 0.46%</td></tr></table>

It does seem like LLVM optimizers struggle a bit with the current state of the query system.

Based on top of rust-lang#107782 and rust-lang#107802.

r? `@cjgillot`
@jackh726 jackh726 removed their assignment Feb 18, 2023
@cjgillot
Contributor

cjgillot commented Mar 9, 2023

Sorry for the slow review @Zoxc. Is that PR enough to get rid of the rustc-rayon fork, or is there more waiting there?
Did you make progress in upstreaming into rayon-core?

@Zoxc
Contributor Author

Zoxc commented Mar 9, 2023

I have no plans to upstream anything in the rustc-rayon fork or to get rid of it at the moment. This is just to keep the fork more minimal.

@bors
Contributor

bors commented Mar 31, 2023

☔ The latest upstream changes (presumably #109791) made this pull request unmergeable. Please resolve the merge conflicts.

@oli-obk
Contributor

oli-obk commented Apr 16, 2023

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Apr 16, 2023
@bors
Contributor

bors commented Apr 16, 2023

⌛ Trying commit efe7cf4 with merge b407b81070e680b8097a5568108933cdc4f1331a...

@bors
Contributor

bors commented Apr 16, 2023

☀️ Try build successful - checks-actions
Build commit: b407b81070e680b8097a5568108933cdc4f1331a

@rust-timer
Collaborator

Finished benchmarking commit (b407b81070e680b8097a5568108933cdc4f1331a): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---:|---:|---:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.3% | [-4.3%, -4.3%] | 1 |
| Improvements ✅ (secondary) | -2.2% | [-2.9%, -1.2%] | 4 |
| All ❌✅ (primary) | -4.3% | [-4.3%, -4.3%] | 1 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---:|---:|---:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.5% | [2.5%, 2.5%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Apr 16, 2023
@pnkfelix
Member

Discussed in the T-compiler triage meeting

The members present were weakly in favor of this moving forward.

(It was "weakly" in favor because in an ideal world we wouldn't have a fork of rayon, and likewise in an ideal world we would identify abstractions that the rayon-core is willing to adopt upstream, but since we are not in an ideal world, we will accept compromises.)

@cjgillot
Contributor

@bors r+

@bors
Contributor

bors commented Apr 27, 2023

📌 Commit efe7cf4 has been approved by cjgillot

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 27, 2023
@bors
Contributor

bors commented Apr 27, 2023

⌛ Testing commit efe7cf4 with merge c14882f...

@bors
Contributor

bors commented Apr 27, 2023

☀️ Test successful - checks-actions
Approved by: cjgillot
Pushing c14882f to master...

@bors bors added merged-by-bors This PR was explicitly merged by bors. labels Apr 27, 2023
@bors bors merged commit c14882f into rust-lang:master Apr 27, 2023
@rustbot rustbot added this to the 1.71.0 milestone Apr 27, 2023
@rust-timer
Collaborator

Finished benchmarking commit (c14882f): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---:|---:|---:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.2% | [-0.2%, -0.2%] | 1 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---:|---:|---:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.7% | [2.7%, 2.7%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Cycles

This benchmark run did not return any relevant results for this metric.

@Zoxc Zoxc deleted the worker-local branch April 28, 2023 03:10
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
10 participants