Move the WorkerLocal type from the rustc-rayon fork into rustc_data_structures #107782

Zoxc · 2023-02-08T00:53:47Z

This PR moves the definition of the WorkerLocal type from rustc-rayon into rustc_data_structures. This is enabled by the introduction of the Registry type, which lets you group threads for use by WorkerLocal; WorkerLocal itself is basically just an array indexed by a per-thread index. The Registry type mirrors the one in Rayon, and each Rayon worker thread is also registered with the new Registry. Safety for WorkerLocal is ensured by having it keep a reference to the registry and checking on each access that we're still on one of the threads associated with the registry used to construct it.
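For readers who want a concrete picture of the description above, here is a minimal sketch of the idea, not the code from this PR. The field names, the `THREAD_DATA` layout, the `get` method, and the `WorkerLocal::new(&registry, init)` signature are assumptions made for illustration; the sketch also uses ordinary bounds-checked indexing rather than the micro-optimized access path.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified stand-in for `Registry`: a group of threads,
/// each of which receives a distinct index below `thread_limit`.
#[derive(Clone)]
pub struct Registry(Arc<RegistryData>);

struct RegistryData {
    thread_limit: usize,
    next_index: Mutex<usize>,
}

thread_local! {
    // (address identifying the registry, index of this thread within it)
    static THREAD_DATA: Cell<Option<(usize, usize)>> = Cell::new(None);
}

impl Registry {
    pub fn new(thread_limit: usize) -> Self {
        Registry(Arc::new(RegistryData { thread_limit, next_index: Mutex::new(0) }))
    }

    /// The address of the shared allocation serves as the registry's identity.
    fn id(&self) -> usize {
        Arc::as_ptr(&self.0) as usize
    }

    /// Registers the current thread. Panics if the registry is full or the
    /// thread is already registered somewhere.
    pub fn register(&self) {
        let mut next = self.0.next_index.lock().unwrap();
        assert!(*next < self.0.thread_limit, "thread limit reached");
        THREAD_DATA.with(|data| {
            assert!(data.get().is_none(), "thread already registered");
            data.set(Some((self.id(), *next)));
        });
        *next += 1;
    }

    /// Returns this thread's index, panicking if the thread is unregistered
    /// or belongs to a different registry.
    fn verify(&self) -> usize {
        match THREAD_DATA.with(|data| data.get()) {
            Some((id, index)) if id == self.id() => index,
            _ => panic!("thread is not registered with this registry"),
        }
    }
}

/// One value per registered thread, selected by the thread's index.
pub struct WorkerLocal<T> {
    locals: Box<[T]>,
    // Holding the registry keeps its allocation alive, so the address used
    // as its identity cannot be reused while this `WorkerLocal` exists.
    registry: Registry,
}

impl<T> WorkerLocal<T> {
    pub fn new(registry: &Registry, mut init: impl FnMut(usize) -> T) -> Self {
        let locals = (0..registry.0.thread_limit).map(|i| init(i)).collect();
        WorkerLocal { locals, registry: registry.clone() }
    }

    /// Returns the slot belonging to the calling thread.
    pub fn get(&self) -> &T {
        &self.locals[self.registry.verify()]
    }
}
```

In this sketch a thread calls `registry.register()` once and can then call `get()` on any `WorkerLocal` built from the same registry; a thread from another group fails the `verify` check instead of reading someone else's slot.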

Access to WorkerLocal is micro-optimized because it is hot: it's used for most arena allocations.

Performance is slightly improved for the parallel compiler:

Benchmark	Before (time)	After (time)	Change
🟣 clap:check	1.9992s	1.9949s	-0.21%
🟣 hyper:check	0.2977s	0.2970s	-0.22%
🟣 regex:check	1.1335s	1.1315s	-0.18%
🟣 syn:check	1.8235s	1.8171s	-0.35%
🟣 syntex_syntax:check	6.9047s	6.8930s	-0.17%
Total	12.1586s	12.1336s	-0.21%
Summary	1.0000s	0.9977s	-0.23%

cc @SparrowLii

rustbot · 2023-02-08T00:53:54Z

r? @jackh726

(rustbot has picked a reviewer for you, use r? to override)

Zoxc · 2023-02-08T00:54:56Z

compiler/rustc_interface/src/interface.rs

+    }
+
+    // Create a dummy registry to allow `WorkerLocal` construction.
+    // We use `OnceCell` so we only register one dummy registry per thread.


This is kind of hacky due to #101313 now using WorkerLocal outside the Rayon thread pool.

In this case #101313 shouldn't use WorkerLocal, since WorkerLocal::new() will reset the AttrIdAllocator. It can use Cell<u32> in single-thread mode and AtomicU32 in parallel mode, but this requires the DynSendSync I mentioned in #107586.

Atomic adds are very cheap, reverting it may even be a performance improvement, so I think that's the best option.

Actually changing the PR to use AttrIdGenerator(AtomicU32) is nicer, as it avoids the global which is kind of incorrect.

Makes sense. But I think there will be some regression in single-threaded mode.
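As an illustration of the `AttrIdGenerator(AtomicU32)` idea discussed above, here is a minimal sketch. It is not the code that landed: the method name `mk_attr_id` is borrowed from the title of #101313, returning a bare `u32` is a simplification, and the overflow check is an assumption.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical sketch of an attribute-id generator backed by an atomic
/// counter. It is owned by whoever needs ids, so no global state and no
/// per-thread slots are required.
pub struct AttrIdGenerator(AtomicU32);

impl AttrIdGenerator {
    pub fn new() -> Self {
        AttrIdGenerator(AtomicU32::new(0))
    }

    /// Returns a fresh id. `fetch_add` is a single atomic read-modify-write,
    /// so concurrent callers never observe the same value.
    pub fn mk_attr_id(&self) -> u32 {
        let id = self.0.fetch_add(1, Ordering::Relaxed);
        // Assumed guard: refuse to wrap around and hand out duplicates.
        assert!(id != u32::MAX, "attribute id counter overflowed");
        id
    }
}
```

`Ordering::Relaxed` is enough here because only the uniqueness of the returned values matters, not their ordering relative to other memory operations.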

SparrowLii · 2023-02-08T04:08:12Z

compiler/rustc_data_structures/src/sync/worker_local.rs

+
+    /// Gets the registry associated with the current thread. Panics if there's no such registry.
+    pub fn current() -> Self {
+        REGISTRY.with(|registry| registry.get().cloned().expect("No assocated registry"))


Could we use registry.get_or_init(Registry::new(1)) here? If users don't explicitly call Registry::register(), it means they don't want to use functions related to thread_index (in other words, thread_index is always 0), and it is reasonable to limit thread_limit to 1 in that case.

In this case we can get rid of the hack code below.

That's still quite hacky. I'd prefer threads to explicitly opt in to using WorkerLocal. You could easily end up mixing up registries if there are multiple in use, causing panics.
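The two options debated in this thread can be put side by side in a small sketch. This is an illustration only: `current_or_default` is a hypothetical name for the `get_or_init` suggestion, and the `Registry` here carries nothing but a thread limit.

```rust
use std::cell::OnceCell;
use std::sync::Arc;

struct RegistryData {
    thread_limit: usize,
}

/// Hypothetical, stripped-down registry used only to contrast the lookups.
#[derive(Clone)]
pub struct Registry(Arc<RegistryData>);

thread_local! {
    // The registry this thread has been associated with, if any.
    static REGISTRY: OnceCell<Registry> = OnceCell::new();
}

impl Registry {
    pub fn new(thread_limit: usize) -> Self {
        Registry(Arc::new(RegistryData { thread_limit }))
    }

    pub fn thread_limit(&self) -> usize {
        self.0.thread_limit
    }

    /// Explicit opt-in: associates the current thread with this registry.
    pub fn register(&self) {
        REGISTRY.with(|r| {
            if r.set(self.clone()).is_err() {
                panic!("thread already has an associated registry");
            }
        });
    }

    /// Strict lookup, in the spirit of the diff: panics for threads that
    /// never registered.
    pub fn current() -> Self {
        REGISTRY.with(|r| r.get().cloned().expect("no associated registry"))
    }

    /// The alternative raised in review: silently fall back to a fresh
    /// one-thread registry for threads that never registered.
    pub fn current_or_default() -> Self {
        REGISTRY.with(|r| r.get_or_init(|| Registry::new(1)).clone())
    }
}
```

With the fallback, every unregistered thread quietly gets its own private registry, so a value built on one thread and used on another only fails later, at access time. The strict version surfaces the missing registration at the point where the registry is first requested.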

oli-obk · 2023-02-08T10:47:11Z

Hmm... do you expect there to be more avenues for perf improvements here that cannot be done upstream? If we can avoid having more code to maintain, that seems preferable over a 0.2% improvement.

Zoxc · 2023-02-08T10:52:05Z

WorkerLocal is not available in upstream Rayon. This reduces the amount of changes we have to maintain in the rustc-rayon fork.

Zoxc · 2023-02-08T10:58:59Z

@cuviper Is this something you'd want in rayon-core?

While WorkerLocal is very useful, it is a bit awkward that it can only be used on the thread pool itself, compared to rayon's API.

Zoxc · 2023-02-08T11:53:42Z

Just for reference, the performance is orthogonal to the code's location.

Zoxc · 2023-02-08T12:09:58Z

Here's the current implementation.

cuviper · 2023-02-08T23:12:39Z

@cuviper Is this something you'd want in rayon-core?

The advantage to integration would just be avoiding the shadow-registry, right?

While WorkerLocal is very useful, it is a bit awkward that it can only be used on the thread pool itself, compared to rayon's API.

Yeah, it doesn't seem like a great fit that way, but I also can't think of how it could be any different.

Zoxc · 2023-02-08T23:26:28Z

The advantage to integration would just be avoiding the shadow-registry, right?

Yeah.

cjgillot · 2023-02-09T17:52:17Z

compiler/rustc_data_structures/src/sync/worker_local.rs

+    /// registry.
+    ///
+    /// Note that there's a race possible where the identifer in `THREAD_DATA` could be reused
+    /// so this can succeed from a different registry.


What would be the consequences?

The function doesn't panic. The public WorkerLocal type does prevent that race though.

cjgillot · 2023-02-09T18:03:30Z

compiler/rustc_data_structures/src/sync/worker_local.rs

+    fn deref(&self) -> &T {
+        // This is safe because `verify` will only return values less than
+        // `self.registry.thread_limit` which is the size of the `self.locals` array.
+        unsafe { &self.locals.get_unchecked(self.registry.id().verify()).0 }


Is get_unchecked really needed? verify will access TLS in any case, so that will dominate the perf effect, won't it?

It reduces the panic paths from 2 to 1. TLS accesses are typically cheaper than panics. This is also not an additional proof obligation, as not only must the index be inbounds, it must refer to the correct thread.
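The "one panic path instead of two" argument can be illustrated with a self-contained sketch. This is not the PR's code: `Slots`, `ThreadToken`, and the global `NEXT_OWNER` counter are hypothetical stand-ins for the `WorkerLocal`, thread-local data, and registry identity in the diff.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Source of unique owner ids, so a token can never match the wrong `Slots`.
static NEXT_OWNER: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the data a registered thread would carry.
pub struct ThreadToken {
    owner: usize,
    index: usize,
}

pub struct Slots<T> {
    owner: usize,
    slots: Box<[T]>,
}

impl<T> Slots<T> {
    pub fn new(slots: Vec<T>) -> Self {
        Slots {
            owner: NEXT_OWNER.fetch_add(1, Ordering::Relaxed),
            slots: slots.into_boxed_slice(),
        }
    }

    /// Issues a token for `index`. The bounds check happens once, here, so
    /// every token carrying this owner id has `index < self.slots.len()`.
    pub fn token(&self, index: usize) -> ThreadToken {
        assert!(index < self.slots.len(), "index out of range");
        ThreadToken { owner: self.owner, index }
    }

    /// The only check performed on access: does the token belong to us?
    fn verify(&self, token: &ThreadToken) -> usize {
        assert!(token.owner == self.owner, "token belongs to a different owner");
        token.index
    }

    pub fn get(&self, token: &ThreadToken) -> &T {
        let index = self.verify(token);
        // SAFETY: `verify` only returns indices from tokens issued by this
        // value, and `token` checked them against `self.slots.len()`.
        // Writing `&self.slots[index]` instead would add a second,
        // unreachable panic path on every access.
        unsafe { self.slots.get_unchecked(index) }
    }
}
```

The ownership check cannot be dropped in either version, because an in-bounds index from the wrong owner would still select the wrong slot. That is why the unchecked access adds no proof obligation beyond what the check already has to establish.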

bors · 2023-02-13T17:38:02Z

☔ The latest upstream changes (presumably #107989) made this pull request unmergeable. Please resolve the merge conflicts.

Factor query arena allocation out from query caches

This moves the logic for arena allocation out from the query caches into conditional code in the query system. The specialized arena caches are removed. A new `QuerySystem` type is added in `rustc_middle` which contains the arenas, providers and query caches.

Performance seems to be slightly regressed:

Benchmark	Before (time)	After (time)	Change
🟣 clap:check	1.8053s	1.8109s	0.31%
🟣 hyper:check	0.2600s	0.2597s	-0.10%
🟣 regex:check	0.9973s	1.0006s	0.34%
🟣 syn:check	1.6048s	1.6051s	0.02%
🟣 syntex_syntax:check	6.2992s	6.3159s	0.26%
Total	10.9664s	10.9922s	0.23%
Summary	1.0000s	1.0017s	0.17%

Incremental performance is a bit worse:

Benchmark	Before (time)	After (time)	Change
🟣 clap:check:initial	2.2103s	2.2247s	0.65%
🟣 hyper:check:initial	0.3335s	0.3349s	0.41%
🟣 regex:check:initial	1.2597s	1.2650s	0.42%
🟣 syn:check:initial	2.0521s	2.0613s	0.45%
🟣 syntex_syntax:check:initial	7.8275s	7.8583s	0.39%
Total	13.6832s	13.7442s	0.45%
Summary	1.0000s	1.0046s	0.46%

It does seem like LLVM optimizers struggle a bit with the current state of the query system.

Based on top of rust-lang#107782 and rust-lang#107802.

r? `@cjgillot`

cjgillot · 2023-03-09T15:56:52Z

Sorry for the slow review @Zoxc. Is that PR enough to get rid of the rustc-rayon fork, or is there more waiting there?
Did you make progress in upstreaming into rayon-core?

Zoxc · 2023-03-09T16:38:18Z

I have no plans to upstream anything in the rustc-rayon fork or to get rid of it at the moment. This is just to keep the fork more minimal.

bors · 2023-03-31T01:02:34Z

☔ The latest upstream changes (presumably #109791) made this pull request unmergeable. Please resolve the merge conflicts.

oli-obk · 2023-04-16T07:18:18Z

@bors try @rust-timer queue

bors · 2023-04-16T07:18:26Z

⌛ Trying commit efe7cf4 with merge b407b81070e680b8097a5568108933cdc4f1331a...

bors · 2023-04-16T09:01:42Z

☀️ Try build successful - checks-actions
Build commit: b407b81070e680b8097a5568108933cdc4f1331a (b407b81070e680b8097a5568108933cdc4f1331a)

rust-timer · 2023-04-16T10:30:09Z

Finished benchmarking commit (b407b81070e680b8097a5568108933cdc4f1331a): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-4.3%	[-4.3%, -4.3%]	1
Improvements ✅ (secondary)	-2.2%	[-2.9%, -1.2%]	4
All ❌✅ (primary)	-4.3%	[-4.3%, -4.3%]	1

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	2.5%	[2.5%, 2.5%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

pnkfelix · 2023-04-27T14:46:17Z

Discussed in the T-compiler triage meeting

The members present were weakly in favor of this moving forward.

(It was "weakly" in favor because in an ideal world we wouldn't have a fork of rayon, and likewise in an ideal world we would identify abstractions that rayon-core is willing to adopt upstream, but since we are not in an ideal world, we will accept compromises.)

cjgillot · 2023-04-27T15:56:38Z

@bors r+

bors · 2023-04-27T15:56:40Z

📌 Commit efe7cf4 has been approved by cjgillot

It is now in the queue for this repository.

bors · 2023-04-27T17:43:12Z

⌛ Testing commit efe7cf4 with merge c14882f...

bors · 2023-04-27T20:42:39Z

☀️ Test successful - checks-actions
Approved by: cjgillot
Pushing c14882f to master...

bors · 2023-04-27T20:42:40Z

☀️ Test successful - checks-actions
Approved by: cjgillot
Pushing c14882f to master...

rust-timer · 2023-04-27T22:22:10Z

Finished benchmarking commit (c14882f): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-0.2%	[-0.2%, -0.2%]	1
All ❌✅ (primary)	-	-	0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	2.7%	[2.7%, 2.7%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

Cycles

This benchmark run did not return any relevant results for this metric.

rustbot assigned jackh726 Feb 8, 2023

rustbot added the S-waiting-on-review and T-compiler labels Feb 8, 2023

Zoxc commented Feb 8, 2023

Zoxc mentioned this pull request Feb 8, 2023

make mk_attr_id part of ParseSess #101313

Merged

SparrowLii reviewed Feb 8, 2023

Zoxc force-pushed the worker-local branch from 4b9e230 to 971d31d Compare February 8, 2023 08:19

cjgillot self-assigned this Feb 8, 2023

Zoxc mentioned this pull request Feb 9, 2023

Factor query arena allocation out from query caches #107833

Merged

cjgillot reviewed Feb 9, 2023

Zoxc force-pushed the worker-local branch from ba92cdd to 4136cfd Compare February 14, 2023 11:56

jackh726 removed their assignment Feb 18, 2023

Zoxc mentioned this pull request Mar 25, 2023

refactor WorkerLocal for parallel compiler #109478

Closed

Move the WorkerLocal type from the rustc-rayon fork into rustc_data_structures

64474a4

Remove WorkerLocal from AttrIdGenerator

efe7cf4

Zoxc force-pushed the worker-local branch from 4136cfd to efe7cf4 Compare April 16, 2023 03:59

rustbot added the S-waiting-on-perf label Apr 16, 2023

rustbot removed the S-waiting-on-perf label Apr 16, 2023

bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Apr 27, 2023

bors added the merged-by-bors label Apr 27, 2023

bors merged commit c14882f into rust-lang:master Apr 27, 2023

rustbot added this to the 1.71.0 milestone Apr 27, 2023

bors mentioned this pull request Apr 27, 2023

test the parallel compiler #109776

Closed

Zoxc deleted the worker-local branch April 28, 2023 03:10
