make `PyErrState` thread-safe #4671

davidhewitt · 2024-10-30T19:56:06Z

This PR resolves the thread-safety challenges of #4584 so that we can at least ship 0.23.

I don't love the complexity that this lazy state creates inside error-handling pathways, so in the future I plan to proceed with #4669 and further steps to remove the lazy state. But 0.23 is already breaking enough; users don't need more changes, and this should be a drop-in replacement.

ngoldbaum · 2024-10-31T15:05:07Z

I noticed clippy was failing so I just pushed a fix. I'll try to get the CI green on this if there are any more issues.

ngoldbaum · 2024-10-31T15:35:49Z

src/err/err_state.rs

-            match self_state {
-                Some(PyErrStateInner::Normalized(n)) => n,
-                _ => unreachable!(),
+            let normalized_state = PyErrStateInner::Normalized(state.normalize(py));


I think the only spot where there might be a deadlock is here, if `normalize` somehow leads to arbitrary Python code execution.

Is that possible? If not I think it deserves a comment explaining why.

If it can deadlock, I'm not sure what we can do, since at this point we haven't actually constructed any Python objects yet and we only have a handle to an `FnOnce` that knows how to construct them.

Great observation; I've added a wrapping call to `py.allow_threads` before potentially blocking on the `Once`, which I think avoids the deadlock (I pushed a test which did deadlock before that change).
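
The shape of that fix can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `src/err/err_state.rs`: a plain `Mutex<()>` stands in for the GIL, and `LazyState`, `allow_threads`, `normalize` and `run_demo` are hypothetical names. Each thread gives up the lock before it may block on the `Once`, and the initializer takes the lock again only for the work itself, so no thread ever waits on the `Once` while holding the lock that the initializing thread needs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, MutexGuard, Once};
use std::thread;

/// Hypothetical stand-in for a lazily normalized error state.
struct LazyState {
    once: Once,
    runs: AtomicUsize, // how many times the "normalization" body ran
}

/// Stand-in for `py.allow_threads`: release the lock, run `f`, take it back.
fn allow_threads<'a, T>(
    gil: &'a Mutex<()>,
    held: MutexGuard<'a, ()>,
    f: impl FnOnce() -> T,
) -> (MutexGuard<'a, ()>, T) {
    drop(held);
    let out = f();
    (gil.lock().unwrap(), out)
}

/// Normalize at most once; never block on the `Once` while holding the lock.
fn normalize<'a>(
    state: &LazyState,
    gil: &'a Mutex<()>,
    held: MutexGuard<'a, ()>,
) -> MutexGuard<'a, ()> {
    let (held, ()) = allow_threads(gil, held, || {
        state.once.call_once(|| {
            // Re-acquire the lock only for the work that needs it.
            let _gil = gil.lock().unwrap();
            state.runs.fetch_add(1, Ordering::SeqCst);
        });
    });
    held
}

/// Spawn `threads` workers that all try to normalize the same state.
fn run_demo(threads: usize) -> usize {
    let gil = Arc::new(Mutex::new(()));
    let state = Arc::new(LazyState { once: Once::new(), runs: AtomicUsize::new(0) });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let (gil, state) = (Arc::clone(&gil), Arc::clone(&state));
            thread::spawn(move || {
                let held = gil.lock().unwrap();
                let _held = normalize(&state, &gil, held);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    state.runs.load(Ordering::SeqCst)
}
```

Calling `run_demo(8)` should finish with the body having run once. If `normalize` instead called `call_once` while still holding the lock, a second thread could block on the `Once` with the lock held while the first thread waits for that lock inside the initializer.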

ngoldbaum · 2024-10-31T15:36:42Z

The algorithm makes sense to me. I agree that this ensures that an error state can't be normalized simultaneously in two threads.

codspeed-hq · 2024-10-31T21:13:43Z

CodSpeed Performance Report

Merging #4671 will degrade performances by 26.66%

Comparing davidhewitt:threadsafe-err (cd39af0) with main (d45e0bd)

Summary

❌ 2 regressions
✅ 81 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| | Benchmark | `main` | `davidhewitt:threadsafe-err` | Change |
|---|---|---|---|---|
| ❌ | `enum_from_pyobject` | 19 µs | 25 µs | -24.08% |
| ❌ | `not_a_list_via_extract_enum` | 13.4 µs | 18.3 µs | -26.66% |

ngoldbaum · 2024-10-31T21:17:28Z

Huh, I can reproduce the test failure happening on CI. It's flaky, but you can trigger it by running `cargo test --no-default-features --features "multiple-pymethods abi3-py37 full" --test "test_declarative_module"` in a while loop.

ngoldbaum · 2024-10-31T21:24:11Z

src/err/err_state.rs

-                .expect("Cannot normalize a PyErr while already normalizing it.")
-        };
+        // avoid deadlock of `.call_once` with the GIL
+        py.allow_threads(|| {


I guess dropping the GIL somehow allows a race condition to happen where multiple threads try to simultaneously create a module...

Yeah, I think it's the combination with `GILOnceCell` in `test_declarative_module`: `GILOnceCell` allows racing when the GIL is switched, so this module really does get created multiple times. I think it's a bug to use `GILOnceCell` in that test, but this also just makes me dislike this lazy stuff even more...

I guess this is just a fundamental issue with `GILOnceCell` being racy if the code it wraps ever drops the GIL.

EDIT: jinx!

I've opened #4676; if I apply that patch on this branch, the problem goes away.
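
The race described in this thread can be reproduced in miniature without PyO3. The sketch below is an assumption-laden illustration, not `GILOnceCell` itself: `get_or_init_racy` and `count_racy_inits` are hypothetical names, and a `Barrier` plays the role of the point where the initializer gives up the GIL. Because the initializer runs with no lock held, both threads see an empty cell, both run the initializer, and only the first value written is kept.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, OnceLock};
use std::thread;

/// Hypothetical "check, compute, then set" cell access: the initializer runs
/// with no lock held, so several threads may run it concurrently.
fn get_or_init_racy<T>(cell: &OnceLock<T>, f: impl FnOnce() -> T) -> &T {
    if let Some(value) = cell.get() {
        return value;
    }
    let value = f();
    let _ = cell.set(value); // a losing thread's value is simply dropped
    cell.get().expect("cell was just initialized")
}

/// Returns how many times the initializer ran across two racing threads.
fn count_racy_inits() -> usize {
    let cell = Arc::new(OnceLock::<usize>::new());
    let runs = Arc::new(AtomicUsize::new(0));
    // The barrier forces both threads to be inside the initializer at once,
    // standing in for "the initializer released the GIL part-way through".
    let barrier = Arc::new(Barrier::new(2));
    let handles: Vec<_> = (0..2usize)
        .map(|id| {
            let (cell, runs, barrier) =
                (Arc::clone(&cell), Arc::clone(&runs), Arc::clone(&barrier));
            thread::spawn(move || {
                *get_or_init_racy(&cell, || {
                    runs.fetch_add(1, Ordering::SeqCst);
                    barrier.wait();
                    id
                })
            })
        })
        .collect();
    let values: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    // Both threads observe the same stored value, even though both computed one.
    assert_eq!(values[0], values[1]);
    runs.load(Ordering::SeqCst)
}
```

Here `count_racy_inits()` reports two initializer runs for one stored value. A cell that blocks late arrivals until the first initializer finishes, such as `OnceLock::get_or_init`, runs the initializer once, at the cost of the blocking that the deadlock discussion above is about.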

davidhewitt · 2024-11-04T06:46:33Z

Well, this passes CI, but somewhat expectedly the additional complexity added here brings performance overheads.

I think we either merge and release this, or we push forward on removing #4584 and the lazy state entirely, which I think still needs some additional development but should hopefully be simpler and faster.

Given that I think 0.23 is already overdue and doesn't need more breaking changes (which #4584 would be), I think we should merge this and look hard at optimising errors on the way to 0.24.

ngoldbaum · 2024-11-04T17:43:30Z

I wanted to understand a bit better where the slowdown comes from, and hopefully also figure out how to do performance analysis on Rust code.

I ended up generating the attached flamegraph after modifying the benchmark to run a million times inside a for loop to improve the statistics.

So a microbenchmark that fails to extract a tuple element and raises an error for it is slower, and we can see it spending some time inside the `Once`, which it didn't have to do before.

I agree with David that this is acceptable given the safety issues we ran into elsewhere. We won't need to worry about re-acquiring the GIL or deadlocking if we can just get rid of normalization entirely and create an error while we initially have a handle on the runtime.

davidhewitt · 2024-11-04T20:28:53Z

Yes, I think a huge chunk of that flamegraph could be eliminated by not having this lazy layer. But that's a problem for 0.24 now, sigh.

davidhewitt · 2024-11-04T20:29:09Z

Will proceed to merge this so that we're one final hop closer to that release!

* make `PyErrState` thread-safe
* fix clippy
* add test of reentrancy, fix deadlock
* newsfragment
* fix MSRV
* fix nightly build

---------

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

davidhewitt and others added 2 commits October 30, 2024 19:54

make PyErrState thread-safe

08beaa5

fix clippy

8421034

ngoldbaum reviewed Oct 31, 2024

ngoldbaum reviewed Oct 31, 2024

davidhewitt added 2 commits October 31, 2024 20:52

add test of reentrancy, fix deadlock

b4d3a94

newsfragment

4a30dde

ngoldbaum reviewed Oct 31, 2024

fix MSRV

f5fa452

davidhewitt mentioned this pull request Oct 31, 2024

add sync::OnceExt and sync::OnceLockExt traits #4676

Merged

davidhewitt added 2 commits November 2, 2024 20:38

Merge branch 'main' into threadsafe-err

55629f0

fix nightly build

d5ba475

davidhewitt added this pull request to the merge queue Nov 4, 2024

davidhewitt mentioned this pull request Nov 4, 2024

release: 0.23.0 #4651

Open

5 tasks

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 4, 2024

ngoldbaum reviewed Nov 4, 2024

ngoldbaum enabled auto-merge November 5, 2024 01:49

ngoldbaum added this pull request to the merge queue Nov 5, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 5, 2024

Update err_state.rs

cd39af0

ngoldbaum force-pushed the threadsafe-err branch from 9ba571d to cd39af0 Compare November 5, 2024 02:49

ngoldbaum enabled auto-merge November 5, 2024 02:49

ngoldbaum added this pull request to the merge queue Nov 5, 2024

Merged via the queue into PyO3:main with commit 9f955e4 Nov 5, 2024
43 of 44 checks passed
