
make PyErrState thread-safe #4671

Merged
8 commits from davidhewitt:threadsafe-err into PyO3:main, Nov 5, 2024

Conversation

davidhewitt
Member

This PR resolves the thread-safety challenges of #4584 so that we can at least ship 0.23.

I don't love the complexity that this lazy state creates inside the error-handling pathways, so in the future I plan to proceed with #4669 and further steps to remove the lazy state. But 0.23 is already breaking enough; users don't need more changes, and this should be an in-place drop-in.
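To sketch the shape of the change (this is not PyO3's actual implementation, just a minimal illustration of the idea, with made-up names like `LazyNormalized`): the lazily-created error state is guarded so that normalization runs at most once, and any other thread that needs the normalized value blocks until the first thread has produced it.

```rust
// Hypothetical sketch only -- illustrative names, not PyO3 API.
use std::sync::{Mutex, Once};

struct LazyNormalized<T> {
    once: Once,
    slot: Mutex<Option<T>>,
}

impl<T: Clone> LazyNormalized<T> {
    fn new() -> Self {
        Self { once: Once::new(), slot: Mutex::new(None) }
    }

    /// Only the first caller runs `normalize`; later callers block on the `Once`
    /// until the value has been stored, then read it back out of the slot.
    fn get_or_normalize(&self, normalize: impl FnOnce() -> T) -> T {
        self.once.call_once(|| {
            let value = normalize();
            *self.slot.lock().unwrap() = Some(value);
        });
        self.slot.lock().unwrap().clone().expect("set by call_once")
    }
}
```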

@ngoldbaum
Contributor

I noticed clippy was failing so I just pushed a fix. I'll try to get the CI green on this if there are any more issues.

```rust
match self_state {
    Some(PyErrStateInner::Normalized(n)) => n,
    _ => unreachable!(),
}
// ...
let normalized_state = PyErrStateInner::Normalized(state.normalize(py));
```
Contributor

I think the only spot where there might be a deadlock is here, if normalize somehow leads to arbitrary Python code execution.

Is that possible? If not, I think it deserves a comment explaining why.

Contributor

If it can deadlock, I'm not sure what we can do, since at this point we haven't actually constructed any Python objects yet and we only have a handle to an FnOnce that knows how to construct them.

Member Author

Great observation; I've added a wrapping call to py.allow_threads before potentially blocking on the Once, which I think avoids the deadlock (I pushed a test which did deadlock before that change).
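For readers following along, a minimal sketch of the pattern being described, under the assumption that the deadlock is one thread waiting on the `Once` while holding the GIL as another thread inside `call_once` tries to acquire it. Names here are illustrative, not the actual PyO3 internals:

```rust
use std::sync::Once;
use pyo3::prelude::*;

// Illustrative only: stands in for whatever `Once` guards normalization.
static NORMALIZE_ONCE: Once = Once::new();

fn normalize_guarded(py: Python<'_>) {
    // Without `allow_threads`, this thread would wait on the `Once` while still
    // holding the GIL; the thread inside `call_once` could then block forever
    // trying to acquire the GIL, and neither thread would make progress.
    py.allow_threads(|| {
        NORMALIZE_ONCE.call_once(|| {
            Python::with_gil(|_py| {
                // ... construct / normalize the exception objects here ...
            });
        });
    });
}
```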

@ngoldbaum
Contributor

The algorithm makes sense to me, I agree that this ensures that normalizing an error state can't be done simultaneously in two threads.


codspeed-hq bot commented Oct 31, 2024

CodSpeed Performance Report

Merging #4671 will degrade performance by 26.66%

Comparing davidhewitt:threadsafe-err (cd39af0) with main (d45e0bd)

Summary

❌ 2 regressions
✅ 81 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | davidhewitt:threadsafe-err | Change |
|---|---|---|---|
| enum_from_pyobject | 19 µs | 25 µs | -24.08% |
| not_a_list_via_extract_enum | 13.4 µs | 18.3 µs | -26.66% |

@ngoldbaum
Contributor

Huh, I can reproduce the test failure that's happening on CI. It's flaky, but you can trigger it by running `cargo test --no-default-features --features "multiple-pymethods abi3-py37 full" --test "test_declarative_module"` in a while loop.
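For reference, a hedged sketch of a reproduction loop written in Rust (equivalent to running the command above in a shell while loop); the command line mirrors the one quoted above and may need adjusting for your checkout:

```rust
use std::process::Command;

fn main() {
    loop {
        // Re-run the flaky test until it fails.
        let status = Command::new("cargo")
            .args([
                "test",
                "--no-default-features",
                "--features",
                "multiple-pymethods abi3-py37 full",
                "--test",
                "test_declarative_module",
            ])
            .status()
            .expect("failed to run cargo");
        if !status.success() {
            eprintln!("reproduced the flaky failure");
            break;
        }
    }
}
```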

```rust
    .expect("Cannot normalize a PyErr while already normalizing it.")
};
// avoid deadlock of `.call_once` with the GIL
py.allow_threads(|| {
```
Contributor

I guess dropping the GIL somehow allows a race condition where multiple threads try to create the module simultaneously...

Member Author
@davidhewitt Oct 31, 2024

Yeah, I think it's a combination with GILOnceCell in test_declarative_module; we allow racing in GILOnceCell when the GIL is switched during initialization, so this module does actually attempt to get created multiple times. I think it's a bug in using GILOnceCell for that test, but this also just makes me dislike this lazy stuff even more...

Contributor
@ngoldbaum Oct 31, 2024

I guess this is just a fundamental issue with GILOnceCell being racy if the code it wraps ever drops the GIL.

EDIT: jinx!
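A small sketch of the race both of these comments describe, assuming PyO3's `GILOnceCell` API (`get_or_init`): the cell relies on holding the GIL rather than a lock for mutual exclusion, so an initializer that releases the GIL lets a second thread observe the cell as still empty and run the initializer again. The cached-object example below is illustrative, not the code from `test_declarative_module`.

```rust
use pyo3::prelude::*;
use pyo3::sync::GILOnceCell;

static CACHED: GILOnceCell<Py<PyAny>> = GILOnceCell::new();

// Illustrative only: returns some cached Python object, initializing it on first use.
fn get_cached(py: Python<'_>) -> Py<PyAny> {
    CACHED
        .get_or_init(py, || {
            // If anything in here drops and re-acquires the GIL -- e.g. `allow_threads`,
            // or blocking on the new `Once` used for PyErr normalization -- another
            // thread can observe the cell as empty and enter this closure too.
            py.allow_threads(|| { /* other threads may run here */ });
            py.None()
        })
        .clone_ref(py)
}
```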

Member Author

I've opened #4676; if I apply that patch on this branch, the problem goes away.

src/err/err_state.rs (outdated review thread, resolved)
@davidhewitt
Member Author

Well, this passes CI, but somewhat expectedly, the additional complexity added here brings performance overhead.

I think we either merge and release this, or we push forward on #4584 and remove the lazy state entirely, which I think still needs some additional development but should hopefully be simpler and faster.

Given that 0.23 is already overdue and doesn't need more breaking changes (which #4584 would be), I think we should merge this and look hard at optimising errors on the way to 0.24.

@ngoldbaum
Contributor

I wanted to understand a bit about where the slowdown was coming from, and also hopefully figure out how to do performance analysis on Rust code.

I ended up generating the attached flamegraph after modifying the benchmark to run a million times inside a for loop to improve the statistics.

So a microbenchmark that fails to extract a tuple element and raises an error for it is slower, and we can see it's spending some time inside the Once, which it didn't have to do before.

I agree with David that this is acceptable given the safety issues we ran into elsewhere. We won't need to worry about re-acquiring the GIL or deadlocking if we can just get rid of normalization entirely and create an error while we initially have a handle on the runtime.
[flamegraph attachment]
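For context, a hedged sketch of the kind of profiling harness described above: the benchmark body repeated many times so the failing-extract error path dominates the flamegraph. The exact extraction is illustrative, not the benchmark from the suite, and running it assumes an embedded interpreter (e.g. pyo3's `auto-initialize` feature).

```rust
use pyo3::prelude::*;

fn main() {
    Python::with_gil(|py| {
        // `None` is not a list, so every extract fails and constructs a PyErr,
        // exercising the lazy error state on each iteration.
        let not_a_list = py.None().into_bound(py);
        for _ in 0..1_000_000 {
            let err = not_a_list.extract::<Vec<i32>>().unwrap_err();
            drop(err);
        }
    });
}
```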

@davidhewitt
Member Author

Yes, I think a huge chunk of that flamegraph can be made to vanish by not having this lazy layer. But that's a problem for 0.24 now, sigh.

@davidhewitt
Member Author

Will proceed to merge this so that we're one final hop closer to that release!

@davidhewitt added this pull request to the merge queue Nov 4, 2024
github-merge-queue bot pushed a commit that referenced this pull request Nov 4, 2024
* make `PyErrState` thread-safe

* fix clippy

* add test of reentrancy, fix deadlock

* newsfragment

* fix MSRV

* fix nightly build

---------

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>
@davidhewitt mentioned this pull request Nov 4, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 4, 2024
@ngoldbaum added this pull request to the merge queue Nov 5, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 5, 2024
@ngoldbaum added this pull request to the merge queue Nov 5, 2024
Merged via the queue into PyO3:main with commit 9f955e4 Nov 5, 2024
43 of 44 checks passed