Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show segfault in embed scoped_interpreter test #4500

Closed
wants to merge 2 commits into from

Conversation

PhilipDeegan
Copy link
Contributor

Hey, we've noticed we're getting a segfault in the destructor of the scoped_interpreter.

I've updated a test of yours to force a full lifecycle of the scoped_interpreter class before anything else.

I've run this on my own fork and it does segfault.

It's possible I have misunderstood something so please let me know if this is not a smart thing to be doing.

@rwgk
Copy link
Collaborator

rwgk commented Feb 8, 2023

Hey, we've noticed we're getting a segfault in the destructor of the scoped_interpreter.

Interesting ... I don't have good ideas (but I'm also not an embedding expert, although I fixed up a thing here and there recently).

We have lots of py::scoped_interpreter that go out of scope in test_interpreter.cpp, I wonder what's special about the situation you created with this PR.

Another question: did this work before? With an older version of pybind11? If yes, when did it break? (git bisect?)

But maybe the first thing to try: run in a debugger and look at the stack trace. Could you please try and post the stack trace here?

@PhilipDeegan
Copy link
Contributor Author

hi,

did this work before?

It was working as of a yesterday.

locally with the following

#include "pybind11/embed.h"

int main(){
    pybind11::scoped_interpreter g;
}

I get this stack trace

801 Python/pystate.c: No such file or directory.
#0  0x00007ffff7d002a6 in new_threadstate (interp=0x0) at Python/pystate.c:801
#1  0x00007ffff7d00267 in PyThreadState_New (interp=<optimized out>) at Python/pystate.c:855
#2  0x00007ffff7bc0a56 in PyGILState_Ensure () at Python/pystate.c:1685
#3  0x000055555555fe75 in pybind11::detail::get_internals()::gil_scoped_acquire_local::gil_scoped_acquire_local() (this=0x7fffffffdc94) at p/include/pybind11/detail/../detail/internals.h:438
#4  0x0000555555560969 in pybind11::detail::get_internals () at p/include/pybind11/detail/../detail/internals.h:443
#5  0x0000555555560e54 in pybind11::detail::local_internals::local_internals (this=0x5555555af3d0) at p/include/pybind11/detail/../detail/internals.h:529
#6  0x0000555555560faa in pybind11::detail::get_local_internals () at p/include/pybind11/detail/../detail/internals.h:548
#7  0x000055555556931a in pybind11::finalize_interpreter () at p/include/pybind11/embed.h:264
#8  0x0000555555569444 in pybind11::scoped_interpreter::~scoped_interpreter (this=0x7fffffffdeef, __in_chrg=<optimized out>) at p/include/pybind11/embed.h:313
#9  0x000055555555af66 in main () at cpp.cpp:5

If I try a commit from Dec 1st 2022 it works

Author: Aaron Gokaslan <aaronGokaslan@gmail.com>
Date:   Thu Dec 1 15:15:47 2022 -0500

    chore: Convert direct multiprocessing.set_start_method("forkserver") call to a pytest fixture. (#4377)
    
    * chore: convert multiprocessing set_spawn to fixture in pytest
    
    * Switch to early return

@PhilipDeegan
Copy link
Contributor Author

PhilipDeegan commented Feb 8, 2023

If I pop off the top two commits it no longer segfaults

@rwgk
Copy link
Collaborator

rwgk commented Feb 8, 2023

If I pop off the top two commits it no longer segfaults

Then I'd say it's pretty much certain, #4459 is the culprit.

That one came without a test. I'd say let's roll it back until we understand why it's causing the segfault here.

@rwgk
Copy link
Collaborator

rwgk commented Feb 8, 2023

Probably the trouble really started with #3744.

Apparently that broke atexit, which led to #4459.

The question to solve: how can we make this PR and atexit work?

We need a volunteer to work on this, ideally this time around with full unit testing, even if it's not easy.

@Skylion007 I'll create the rollback PR.

rwgk pushed a commit to rwgk/pybind11 that referenced this pull request Feb 8, 2023
rwgk pushed a commit that referenced this pull request Feb 8, 2023
@PhilipDeegan
Copy link
Contributor Author

PhilipDeegan commented Feb 8, 2023

Do you want to merge this, or something similar to prevent these kinds of issues in future?
You can see things are passing now that that commit has been reverted.

Feel free to close otherwise.

@Skylion007
Copy link
Collaborator

@PhilipDeegan Absolutely, if you could rebase and this doesn't SEGFAULT, then it would be a nice addition to the test suite (although please add a commit referencing the issue).

@Skylion007
Copy link
Collaborator

Skylion007 commented Feb 8, 2023

So I did some digging through the stack trace and found what's going on a bit @rwgk. The issue is the detail::get_local_internals() cannot be called safely after Py_Finalize():

detail::get_local_internals().registered_types_cpp.clear();

There is some serious nasty static variable behavior in our code right now. We construct local_internals and store it in a static variable.

static auto *locals = new local_internals();

The ctor for local_internals() has multiple calls to the Python API though which is not safe when the interpreter is not initialized! It seems in practice to be initialized on the first call to get_local_internals(), which with atexit fix is after Py_Finalize(). The first thing it will do is call get_internals() which will try to do a GIL lock before calling Python calls. It can't do this though because the interpreter isn't initialized! Hence the segfault:

gil_scoped_acquire_local() : state(PyGILState_Ensure()) {}

The real issue is that the initial construction of local_internals is unsafe, UB and this is exposing it. The hotfix for this is just to put that previous change back in, but cache a reference to the output of get_local_internals() before Py_Finalize() is called then clear the C++ type_info after Py_Finalize(). Or safer yet, cache references to the C++ maps themselves and only clear them.

@Skylion007
Copy link
Collaborator

Actually @PhilipDeegan @rwgk, I introduced your test case into our test suite and fixed the atexit bug in #4505.

@Skylion007
Copy link
Collaborator

Hmm, I spoke too soon. Seems like some compilers are exploiting the UB for an optimization which is causing them to still be broken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants