Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: test_factory_constructors.py failure triggered by test_register_duplicate_class #2564

Merged
merged 7 commits into from
Oct 16, 2020

Conversation

YannickJadoul
Copy link
Collaborator

@YannickJadoul YannickJadoul commented Oct 8, 2020

To reproduce, this fails on the centos:8 docker image (with development tools installed - see ci.yml - but without per se installing nvidia compilers), by running PYTEST_ADDOPTS="-k 'test_register_duplicate_class or test_init_factory_alias'" cmake --build build_centos8 --target pytest

Cross-reference #2335

EDIT: Figured out what's wrong. Short description of the bug:

  • test_class.py::test_register_duplicate_class registers py::class_ instances in different scopes, a module and a class (to test whether duplicate class names or types are caught; but that's not really the issue here).
  • In this test, the actual scopes passed to the py::class_ constructor are not referenced anymore after the scope of the test, and they go out of scope and get garbage collected. Since the py::class_es are only referenced by these scopes that are getting garbage collected, they also get garbage collected. However, they remain in the internals' registered_types_py.
  • When Python is allocating a new type, it reuses the old memory location of the garbage collected type object from before.
  • So, the same PyTypeObject pointer in registered_types_py now points to a new Python type, but still associates this with the old type_info (it is likely important that the new type allocated in "recycled" memory is a non-pybind11 type, in the case I debugged; if not, the associated type_info in registered_types_py would be overwritten).
  • pybind11 doesn't understand things anymore, because it's allocating the new type, but there's now a mismatch between the Python and C++ type, because of the stale PyTypeObject* in registered_types_py.

Current solution:

  • I copied the weakref from detail::all_type_info_get_cache. I still want to check if there's a better way of solving this, but this solves the problem.

Possible follow-up:

  • Now that stale classes get removed from the internals, we could think about not checking whether we're overwriting an already existing attribute in class_. I'll give this some more thought and I'll make a PR to further discuss this.

@henryiii
Copy link
Collaborator

henryiii commented Oct 8, 2020

I wonder if this is related to #2558 ? Edit: Sadly it didn't fix this one.

@YannickJadoul
Copy link
Collaborator Author

I wonder if this is related to #2558 ?

It would be amazing if it is (and we only need to solve 1 issue instead of 2), but I wouldn't really know why it would be?

@henryiii
Copy link
Collaborator

henryiii commented Oct 8, 2020

https://github.com/pybind/pybind11/pull/2564/checks?check_run_id=1227843279 looks suspicious to me. Notice the second test that failed (though not a segfault).

@YannickJadoul YannickJadoul force-pushed the factory-invalid-self-argument branch 4 times, most recently from 3752dd6 to d8eb801 Compare October 9, 2020 18:03
@YannickJadoul YannickJadoul changed the title Demonstrate test_factory_constructors.py failure without functional changes from #2335 Fix test_factory_constructors.py failure triggerd by test_register_duplicate_class Oct 9, 2020
@YannickJadoul YannickJadoul self-assigned this Oct 9, 2020
@YannickJadoul YannickJadoul changed the title Fix test_factory_constructors.py failure triggerd by test_register_duplicate_class Fix test_factory_constructors.py failure triggered by test_register_duplicate_class Oct 9, 2020
@henryiii henryiii changed the title Fix test_factory_constructors.py failure triggered by test_register_duplicate_class fix: test_factory_constructors.py failure triggered by test_register_duplicate_class Oct 9, 2020
@henryiii henryiii added this to the v2.6.0 milestone Oct 12, 2020
@EricCousineau-TRI EricCousineau-TRI requested review from EricCousineau-TRI and removed request for EricCousineau-TRI October 12, 2020 19:09
@wjakob
Copy link
Member

wjakob commented Oct 12, 2020

That's a nice catch. One word of caution about py::weakref: the garbage collector is a complex beast, and although the weakref callback will eventually be invoked, it's very difficult to guarantee that this happens at any particular point in time, or with a particular ordering compared to other stuff that is happening. Although it was a really neat trick, we had to get rid of weak references in the implementation of py::keep_alive (#856) for similar reasons. Based on these experiences, it seems to me that you might just be replacing one kind of undefined behavior by another (albeit likely much more rare).

@YannickJadoul
Copy link
Collaborator Author

YannickJadoul commented Oct 12, 2020

Hmmm, interesting. I'll still give that a look!

I did think of this, but it seemed reasonably deterministic, the way it's described in the Python docs (both normal as well as C API):

  • The Python docs for weakref.ref (https://docs.python.org/3/library/weakref.html#weakref.ref):

    If callback is provided and not None, and the returned weakref object is still alive, the callback will be called when the object is about to be finalized; the weak reference object will be passed as the only parameter to the callback; the referent will no longer be available.

  • The Python docs for PyWeakref_NewRef (https://docs.python.org/3/c-api/weakref.html#c.PyWeakref_NewRef):

    The second parameter, callback, can be a callable object that receives notification when ob is garbage collected; it should accept a single parameter, which will be the weak reference object itself.

So I understood it's not linked to the garbage collector running on the weakref, but on the reference object? (Which is fine, because as long as that one's not collected, Python won't recycle the memory anyway.)

But I'll look into #856 and try to convince myself that we're good!

@YannickJadoul
Copy link
Collaborator Author

I did think of this, but it seemed reasonably deterministic, the way it's described in the Python docs (both normal as well as C API):

Scratch all that. Should've read the issue first; this is not really related.

@wjakob, @henryiii, @EricCousineau-TRI, I still think we should be safe, because we're not too concerned about order, here? Worst case that could happen is some type gets removed from the internals, a bit before that type is actually destructed, but áfter the last reference to this type is gone? (Except if there's some weird cycle, perhaps, and the type is still used in the destructor? But I'm not sure we could fix that anyway, given the arbitrary destruction order of a cycle?)

We could try a similar approach to #856; I thought about that before seeing the weakref approach, but it would involve the metaclass, so we'd again bump into potential issues when using custom metaclasses? :-/

@henryiii
Copy link
Collaborator

I think this is fine, as it fixes the issue we are seeing, though if something else comes up, we can revisit?

@YannickJadoul
Copy link
Collaborator Author

I think this is fine, as it fixes the issue we are seeing, though if something else comes up, we can revisit?

Let's give @EricCousineau-TRI still a chance to review? He said he might have some time tomorrow. And I have a feeling he's been playing in these parts of pybind11 before.

@YannickJadoul
Copy link
Collaborator Author

YannickJadoul commented Oct 12, 2020

This is in internals:

type_map<type_info *> registered_types_cpp; // std::type_index -> pybind11's type information
std::unordered_map<PyTypeObject *, std::vector<type_info *>> registered_types_py; // PyTypeObject* -> base type_info(s)
std::unordered_multimap<const void *, instance*> registered_instances; // void * -> instance*
std::unordered_set<std::pair<const PyObject *, const char *>, override_hash> inactive_override_cache;
type_map<std::vector<bool (*)(PyObject *, void *&)>> direct_conversions;
std::unordered_map<const PyObject *, std::vector<PyObject *>> patients;
std::forward_list<void (*) (std::exception_ptr)> registered_exception_translators;
std::unordered_map<std::string, void *> shared_data; // Custom data to be shared across extensions
std::vector<PyObject *> loader_patient_stack; // Used by `loader_life_support`
std::forward_list<std::string> static_strings; // Stores the std::strings backing detail::c_str()
PyTypeObject *static_property_type;
PyTypeObject *default_metaclass;
PyObject *instance_base;

What else should we clear out? registered_types_cpp as well, right?

EDIT(eric): Made it a permalink code ref.

@YannickJadoul
Copy link
Collaborator Author

Thinking a bit further: is that the behavior we actually want?

I.e., in this case, once the Python type object (the one of the class_) gets garbage collected, it's unregistered. So it would also mean you can't go from C++ to Python, anymore. So you couldn't register an "anonymous" type with class_ that you're only returning from C++ to Python (unless you explicitly keep it alive on the C++ side).

Or, you register a C++ type Foo as my_lovely_module.Bar Python type, and a user manage to clear all references to my_lovely_module.Bar, and suddenly, a function returning Foo will not be able to convert its return type. Are we happy with that?

(To be clear, the above scenario currently results in a segfault, so at least it's better. But the other option would be to keep C++-registered type alive forever.)

@henryiii
Copy link
Collaborator

I think this is good for now, as it's at least better, and I don't think deleting class objects is common (except in some tests). And if you go to the trouble of deleting all references to a class, would you expect it to magically reappear if you tried to do a conversion afterwords?

@EricCousineau-TRI
Copy link
Collaborator

EricCousineau-TRI commented Oct 13, 2020

Are we happy with that?

Agreed with Henry. If a user manages to clear out references, then they probably have a reason to do so, so we shouldn't get in the way.

Additionally, a user could prevent that behavior by stashing a reference to the module (or type), preventing garbage colllection:

m.attr("_deps") = py::make_tuple(
  py::module::import("my_lovely_module"),
  ...
);

This is effectively what happens in nominal Python anyways.

Copy link
Collaborator

@EricCousineau-TRI EricCousineau-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just some minor comments.

include/pybind11/pybind11.h Outdated Show resolved Hide resolved
include/pybind11/pybind11.h Outdated Show resolved Hide resolved
@YannickJadoul
Copy link
Collaborator Author

And if you go to the trouble of deleting all references to a class, would you expect it to magically reappear if you tried to do a conversion afterwords?

Additionally, a user could prevent that behavior by stashing a reference to the module (or type), preventing garbage colllection:

True, OK. I was just afraid of unexpected surprises where suddenly a functions return type can't be returned anymore. In Python you'd never have that, because there'd still be a reference to it, when creating the object.

But I'm fine with waiting for complaints :-)

@wjakob
Copy link
Member

wjakob commented Oct 15, 2020

After taking another look, I am now convinced that the weak reference will work. That said, it seems far too heavy to create cpp_function for each type to set up a unique destruction callback. How about creating this callback method once in internals and then just initializing a single weak reference in cpp_type?

This would imply that the implementation has to be sufficiently generic -- it must e.g. map from the type to std::type_index and then figure stuff out from there.

@YannickJadoul
Copy link
Collaborator Author

YannickJadoul commented Oct 15, 2020

Tangentially related, it seems that there is a function

inline std::pair<decltype(internals::registered_types_py)::iterator, bool> all_type_info_get_cache(PyTypeObject *type) {

that is not used anywhere in pybind11, and which does some suspiciously related things.

That's where I got this approach, yes. I got reasonably confused for a while on whý things weren't getting cleaned up.

It is used here, though, I think:

const std::vector<detail::type_info *> &all_type_info(PyTypeObject *type) {
    auto ins = all_type_info_get_cache(type);
    if (ins.second)
        // New cache entry: populate it
        all_type_info_populate(type, ins.first->second);

    return ins.first->second;
}

@wjakob
Copy link
Member

wjakob commented Oct 15, 2020

@wjakob We do need access to the specific Python type object, though. And as far as I tried, you can't add arbitrary attributes to a weakref?

https://docs.python.org/3/library/weakref.html#weakref.ref says that

If callback is provided and not None, and the returned weakref object is still alive, the callback will be called when the object is about to be finalized; the weak reference object will be passed as the only parameter to the callback; the referent will no longer be available.

That is an excellent point, I had forgotten about that. Then how about catching things at a lower level: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_finalize, in this case in the metaclass used to construct pybind11 types?

@YannickJadoul
Copy link
Collaborator Author

That is an excellent point, I had forgotten about that. Then how about catching things at a lower level: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_finalize, in this case in the metaclass used to construct pybind11 types

That is the other option, yes. The detail here is that using a custom metaclass through py::metaclass(...) again will result in issues? And I preferred the non-invasive method of weakref, but maybe it's fine like this, for now?

@YannickJadoul
Copy link
Collaborator Author

it's fine like this, for now?

By which I mean: going with tp_finalize, and checking later how we can safely enable custom metaclasses?

@YannickJadoul
Copy link
Collaborator Author

@YannickJadoul YannickJadoul force-pushed the factory-invalid-self-argument branch from cfada0c to 51978a8 Compare October 15, 2020 22:03
@YannickJadoul YannickJadoul force-pushed the factory-invalid-self-argument branch from 2b8a3f4 to ae96b37 Compare October 15, 2020 22:38
@YannickJadoul
Copy link
Collaborator Author

@wjakob, @rwgk, @EricCousineau-TRI, @henryiii, I moved the cleanup function to tp_dealloc in the metaclass.

Please have a look and compare this PR with and without this last commit ae96b37, to compare the two approaches.

@YannickJadoul
Copy link
Collaborator Author

@rwgk I fixed the detail::type_info *tinfo leak you found, in both versions. I've done it in a slightly different way as originally suggested, but I think this is more elegant.

@henryiii
Copy link
Collaborator

I take it the second approach breaks custom meta classes that don't also call this?

@YannickJadoul
Copy link
Collaborator Author

I take it the second approach breaks custom meta classes that don't also call this?

Yes. Then again, they'd only be as broken as the current situation (which has been there for a long time, if not forever, as far as I know?).

@bstaletic
Copy link
Collaborator

I was going to ask if custom metaclasses ever worked in combination with pybind11.

@YannickJadoul
Copy link
Collaborator Author

I was going to ask if custom metaclasses ever worked in combination with pybind11.

Kind of (there is at least the py::metaclass tag), but anything that's defined in our pybind11-metaclass is of course not there by default (as far as I know, these are things like static properties, the extra check of __init__ having been called, ...)

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR passed the Google-global testing.

@henryiii
Copy link
Collaborator

Then again, they'd only be as broken as the current situation

Okay, either approach is fine with me then.

@wjakob
Copy link
Member

wjakob commented Oct 16, 2020

Beautiful, thank you for the great work @YannickJadoul. I think that we can likely install a similar callback also for custom metaclasses, but this a much less important point that is not a blocker for v.2.6.0 (this kind issue revolving around garbage collection of pybind11-created types is rare enough in practice that we are just bothered by it now, and custom metaclasses are a super-rarely used feature of pybind11). Pr{A ∩ B}≈ Pr{A}·Pr{B}=tiny number, I hope 🙂.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants