-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flaky datetime test #720
Comments
When developing pyO3 on Windows with 3.8 the test wobbles for me occasionally too. |
This is a real head-scratcher. I would expect possible "flakiness" for the hypothesis-based tests based on rare edge-cases, but Maybe we should start running the tests with My immediate possible thoughts are some sort of thread-safety issue or possibly another test setting the Python exception state and returning without failing to check it (which would then raise an exception in whatever the next thing to run is that checks the exception state). One possibility is that this is bleeding over from some issue with these out-of-bounds tests, but that shouldn't be Windows-specific. |
I had this fail on me on CI earlier today. I think it was in a Linux environment?! |
I don't think so - maybe WSL or something, but a lot of the paths use I can't really make heads or tails of that backtrace. I guess what's happening is that something in I assume it's some sort of thread-safety issue. Maybe Anyone with a Windows machine want to try and make this slightly more reproducible? It might help to use a Barrier in all the different test functions to try and more readily induce race conditions. We might be able to make this less flaky by locking around |
Frustratingly I can't see anywhere in that output the result of |
I'm tempted to run the test suite in a loop on my machine this morning and see how long it takes to get a failure 😅 |
I ran it for a bit and added some debug printing too. There seems to be a race going on where multiple tests are making progress at the same time. This seems very odd given that all of the tests are supposedly acquiring the GIL... ?
Gotta head to work now, but I might pick this up again at a later date... |
...might be related to #288? |
This is an easy mistake to make (I made it when first building the This is important to keep in mind because it can lead to deadlocks if you attempt to layer on additional locking mechanisms - see this comment about how we avoid deadlocks in datetime. With regard to "multiple tests making progress at the same time", I think that's perfectly fine, but it seems like maybe the issue is that in If that's the problem, I'm kinda surprised we haven't seen this problem before now, and that it's only happening on Windows machines. If that turns out to be the issue, it should be pretty simple to either give these things unique names or put them in a unique namespace. |
I tried to make the names unique for each test: https://github.com/Alexander-N/pyo3/pull/3/files but the test still fails https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=420934780. |
Strange. Also in the exception message I can't see the traceback being printed out, might need to run the tests with |
Not sure if this is related to buffering, sometimes I get a NameError like:
https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=421093043 And sometimes not: https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=421092950 |
Ah yeah, probably not buffering. It's I wonder, could it be some kind of ownership issue where these things have a zero reference count and are getting GC'd out from underneath us? I don't see how that could be possible though... |
I think that's exactly whats happening. In my branch the error is also reproducible on Linux and does disappear if I comment out the releasing of owned objects when the GILGuard is dropped: pub unsafe fn drain(&mut self, _py: Python, owned: usize, borrowed: usize, pointers: bool) {
// Release owned objects(call decref)
// while owned < self.owned.len() {
// let last = self.owned.pop_back().unwrap();
// ffi::Py_DECREF(last.as_ptr());
// }
// Release borrowed objects(don't call decref)
self.borrowed.truncate(borrowed);
if pointers {
self.release_pointers();
}
self.obj.clear();
} Depending on the timing of the thread exits I either get a NameError or SIGSEVs like these:
|
Oh crikey. I guess what must be happening is that when the first test finishes and drops its acquired A fix could potentially be to make the release pools thread-local? Or to have some kind of reference counting? If we eliminate the release pool as part of #679 then that might fix this too. |
The error seems to occur only if python modules are imported. I think what happens is something along the lines of: The imports release the GIL and the wrong object is popped from the release pool. For example this reproduces it fairly consistently for me: #[test]
fn test_release_race1() {
let gil = Python::acquire_gil();
let py = gil.python();
println!("make ob1");
let ob1 = py.eval("object()", None, None).unwrap();
py.import("datetime").unwrap();
py_run!(py, ob1, "print('ob1' + str(ob1))");
}
#[test]
fn test_release_race2() {
let gil = Python::acquire_gil();
let py = gil.python();
println!("make ob2");
let ob2 = py.eval("object()", None, None).unwrap();
py.import("datetime").unwrap();
py_run!(py, ob2, "print('ob2' + str(ob2))");
} |
After reading #720 (comment) again, I'll try to make the idea clearer:
Does this sound plausible? |
This sounds exactly like what I'm thinking must be going on. Presumably it would be possible to verify this ordering by adding a few more |
Though I think:
What actually happens is that both |
I thought that if there was one owned object in the release pool when the second test acquires it's GILGuard then while owned < self.owned.len() {
let last = self.owned.pop_back().unwrap();
ffi::Py_DECREF(last.as_ptr());
} |
Ah, sorry, yes, you're right! I should have read that code closer 😄 |
@Alexander-N @davidhewitt thank you for useful discussions!
OMG, it sounds serious problem to me 🤔 |
Agreed! This seems like another reason to consider replacing the release pool with a better solution. (Along with #679) |
This is quite a long thread now and it's sorta misleadingly titled since "Flaky datetime test" is just the proximate symptom, not the root problem to be solved. Might I suggest the following course of action:
That way people won't get hit with erroneous failures, and if we lose momentum on solving this issue, there will be a cleaner "entry point" for new contributors rather than reading through a long thread ostensibly about a flaky test. Good work everyone! |
Good point, I opened #756. |
The test
test_datetime_check
fails from time to time with an error likeIt could be that this occurs only on Windows with Python 3.6: https://github.com/PyO3/pyo3/pull/719/checks?check_run_id=384307686 and #651 (comment).
@pganssle Do you have any idea what could cause this?
The text was updated successfully, but these errors were encountered: