Fix Unicode support for ostream redirects #2982
Conversation
Wow, thanks!
I see the Format action is failing. Could you please run this command and then git push to update this PR?
You may need to run this first (one-time):
Unless you really know what you are doing, never use sudo with pip. Either use a virtual env, user directory, or pipx (best).
I didn't format before committing because that seemed to result in many changes, not just in the parts I changed. I'll format the code and push again. I also just noticed a problem with my logic: I have to make sure that …
The latest CI run found a valgrind error, I think in the new code.
ASAN agrees with valgrind, which means it's almost certainly a real issue. Note that I applied this PR on top of the smart_holder branch, therefore some line numbers may be off.
When I saw the …

```diff
  // Computes how many bytes at the end of the buffer are part of an
  // incomplete sequence of UTF-8 bytes.
  // Precondition: pbase() < pptr()
  size_t utf8_remainder() const {
+     assert(pbase() < pptr());
      const auto rbase = std::reverse_iterator<char *>(pbase());
      const auto rpptr = std::reverse_iterator<char *>(pptr());
```
I have no clue what I'm doing, but guarding the …
Sadly, TSAN (thread sanitizer) is still failing (i.e. this PR does not appear to fix #2754).
Thanks for taking the time to look into it! At first sight, this seems to be a concurrency bug as well: the check here (pybind11/include/pybind11/iostream.h, line 91 at 45a91d8) should ensure that the precondition holds, but if another thread enters the `_sync` function concurrently, it might update the `pptr` after this check and before the current thread acquires the GIL.
I believe doing the check again after acquiring the GIL should fix this (which is equivalent to the …
That's what I was thinking, too. Note though that I'm pretty clueless about threads. I stared at the TSAN log for a while, after inserting a few more …
I am, but I don't have any experience with multithreading using the Python C API. From what I understand, the GIL ensures mutual exclusivity, so only one thread can be in the block between construction and destruction of the …

However, acquiring the GIL in the …
That sounds like the current implementation is fundamentally not suited for use with multithreading (which is what TSAN is telling us, too). Is there a way to error out cleanly? I feel a bit uneasy about providing iostream.h when we know it has this flaw. But IIUC this PR is valuable nevertheless, for single-threaded situations. We just need to get around the valgrind/ASAN issue. What makes sense? My simple-minded …
I quickly had a look at #2754, and I believe not using the three put area pointers of …

This locking would have to be done carefully, though, since the …

On the one hand, acquiring a mutex for every print statement seems detrimental to performance and parallelism, but on the other hand, this is exactly what the OS does whenever you call …

I noticed that C++20 has a new type of …

As a final thought, even though the standard guarantees that writing to …
Yes, it does fix crashes in single-threaded code that prints UTF-8.
Fundamentally, as long as the concurrency bugs aren't fixed, there's no way to rule out buffer overflows entirely, even with the extra …
Still, because we might spend time waiting on the GIL, the chance of this happening is much higher in the …

I'll push a fix with the size check later today.
This doesn't solve the actual race, but at least it now has a much lower probability of reading past the end of the buffer even when data races do occur.
Thanks Pieter! I just triggered a new CI run and will also try again with our sanitizers (but I think it'll be fine now). I'm intrigued by what you wrote before:
What do you think about changing test_iostream.cpp accordingly? — The current situation of testing something we're sure is broken doesn't make much sense (IIUC). In a separate PR? Could you help out? Or outline what we need to do / guide me?
Hi @tttapa and @henryiii, in hopes of getting rid of the CI flake, I went ahead with a very simple experiment (applied on top of this PR): Simple-minded systematic insertion of std::lock_guard before all std::cout, std::cerr. Is this an OK way to "provide your own synchronization" as Pieter suggested? EDIT: I forgot to mention, this fixes the TSAN error reported under #2754.
Consider the following code:

```cpp
#include <iostream>
#include <thread>

int main() {
    auto task = [](const char *thrd_name, char data) {
        for (char c = data; c < data + 6; c++)
            std::cout << "Thread " << thrd_name << " says: " << c << "\r\n";
    };
    std::jthread t1(task, "Fred", 'a');
    std::jthread t2(task, "Bob", 'A');
}
```

IIUC, the standard guarantees that this is not undefined behavior. Still, when I ran it, I got:
So even though there are no data races, and each character I sent to …

Given the fact that this synchronization has to be provided by the user if they want meaningful results, I think it's not unreasonable to require pybind11 users to lock all (possibly) concurrent writes to …

However, if a user does write to …

If you want to provide the same guarantees as the standard, i.e. concurrent writes result in interleaved characters but no races, then you'll have to lock in pybind11's implementation of …

With respect to the changes you made to the tests, I'm not sure if that's necessary. Since most tests are single-threaded, there's no need for the locks there. In the threaded test, it kind of defeats the point of the test: by adding the lock around the …

I'm not sure where to go from here, though. In a perfect world, you would implement a lock in …
Thanks a lot for the great explanation!
I don't think that difference is worth anyone's time, or the added code complexity.
Absolutely. I will merge your change now. — I feel it's best to document that iostream.h only works with proper user-side locking, and to completely remove the threaded test along with removing the duplicate …
Description
When printing UTF-8 data to a redirected ostream (using `pybind11::scoped_ostream_redirect`), Python randomly crashes. This happens because the `pybind11::pythonbuf` class is not Unicode-aware: it only sees bytes and converts them to a Python string directly. When there is an unfinished multibyte UTF-8 sequence at the end of the buffer, the `PyUnicode_FromStringAndSize` function detects that it is not a valid Unicode string and returns `NULL`, resulting in `pybind11_fail("Could not allocate string object!")` being called in the `pybind11::str` constructor, raising an uncaught exception that takes down the entire application.

The first commit (bf96760) adds some failing test cases to demonstrate the problem. For instance:
This results in the test being aborted entirely:
Solution
The second commit (4d655b0) fixes this issue by inspecting (up to) the last 3 bytes of the buffer to check whether they are part of an unfinished multibyte UTF-8 sequence. If they are, they are not included in the string forwarded to Python, and these bytes are then moved to the beginning of the buffer so they can be completed after the buffer is flushed.
If the last byte is a normal ASCII character, there is no difference.
If the buffer contains invalid Unicode, this is not handled by my fix, and it still triggers the same error in the `pybind11::str` constructor. Maybe this can be improved upon as well; crashing the entire program because of a Unicode error in the output seems a bit excessive to me. Maybe the error message can be made more specific as well? `Could not allocate string object!` is not a clear error message for a Unicode problem, and I spent considerably more time debugging why my Python program was randomly aborted than I'd like to admit ...

Further improvements and considerations
This solution was tested on Linux only. I'm not familiar with the intricacies of how `std::cout` is used with Unicode strings on Windows (since Windows uses UTF-16 and `wchar_t` for most strings). I didn't test this on Python 2.7 either; I imagine a different strategy might be needed since Python 2 handles Unicode differently.
Suggested changelog entry:
Fixed exception when printing UTF-8 to a ``scoped_ostream_redirect``.