Fix Unicode support for ostream redirects #2982

Merged
merged 5 commits into from
May 4, 2021

Conversation

tttapa
Contributor

@tttapa tttapa commented Apr 29, 2021

Description

When printing UTF-8 data to a redirected ostream (using pybind11::scoped_ostream_redirect), Python randomly crashes.
This happens because the pybind11::pythonbuf class is not Unicode-aware: it only sees bytes and converts them to a Python string directly. When there is an unfinished multibyte UTF-8 sequence at the end of the buffer, the PyUnicode_FromStringAndSize function detects that it is not a valid Unicode string and returns NULL, resulting in pybind11_fail("Could not allocate string object!") being called in the pybind11::str constructor, which raises an uncaught exception that takes down the entire application.

The first commit (bf96760) adds some failing test cases to demonstrate the problem. For instance:

def test_captured_utf8_2byte_offset1(capsys):
    msg = "\u07FF"
    msg = "1" + msg * (1024 // len(msg) + 1)

    m.captured_output_default(msg)
    stdout, stderr = capsys.readouterr()
    assert stdout == msg
    assert stderr == ""

This results in the test being aborted entirely:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Could not allocate string object!
Fatal Python error: Aborted
Aborted (core dumped)
FAILED: tests/CMakeFiles/pytest 

Solution

The second commit (4d655b0) fixes this issue by inspecting (up to) the last 3 bytes of the buffer to check whether they are part of an unfinished multibyte UTF-8 sequence. If they are, they are not included in the string forwarded to Python, and these bytes are then moved to the beginning of the buffer so they can be completed after the buffer is flushed.
If the last byte is a normal ASCII character, there is no difference.
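
The core of the check can be illustrated with a small, self-contained sketch. This is not the exact code from the PR (the function name, free-function signature, and pointer arguments are chosen just for this example); it only shows the idea of scanning back at most 3 bytes for a lead byte whose sequence is not yet complete:

#include <cstddef>
#include <cstdint>

// Returns how many bytes at the end of [begin, end) belong to an incomplete
// multibyte UTF-8 sequence (0 if the buffer ends on a complete character).
std::size_t utf8_remainder(const char *begin, const char *end) {
    const auto len = static_cast<std::size_t>(end - begin);
    // A UTF-8 sequence is at most 4 bytes, so at most the last 3 bytes can
    // belong to an incomplete one.
    const std::size_t max_lookback = len < 3 ? len : 3;
    for (std::size_t i = 1; i <= max_lookback; ++i) {
        const auto byte = static_cast<std::uint8_t>(end[-static_cast<std::ptrdiff_t>(i)]);
        if ((byte & 0xC0u) == 0x80u) // continuation byte (10xxxxxx): keep scanning back
            continue;
        std::size_t seq_len = 0;
        if ((byte & 0x80u) == 0x00u)      // 0xxxxxxx: 1-byte (ASCII) character
            seq_len = 1;
        else if ((byte & 0xE0u) == 0xC0u) // 110xxxxx: start of a 2-byte sequence
            seq_len = 2;
        else if ((byte & 0xF0u) == 0xE0u) // 1110xxxx: start of a 3-byte sequence
            seq_len = 3;
        else if ((byte & 0xF8u) == 0xF0u) // 11110xxx: start of a 4-byte sequence
            seq_len = 4;
        // If the sequence starting at this lead byte needs more bytes than are
        // present, the last i bytes must be held back until the next flush.
        return seq_len > i ? i : 0;
    }
    return 0; // nothing but continuation bytes (or an empty buffer): forward as-is
}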

If the buffer contains invalid Unicode, this is not handled by my fix, and it still triggers the same error in the pybind11::str constructor. Maybe this can be improved upon as well; crashing the entire program because of a Unicode error in the output seems a bit excessive to me. Maybe the error message can be made more specific as well? "Could not allocate string object!" is not a clear error message for a Unicode problem; I spent considerably more time debugging why my Python program was randomly aborted than I'd like to admit ...

Further improvements and considerations

This solution was tested on Linux only. I'm not familiar with the intricacies of how std::cout is used with Unicode strings on Windows (since Windows uses UTF-16 and wchar_t for most strings).
I didn't test this on Python 2.7 either; I imagine a different strategy might be needed since Python 2 handles Unicode differently.

Suggested changelog entry:

Fixed exception when printing UTF-8 to a ``scoped_ostream_redirect``.

@rwgk rwgk requested a review from henryiii April 29, 2021 21:17
@rwgk
Collaborator

rwgk commented Apr 29, 2021

Wow, thanks!

  • I've seen those random crashes several times, too.
  • Currently test_iostream.py is our worst CI flake. Maybe this PR will help?
  • I also wonder if this is related: [BUG] TSAN error logs #2754
  • I'll test more later (after I see the CI results here).

@rwgk
Collaborator

rwgk commented Apr 29, 2021

I see the Format action is failing. Could you please run this command and then git push to update this PR?

pre-commit run --all-files

You may need to run this first (one-time):

sudo python3 -m pip install pre-commit

@henryiii
Collaborator

Unless you really know what you are doing, never use sudo with pip. Either use a virtual env, user directory, or pipx (best).

@tttapa
Contributor Author

tttapa commented Apr 29, 2021

I didn't format before committing because that seemed to result in many changes, not just in the parts I changed. I'll format the code and push again.

I also just noticed a problem with my logic: I have to make sure that rpend != leading before dereferencing leading when computing the remainder; otherwise it dereferences the end iterator. I'll fix that as well.

@rwgk
Collaborator

rwgk commented Apr 30, 2021

The latest CI run found a valgrind error, I think in the new code.
I'll also try our sanitizers.

@rwgk
Collaborator

rwgk commented Apr 30, 2021

ASAN agrees with valgrind, which means it's almost certainly a real issue.
I'm attaching the ASAN log.
pr2982_asan_heap-buffer-overflow.txt

Note that I applied this PR on top of the smart_holder branch, therefore some line numbers may be off.

@rwgk
Collaborator

rwgk commented Apr 30, 2021

When I saw the Precondition comment I added that as an assert. It fails:

assert.h assertion failed at third_party/pybind11/include/pybind11/iostream.h:48 in pybind11::size_t pybind11::detail::pythonbuf::utf8_remainder() const: pbase() < pptr()
     // Computes how many bytes at the end of the buffer are part of an          
     // incomplete sequence of UTF-8 bytes.                                      
     // Precondition: pbase() < pptr()                                           
     size_t utf8_remainder() const {                                             
+        assert(pbase() < pptr());                                               
         const auto rbase = std::reverse_iterator<char *>(pbase());              
         const auto rpptr = std::reverse_iterator<char *>(pptr());               

@rwgk
Collaborator

rwgk commented Apr 30, 2021

I have no clue what I'm doing, but guarding the remainder calculation and everything below with if (size) { ... } fixes the ASAN issue, with test_iostream still passing.

@rwgk
Collaborator

rwgk commented Apr 30, 2021

Sadly, TSAN (thread sanitizer) is still failing (i.e. this PR does not appear to fix #2754).

@tttapa
Contributor Author

tttapa commented Apr 30, 2021

ASAN agrees with valgrind, which means it's almost certainly a real issue.
I'm attaching the ASAN log.
pr2982_asan_heap-buffer-overflow.txt

Note that I applied this PR on top of the smart_holder branch, therefore some line numbers may be off.

Thanks for taking the time to look into it!

At first sight, this seems to be a concurrency bug as well:

The check here:

if (pbase() != pptr()) {

should ensure that the precondition holds, but if another thread enters the _sync function concurrently, it might update the pptr after this check and before the current thread acquires the GIL.

I believe doing the check again after acquiring the GIL should fix this (which is equivalent to the if (size) check you added).
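
The pattern can be shown with a self-contained analogy (hypothetical names, not the pybind11 code), where an ordinary mutex plays the role of the GIL: the emptiness test has to be repeated once the lock is held, because another thread may have flushed in the meantime.

#include <iostream>
#include <mutex>
#include <string>

std::mutex gil_stand_in;   // plays the role of the GIL in this analogy
std::string buffer;        // plays the role of the pythonbuf put area

void sync_buffer() {
    // Unlocked pre-check, mirroring the if (pbase() != pptr()) in iostream.h.
    // (As discussed later in the thread, this unlocked read is itself racy.)
    if (!buffer.empty()) {
        std::lock_guard<std::mutex> hold(gil_stand_in);
        // Re-check under the lock: the buffer may already have been flushed.
        if (!buffer.empty()) {
            std::cout << buffer;   // stands in for forwarding to sys.stdout
            buffer.clear();
        }
    }
}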

@rwgk
Collaborator

rwgk commented Apr 30, 2021

should ensure that the precondition holds, but if another thread enters the _sync function concurrently

That's what I was thinking, too. Note though that I'm pretty clueless about threads.

I stared at the TSAN log for a while, after inserting a few more PyGILState_Ensure. It strikes me that there seems to be a data race even though I'm sure I'm holding the GIL. Are you familiar with threading? If I'm in a section of code with the GIL held, is it still possible that another C++ thread is active at the same time? — Oh, I think calling into the Python C API means the GIL could be released and reacquired before the call returns. That could be one way I guess. If that's the problem, what could be a fix?

@tttapa
Contributor Author

tttapa commented Apr 30, 2021

Are you familiar with threading? If I'm in a section of code with the GIL held, is it still possible that another C++ thread is active at the same time?

I am, but I don't have any experience with multithreading using the Python C API.

From what I understand, the GIL ensures mutual exclusivity, so only one thread can be in the block between construction and destruction of the gil_scoped_acquire.

However, acquiring the GIL in the sync function alone is insufficient. When using locks like that, you need to protect every access to the shared state, i.e. all accesses to pptr(), calls to pbump(), sputc(), etc., which doesn't seem to be the case in the current implementation (to be honest, it doesn't seem to be an easy task).

@rwgk
Collaborator

rwgk commented Apr 30, 2021

However, acquiring the GIL in the sync function alone is insufficient.

That sounds like the current implementation is fundamentally not suited for use with multithreading (which is what TSAN is telling us, too). Is there a way to error out cleanly? I feel a bit uneasy about providing iostream.h when we know it has this flaw.

But IIUC this PR is valuable nevertheless, for single-threaded situations. We just need to get around the valgrind/ASAN issue. What makes sense? My simple-minded size check? Could you please try some workaround?

@tttapa
Contributor Author

tttapa commented May 1, 2021

Is there a way to error out cleanly? I feel a bit uneasy about providing iostream.h when we know it has this flaw.

I quickly had a look at #2754, and I believe not using the three put-area pointers of std::streambuf would indeed be the only solution. This means the stream on top always calls either xsputn or overflow, which are both virtual, so we can lock there. This doesn't necessarily mean we lose all buffering: the pythonbuf class could still use a buffer under the hood, just without using std::streambuf::pptr() etc. All accesses to that buffer and the put pointer would then be protected by a mutex.
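
For illustration only, a minimal sketch of that shape (not pybind11's actual code, ignoring the GIL and the deadlock concern below, and with names invented for the example) could look like this:

#include <cstddef>
#include <mutex>
#include <streambuf>
#include <string>

class locking_buf_sketch : public std::streambuf {
protected:
    // With no put area installed (no setp() call), every single-character
    // write goes through overflow() and every bulk write goes through
    // xsputn(), so the mutex can be taken at these virtual entry points.
    int_type overflow(int_type c) override {
        if (!traits_type::eq_int_type(c, traits_type::eof())) {
            std::lock_guard<std::mutex> lock(mutex_);
            buffer_.push_back(traits_type::to_char_type(c));
        }
        return traits_type::not_eof(c);
    }

    std::streamsize xsputn(const char *s, std::streamsize n) override {
        std::lock_guard<std::mutex> lock(mutex_);
        buffer_.append(s, static_cast<std::size_t>(n));
        return n;
    }

    int sync() override {
        std::lock_guard<std::mutex> lock(mutex_);
        // A real implementation would acquire the GIL here (carefully, so the
        // buffer mutex and the GIL cannot deadlock) and forward buffer_ to
        // Python; this sketch just drops the data.
        buffer_.clear();
        return 0;
    }

private:
    std::mutex mutex_;
    std::string buffer_;
};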

This locking would have to be done carefully, though, since the _sync function would have to lock both the mutex protecting the pythonbuf buffer and the GIL, so you have to make sure this doesn't result in deadlock.
There might be lock-free algorithms for pushing characters into a buffer, but getting stuff like that right is not an easy task either.

On the one hand, acquiring a mutex for every print statement seems detrimental to performance and parallelism, but on the other hand, this is exactly what the C library does whenever you call putchar or puts (in contrast to putchar_unlocked and puts_unlocked), and if users want performant multithreaded code, they shouldn't be printing to cout in the first place.

I noticed that C++20 has a new type of streambuf: std::basic_syncbuf
It could be overkill, since it seems to guarantee atomicity for different instances using the same buffer, etc., but it might be useful nevertheless.
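
A quick C++20 illustration of that machinery via its convenience wrapper std::osyncstream (which sits on top of std::basic_syncbuf): each temporary accumulates its output and emits it to cout atomically when destroyed at the end of the statement.

#include <iostream>
#include <syncstream>
#include <thread>

int main() {
    auto task = [](const char *name) {
        for (int i = 0; i < 3; ++i)
            // Each osyncstream temporary emits one whole line atomically.
            std::osyncstream(std::cout) << "Thread " << name << " says: " << i << '\n';
    };
    std::jthread t1(task, "Fred");
    std::jthread t2(task, "Bob");
}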

As a final thought, even though the standard guarantees that writing to cout concurrently does not result in races or UB, realistically, this is not an issue in well-written code: the standard still allows characters to be interleaved, so if you want any sort of meaningful output, you'll have to provide your own synchronization around calls to cout as a user.
So in practice, I don't think the current implementation of scoped_ostream_redirect is that problematic. Still, the limitations should be mentioned somewhere in the documentation, and if you can easily fix this using C++20's basic_syncbuf, that might be worth it. Whether the time and effort of fixing this in a C++11-compatible way is worth it is not something I can judge, but like I said, in practice I don't believe this is a serious issue.

But IIUC this PR is valuable nevertheless, for single-threaded situations.

Yes, it does fix crashes in single-threaded code that prints UTF-8.

We just need to get around the valgrind/ASAN issue. What makes sense? My simple-minded size check? Could you please try some workaround?

Fundamentally, as long as the concurrency bugs aren't fixed, there's no way to rule out buffer overflows entirely, even with the extra size check.
Consider GCC's sputc implementation:
https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/libstdc%2B%2B-v3/include/std/streambuf#L428-L441
You could easily end up in a situation where the following happens:

--- Thread A ---            --- Thread B ---
if (pptr < end) 
                            if (pptr < end)
                                *pptr = c
                                ++pptr
    *pptr = c                                    // pptr could've been incremented beyond end in the meantime
    ++pptr

Still, because we might spend time waiting on the GIL, the chance of this happening is much higher in the _sync function. So even though it won't solve the issues with sputc and xsputn, adding the size check under the GIL is a good temporary solution, and as long as the other issues in sputc etc. don't show up, it should make the ASAN/Valgrind tests pass with a high probability (but not 100% certainty, since sputc could race with itself, with xsputn, with sync, with overflow, etc.).

I'll push a fix with the size check later today.

This doesn't solve the actual race, but at least it now has a much lower
probability of reading past the end of the buffer even when data races
do occur.
@rwgk
Collaborator

rwgk commented May 2, 2021

Thanks Pieter! I just triggered a new CI run and will also try again with our sanitizers (but I think it'll be fine now).

I'm intrigued by what you wrote before:

so if you want any sort of meaningful output, you'll have to provide your own synchronization around calls to cout as a user.
So in practice, I don't think the current implementation of scoped_ostream_redirect is that problematic.

What do you think about changing test_iostream.cpp accordingly? — The current situation of testing something we're sure is broken doesn't make much sense (IIUC). In a separate PR? Could you help out? Or outline what we need to do / guide me?

rwgk pushed a commit to rwgk/pybind11 that referenced this pull request May 3, 2021
@rwgk
Collaborator

rwgk commented May 3, 2021

Hi @tttapa and @henryiii, in hopes of getting rid of the CI flake, I went ahead with a very simple experiment (applied on top of this PR):

Simple-minded systematic insertion of std::lock_guard before all std::cout, std::cerr.

Is this an OK way to "provide your own synchronization" as Pieter suggested?

EDIT: I forgot to mention, this fixes the TSAN error reported under #2754.
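
For illustration, the change described above might look roughly like this (the names here are hypothetical placeholders, not the actual test code):

#include <iostream>
#include <mutex>
#include <string>

namespace {
std::mutex print_mutex; // file-local mutex shared by every writer in the test
} // namespace

void print_to_cout(const std::string &msg) {
    std::lock_guard<std::mutex> lock(print_mutex); // guard inserted before the write
    std::cout << msg << std::flush;
}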

@tttapa
Contributor Author

tttapa commented May 3, 2021

Consider the following code:

#include <iostream>
#include <thread>

int main() {
    auto task = [](const char *thrd_name, char data) {
        for (char c = data; c < data + 6; c++)
            std::cout << "Thread " << thrd_name << " says: " << c << "\r\n";
    };
    std::jthread t1(task, "Fred", 'a');
    std::jthread t2(task, "Bob", 'A');
}

IIUC, the standard guarantees that this is not undefined behavior. Still, when I ran it, I got:

Thread Fred says: Thread aBob says: A

Thread BobThread Fred says:  says: b
Thread Fred says: Bc

Thread Thread BobFred says:  says: Cd

Thread BobThread  says: D
Thread BobFred says: E
Thread  says: Bobe
Thread Fred says:  says: f
F

So even though there are no data races, and each character I sent to cout was indeed printed exactly once, the output is still garbage. If you want to write to cout from multiple threads (and be able to read the results), you need to lock the stream for each line or logical unit you write to it.
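
As a sketch of what that user-side locking could look like, here is the same example with a shared mutex held for each full line written to cout (the mutex name is just for this example):

#include <iostream>
#include <mutex>
#include <thread>

int main() {
    std::mutex cout_mutex;
    auto task = [&cout_mutex](const char *thrd_name, char data) {
        for (char c = data; c < data + 6; c++) {
            // Hold the lock for one whole line so lines can no longer interleave.
            std::lock_guard<std::mutex> lock(cout_mutex);
            std::cout << "Thread " << thrd_name << " says: " << c << "\r\n";
        }
    };
    std::jthread t1(task, "Fred", 'a');
    std::jthread t2(task, "Bob", 'A');
}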

Given the fact that this synchronization has to be provided by the user if he wants meaningful results, I think it's not unreasonable to require pybind11 users to lock all (possibly) concurrent writes to cout, since they would have to do that anyway if they wanted readable output.
With this caveat to the user, the pybind11 implementation of pythonbuf then no longer has to be able to receive data from multiple threads concurrently (it would still have to acquire the GIL though), so you could keep the current implementation.

However, if a user does write to cout from multiple threads without locking in their code, then with the current implementation of scoped_ostream_redirect this results in data races, as demonstrated by TSAN. This technically breaks the standard's guarantee that writing to cout concurrently is race-free. This is worse than the garbled output above, because now we're in undefined-behavior territory: realistically, the worst that will probably happen is that some characters are not printed, we read one char past the end of the buffer, or we get characters that were never printed, but we can't rule out that the program just crashes or behaves erratically in other ways. But at least you could point to the pybind11 docs and say you warned them.

If you want to provide the same guarantees as the standard, i.e. concurrent writes result in interleaved characters but no races, then you'll have to lock in pybind11's implementation of pythonbuf.
For well-written C++ code, this doesn't bring anything new to the table; it just adds a third lock (one for cout in the user code, another in pythonbuf::xsputn/overflow/sync for access to the buffer, and the GIL in pythonbuf::sync before calling back into Python).
For poorly written C++ code, like the example above, it makes the difference between printing garbage with data races and printing garbage without data races.

With respect to the changes you made to the tests, I'm not sure if that's necessary. Since most tests are single-threaded, there's no need for the locks there. In the threaded test, it kind of defeats the point of the test: by adding the lock around the cout << "x" << flush, you know that the pythonbuf::sync function will be called in mutual exclusivity, whereas you actually wanted to test if that function was able to handle concurrent calls.

I'm not sure where to go from here though. In a perfect world, you would implement a lock in pythonbuf and the threaded test would succeed with TSAN (without having to add locks to test_iostream.cpp).
In a world where developer time is finite and where everything is a compromise, I'd probably just tell users “add a lock around your print statements if you want to print from multiple threads”, and disable the failing threaded test (because it can't possibly succeed with the current implementation of pythonbuf).

@rwgk
Collaborator

rwgk commented May 4, 2021

Thanks a lot for the great explanation!

For poorly written C++ code, like the example above, it makes the difference between printing garbage with data races and printing garbage without data races.

I don't think that difference is worth anyone's time, or the added code complexity.

In a world where developer time is finite and where everything is a compromise

Absolutely. I will merge your change now. — I feel it's best to document that iostream.h only works with proper user-side locking, and to completely remove the threaded test along with removing the duplicate if (pbase() != pptr()), because it could mislead drive-by readers to believe that we support redirects without user-side locking.

@rwgk rwgk merged commit 0c93a0f into pybind:master May 4, 2021
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label May 4, 2021
@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Jul 13, 2021