Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stdio: Handle unicode boundaries better on Windows #23344

Closed
alexcrichton opened this issue Mar 13, 2015 · 17 comments · Fixed by #58454
Closed

stdio: Handle unicode boundaries better on Windows #23344

alexcrichton opened this issue Mar 13, 2015 · 17 comments · Fixed by #58454
Labels
A-Unicode Area: Unicode C-bug Category: This is a bug. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Copy link
Member

The stdio support on windows has two separate modes. One is when stdout is not a console and implies byte orientation, the other is when stdout is a console and implies [u16] orientation. In the stdout-is-a-console case we require that all input and output to be valid unicode so we can translate [u8] to [u16].

In this console case, however, our translation may not be quite right for a number of scenarios:

  • When reading data, the console will give us a block of u16's which we need to translate to UTF-8. In theory the console can give us half of a surrogate pair (with the next half available on the next call to read), but we do not handle this case well currently. This should be fixed by simply adding a "buffer of size 1" to hold half of a surrogate pair if necessary.
  • When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to read, then return an error, then return the remaining on the next call to read.
  • When writing data, we require that the entire block of input is valid UTF-8. We should instead take a similar approach as above where we try to interpret as much of the input as possible as UTF-8, but we simply ignore the rest for that one call to read (returning an error if the first byte is invalid UTF-8).
  • When writing data, we don't handle the case where a multibyte character straddles calls to write. Like the reading case, this could be alleviated with a 4-byte buffer for unwritten-but-valid-utf8 characters.
  • When writing, we translate a block of valid UTF-8 to a block of UTF-16. Upon writing this UTF-16, however, not all characters may be written. There are two problems here:
    • We need to translate a length of UTF-16 characters to a number of UTF-8 bytes that were written.
    • If half of a surrogate pair is written, the other half should be "buffered" and the length of UTF-8 written should include the half-written character (as the extra data is buffered).

At this time I'm not sure what sort of guarantees Windows actually gives us on these APIs. For example does windows never deal with only half of a surrogate pair on a read/write? If Windows were to guarantee various properties like this (or that it always writes the entire buffer, for example), our lives would be a lot easier! For now, though, it is unclear what we can rely on, so this issue is written as if we rely on nothing.

@alexcrichton
Copy link
Member Author

cc @retep998, @aturon

@alexcrichton
Copy link
Member Author

The code for all this currently lives in src/libstd/sys/windows/stdio.rs, but it may move over time.

@retep998
Copy link
Member

The Windows console just deals with an opaque sequence of WCHAR so it offers no guarantees regarding how it gives you surrogates.

@bluss
Copy link
Member

bluss commented Jul 9, 2015

Some notes on Stdin as it is today:

There seem to be some inefficiencies, the allocation of a 4 kB vector can possibly be turned into an array on the stack. A new String is then allocated again from that result.

The MemReader comment should go, it used to talk about an unwrap that's no longer there.

@alexcrichton
Copy link
Member Author

As reported in #30399, an easy way to trigger this:

    let v = vec![255u8; 5000];
    let s = String::from_utf8_lossy(&v);
    // println!("Len {}", s.len());
    println!("{}", s);

@retep998
Copy link
Member

retep998 commented Jan 4, 2016

I just ran into this issue while doing cargo list --tree in one of my projects presumably because it prints out some box drawing characters and one of them straddled a write boundary.

thread '<main>' panicked at 'failed printing to stdout: text was not valid unicode', ../src/libstd\io\stdio.rs:588

@luser
Copy link
Contributor

luser commented Jun 23, 2016

This is a real bummer because you can't write non-UTF-8 data to stdout via the std::io APIs, which is not a limitation in C. The simplest example here would be a program that wants to write arbitrary binary data to stdout (expecting it to be piped somewhere), but I hit this today with my program that runs a command, captures its stdout, and then later writes that data to its own stdout. If the stdout from the command was not valid UTF-8 that fails. (The program in question was the Visual C++ installed with the French locale running in a Windows VM with the system locale set to US English.)

@luser
Copy link
Contributor

luser commented Jun 23, 2016

I gather this code is written this way because we want to be able to write Rust strings (which can contain Unicode) to the console, but it's kind of limiting.

@alexcrichton
Copy link
Member Author

alexcrichton commented Jun 23, 2016

@luser unfortunately (as I'm led to believe) that's actually a somewhat impossible operation on Windows. The captured output is all byte-oriented as that's how the stdout pipes are set up for a spawned program, but when printing that to a console (not a pipe) it's expected to be a u16-oriented stream, so a conversion needs to happen.

I'm not sure if there's a system API for writing bytes to a console, however, other than just the u16 units we're doing today.

@retep998
Copy link
Member

Sure you can write bytes to the console by calling WriteConsoleA instead of WriteConsoleW, but it'll be interpreted as whatever the current code page is. While you can change the console output code page using SetConsoleOutputCP, trying to use UTF-8 will still suffer as it also has issues with multi byte codepoints straddling write boundaries. Trying to use SetConsoleCP to read UTF-8 is even worse as multi byte codepoints simply do not work. Rust takes the approach of trying to be as unicode friendly as possible by using wide functions. If you don't like that default and think unicode is a waste of time, then by all means feel free to call the appropriate system functions and write to stdout yourself in whatever manner you feel like.

The simplest example here would be a program that wants to write arbitrary binary data to stdout (expecting it to be piped somewhere),

If stdout is piped then Rust won't check that what you're outputting is UTF-8 and will gladly just throw those bytes through the pipe. The only time it matters is when you're interacting with the Windows Console which is a terrible place to read/write arbitrary binary data, and if your program tries to spew binary data to the console then that's just bad design.

@luser
Copy link
Contributor

luser commented Jun 24, 2016

I think it's unfortunate that the issue I ran into (run a subprocess, capture its output, write that to stdout) doesn't work reliably out-of-the box, and I think that's pretty legitimate.

@retep998
Copy link
Member

It is working as reliably as it can. If you pass it UTF-8 data it does the conversion to UTF-16 and everything works fine. If you pass it data that isn't UTF-8, well now what? It certainly can't losslessly convert it to UTF-16, so that's out. As for writing it out with WriteConsoleA, well, what code page do you want to use? In C it "works" because it simply writes out the bytes using WriteConsoleA and doesn't care what the current code page is. That means what glyph your characters will display as depends on what the current code page is, thereby completely butchering unicode.

In your case, if the program you're capturing the output of is not outputting UTF-8 but rather some other code page, then the solution is for you to figure out which code page it is and use a crate like https://github.com/lifthrasiir/rust-encoding to convert it to UTF-8 before writing it out. If the output is binary, why the heck are you trying to write binary data to the console?

@luser
Copy link
Contributor

luser commented Jun 24, 2016

Presumably the output is already in the system codepage, but std::process::Output's stdout member is a Vec<u8>, just bytes, and the Write trait takes a &[u8], which is usually just bytes, so yes, it's surprising that somewhere in the middle there something is trying to treat the data as UTF-8.

And yes, I understand that system codepages are a mess, and the historical legacy of "everything is ASCII" in C is terrible, but this seems like a fairly straightforward use case, and to do what would be 2 lines in C I'll have to jump through a number of hoops in Rust.

@retep998
Copy link
Member

Rust has to default to using the wide versions so that Rust strings, which are known to always be UTF-8, can be written to the console correctly. Changing this behavior would be a serious regression for all the people that currently rely on unicode in the console.

Maybe one day libstd will have a more extensive API for interacting with stdout, including choosing how to deal with code pages, but that will take time and likely an RFC (feel free to create one if you can think of a good design), but until then your best option is to call WriteConsoleA yourself or create a library that does so for you, or even just call the libc functions if you really want.

@alexcrichton
Copy link
Member Author

@luser yeah it's true that that this is unfortunately harder to do than in C, but it's unfortunate fallout from our decision to be much more principled about UTF-8 (especially in strings). The interface here is "write an arbitrary blob of bytes to stdout" rather than "write this process's output to stdout". The former doesn't have much of an interpretation if it's not unicode, but the latter should almost always be possible to ferry through without incident.

It may be worth exploring attempting to interpret the input as unicode and wherever that fails call the ansi functions which take bytes instead. That way you may be interleaving calls to the byte and wide apis, but maybe it'd work out? That'd at least get us to an infallible write implementation, although may give incorrect results.

@retep998
Copy link
Member

It would most definitely give incorrect results when your input is in a code page that doesn't match the ascii portion of UTF-8, as during wide calls those bytes would be interpreted as UTF-8, while if some invalid UTF-8 is encountered it would do a narrow call and interpret those bytes as whatever the console code page is. I'd really prefer if we had separate APIs for writing unicode data and writing arbitrary blobs in the current code page.

@Mark-Simulacrum Mark-Simulacrum added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-io labels Jun 25, 2017
@Mark-Simulacrum Mark-Simulacrum added A-Unicode Area: Unicode C-bug Category: This is a bug. labels Jul 22, 2017
@abonander
Copy link
Contributor

abonander commented Aug 28, 2018

In the interim, can we make the error message a little bit more Windows- and stdlib-specific? "text was not valid unicode" is really generic and can make it seem like the error is coming from somewhere other than the stdlib. I went so far as to grep my entire local Cargo cache to make sure it wasn't coming from a library somewhere in the below issue.

cc abonander/multipart#106

kennytm added a commit to kennytm/rust that referenced this issue Feb 7, 2019
…dtolnay

Improve error message and docs for non-UTF-8 bytes in stdio on Windows

This should make debugging problems like abonander/multipart#106 significantly more straightforward in the future.

cc rust-lang#23344, @retep998 @alexcrichton

Not sure who do r? so I'll let rust-highfive pick one.
Centril added a commit to Centril/rust that referenced this issue Feb 23, 2019
…hton

Refactor Windows stdio and remove stdin double buffering

I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there.

This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments.

Changes/fixes:
- Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output).
- Remove the unused `Read` implementation on `sys::windows::stdin`
- The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write
- `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s.
- If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise.
- Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true.
- `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data.
- `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone.
- `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary.
- `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does.

### Testing
I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP.
- compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases
- run a simple test program that reads on stdin, and echo's to stdout
- write some lines with plenty of ASCII and emoji in a text editor
- copy and paste in console to stdin
- return with `\r\n\` or CTRL-Z
- copy and paste in text editor
- check it round-trips

-----

Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is:

> * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`.

Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
Centril added a commit to Centril/rust that referenced this issue Feb 23, 2019
…hton

Refactor Windows stdio and remove stdin double buffering

I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there.

This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments.

Changes/fixes:
- Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output).
- Remove the unused `Read` implementation on `sys::windows::stdin`
- The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write
- `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s.
- If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise.
- Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true.
- `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data.
- `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone.
- `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary.
- `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does.

### Testing
I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP.
- compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases
- run a simple test program that reads on stdin, and echo's to stdout
- write some lines with plenty of ASCII and emoji in a text editor
- copy and paste in console to stdin
- return with `\r\n\` or CTRL-Z
- copy and paste in text editor
- check it round-trips

-----

Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is:

> * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`.

Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
Centril added a commit to Centril/rust that referenced this issue Feb 24, 2019
…hton

Refactor Windows stdio and remove stdin double buffering

I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there.

This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments.

Changes/fixes:
- Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output).
- Remove the unused `Read` implementation on `sys::windows::stdin`
- The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write
- `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s.
- If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise.
- Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true.
- `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data.
- `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone.
- `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary.
- `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does.

### Testing
I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP.
- compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases
- run a simple test program that reads on stdin, and echo's to stdout
- write some lines with plenty of ASCII and emoji in a text editor
- copy and paste in console to stdin
- return with `\r\n\` or CTRL-Z
- copy and paste in text editor
- check it round-trips

-----

Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is:

> * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`.

Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
Centril added a commit to Centril/rust that referenced this issue Feb 24, 2019
…hton

Refactor Windows stdio and remove stdin double buffering

I was looking for something nice and small to work on, tried to tackle a few FIXME's in Windows stdio, and things grew from there.

This part of the standard library contains some tricky code, and has changed over the years to handle more corner cases. It could use some refactoring and extra comments.

Changes/fixes:
- Made `StderrRaw` `pub(crate)`, to remove the `Write` implementations on `sys::Stderr` (used unsynchronised for panic output).
- Remove the unused `Read` implementation on `sys::windows::stdin`
- The `windows::stdio::Output` enum made sense when we cached the handles, but we can use simple functions like `is_console` now that we get the handle on every read/write
- `write` can now calculate the number of written bytes as UTF-8 when we can't write all `u16`s.
- If `write` could only write one half of a surrogate pair, attempt another write for the other because user code can't reslice in any way that would allow us to write it otherwise.
- Removed the double buffering on stdin. Documentation on the unexposed `StdinRaw` says: 'This handle is not synchronized or buffered in any fashion'; which is now true.
- `sys::windows::Stdin` now always only partially fills its buffer, so we can guarantee any arbitrary UTF-16 can be re-encoded without losing any data.
- `sys::windows::STDIN_BUF_SIZE` is slightly larger to compensate. There should be no real change in the number of syscalls the buffered `Stdin` does. This buffer is a little larger, while the extra buffer on Stdin is gone.
- `sys::windows::Stdin` now attempts to handle unpaired surrogates at its buffer boundary.
- `sys::windows::Stdin` no langer allocates for its buffer, but the UTF-16 decoding still does.

### Testing
I did some manual testing of reading and writing to console. The console does support UTF-16 in some sense, but doesn't supporting displaying characters outside the BMP.
- compile stage 1 stdlib with a tiny value for `MAX_BUFFER_SIZE` to make it easier to catch corner cases
- run a simple test program that reads on stdin, and echo's to stdout
- write some lines with plenty of ASCII and emoji in a text editor
- copy and paste in console to stdin
- return with `\r\n\` or CTRL-Z
- copy and paste in text editor
- check it round-trips

-----

Fixes rust-lang#23344. All but one of the suggestions in that issue are now implemented. the missing one is:

> * When reading data, we require the entire set of input to be valid UTF-16. We should instead attempt to read as much of the input as possible as valid UTF-16, only returning an error for the actual invalid elements. For example if we read 10 elements, 5 of which are valid UTF-16, the 6th is bad, and then the remaining are all valid UTF-16, we should probably return the first 5 on a call to `read`, then return an error, then return the remaining on the next call to `read`.

Stdin in Console mode is dealing with text directly input by a user. In my opinion getting an unpaired surrogate is quite unlikely in that case, and a valid reason to error on the entire line of input (which is probably short). Dealing with it is incompatible with an unbuffered stdin, which seems the more interesting guarantee to me.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Unicode Area: Unicode C-bug Category: This is a bug. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants