Lossy consistency #56787
r? @KodrAus (rust_highfive has picked a reviewer for you, use r? to override)
edit: the first commit is pull request #56142, which I had to base one minor change on; that pull request has been accepted but not yet merged
On Windows, `std` lossily converts an unpaired surrogate encoding into a single Unicode replacement character, whereas on Unix, and also via `String`'s raw byte conversion, three Unicode replacement characters are used. This is because the two cases go through different code paths that take different approaches: `std::sys_common::wtf8::to_string_lossy` for Windows `OsStr` lossy conversion, vs. `core`'s `run_utf8_validation` function for Unix and for creating a `String` from raw bytes.

The inconsistency breaks the tracking of how many bytes a short option argument consumes in our `OsStr` parsing code, because that code uses lossy `OsStr` conversion in combination with `std::str::from_utf8`.

A [bug report][1] and [pull request][2] have been filed against `std`.

[1]: rust-lang/rust#56786
[2]: rust-lang/rust#56787
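For reference, the raw-byte side of the comparison can be reproduced with a minimal sketch like the one below. It assumes the unpaired surrogate U+D800 in its three-byte WTF-8 form `ED A0 80`; the helper `replacement_count` and the constant `LONE_SURROGATE` are hypothetical names used only for illustration, not anything from `std`.

```rust
/// Three-byte WTF-8 style encoding of the unpaired surrogate U+D800.
/// (Hypothetical constant, for illustration only.)
const LONE_SURROGATE: [u8; 3] = [0xED, 0xA0, 0x80];

/// Counts the U+FFFD replacement characters produced when `bytes` is
/// lossily decoded with `String::from_utf8_lossy`.
/// (Hypothetical helper, not part of `std`.)
fn replacement_count(bytes: &[u8]) -> usize {
    String::from_utf8_lossy(bytes)
        .chars()
        .filter(|&c| c == '\u{FFFD}')
        .count()
}
```

Calling `replacement_count(&LONE_SURROGATE)` yields 3, since none of the three bytes can start or continue a valid UTF-8 sequence here and each is replaced on its own.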
force-pushed from d246403 to 7dd99b5
update: rebased now that #56142 has been successfully merged
With this commit, lossy UTF-8 conversion of `OsStr`/`OsString` on Windows will output three Unicode replacement characters (U+FFFD), one per byte, for surrogate byte sequences, instead of just one, making it consistent with lossy conversion on Unix and with the lossy conversion of raw byte sequences.

fixes rust-lang#56786
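The `OsStr` side can be observed with a sketch along these lines. It is an illustration only: `lone_surrogate_lossy` is a hypothetical helper, the Unix variant feeds the bytes `ED A0 80` through `OsStr::from_bytes`, and the Windows variant builds the same unpaired surrogate from the single UTF-16 unit `0xD800` via `OsString::from_wide`.

```rust
/// Lossily converts an `OsStr` holding the unpaired surrogate U+D800.
/// (Hypothetical helper, for illustration only.)
#[cfg(unix)]
fn lone_surrogate_lossy() -> String {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    // On Unix an OsStr is arbitrary bytes, so the surrogate is given
    // directly in its three-byte form.
    OsStr::from_bytes(&[0xED, 0xA0, 0x80])
        .to_string_lossy()
        .into_owned()
}

#[cfg(windows)]
fn lone_surrogate_lossy() -> String {
    use std::ffi::OsString;
    use std::os::windows::ffi::OsStringExt;
    // On Windows an OsString is built from UTF-16 units; 0xD800 alone
    // is an unpaired high surrogate.
    OsString::from_wide(&[0xD800])
        .to_string_lossy()
        .into_owned()
}
```

On Unix the result contains three U+FFFD characters. On Windows, the number of replacement characters in the result is exactly what this change is about.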
force-pushed from 7dd99b5 to a6a3871
updated: forgot to update the test block at the bottom of the page
closing, as per #56786
fixes #56786
this is a change in behaviour