-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std: Change encode_utf{8,16}
to return iterators
#32204
std: Change encode_utf{8,16}
to return iterators
#32204
Conversation
r? @aturon cc @SimonSapin |
fn encode_utf8(self) -> EncodeUtf8 { | ||
let code = self as u32; | ||
let mut buf = [0; 4]; | ||
let pos = if code < MAX_ONE_B && !buf.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of !buf.is_empty()
? Isn't it always true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Just a holdover from the copy/pasted code below.
2b3c5ac
to
e3581c8
Compare
fn encode_utf8(self, dst: &mut [u8]) -> Option<usize>; | ||
#[stable(feature = "core", since = "1.6.0")] | ||
fn encode_utf16(self, dst: &mut [u16]) -> Option<usize>; | ||
#[unstable(feature = "unicode", issue = "27784")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing #[stable]
to #[unstable]
? Is that acceptable because the trait is #[unstable]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right yes thanks for the reminder! I meant to discuss this more at length on the PR comments but forgot.
So it looks like char::encode_utf8
is unstable (as expected), but we accidentally stabilized CharExt::encode_utf8
in libcore. This appears to be accidental (from what I can tell), so it seems that we're within our rights to switch back to unstable here (with libstd being the "source of truth")
That being said I wouldn't mind doing a crater run to see if it affects any crates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be accidental (from what I can tell), so it seems that we're within our rights to switch back to unstable here (with libstd being the "source of truth")
That reasoning seems problematic. If a stable channel compiler lets me use some feature/item because it is marked #[stable]
, I have no way to know whether it is "accidentally" stable. Going back to #[unstable]
effectively removes an item as far as the stable channel is concerned, which RFC 1105 says is a major change.
I thought this specific case was OK since the CharExt
trait is #[unstable]
. If I can’t write use core::char::CharExt;
I can’t use CharExt
’s methods. But I don’t need that use
, it’s in the core
prelude.
Maybe these methods have been stable a short enough time that little enough code uses them that the breakage is acceptable. I’m in favor of this change, but “oh but we didn’t mean that stability promise we made” should not be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SimonSapin this was a mistake when stabilizing libcore. If it causes breakage we can back out and rethink our strategy, but the hypothesis is that this won't cause breakage as the standard library's copy has been unstable so anyone using stable has been avoiding it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, in general it's obviously not OK to revert a stabilization just because it was an accident. But the fact that the stable item was isolated to libcore and not stable in libstd meant that the impact of correction should be very low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, I see @alexcrichton beat me to basically the same point...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a separate issue was opened for this: #32460
Oof, sorry for the delay, this got buried in my inbox. Nice improvements here. I don't think we need a crater run for the libcore stability changes -- I'm fine to land as-is. @bors: r+ |
📌 Commit e3581c8 has been approved by |
⌛ Testing commit e3581c8 with merge 93ee81b... |
💔 Test failed - auto-linux-64-x-armhf |
Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which *also* exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc rust-lang#27784
e3581c8
to
48d5fe9
Compare
…turon std: Change `encode_utf{8,16}` to return iterators Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which *also* exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc #27784
💔 Test failed - auto-win-gnu-32-nopt-t |
@bors: retry On Tue, Mar 22, 2016 at 2:44 PM, bors notifications@github.com wrote:
|
@bors: retry force clean |
…turon std: Change `encode_utf{8,16}` to return iterators Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which *also* exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc #27784
self.bytes.set_len(cur_len + used); | ||
} | ||
let bytes = unsafe { | ||
char::from_u32_unchecked(code_point.value).encode_utf8() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if only "for a short time" this can create an invalid char
(which can be a surrogate). Is that really OK? Doesn’t char
provide range assertions to LLVM? Previously we had encode_utf8_raw
taking u32
to avoid this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right. I'd rather not expose those methods though from libcore again and we can just inline them here for now if we need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. But I’m asking, because I really don’t know: do we need to? Is this one of those things that appear to work when you try it but is technically undefined behavior and can cause bad things in future compiler versions?
I still have nigthly CI for https://github.com/SimonSapin/rust-wtf8 which has similar code, failing now that encode_utf8_raw
is gone, and I’m unsure what to do about it. And I imagine other libraries may be tempted to do something similar: I recall the regex
crate doing some contortions internally to avoid creating a surrogate char
when "iterating" over character ranges like [a-z]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure either, it kinda depends what we tell LLVM, how we codgen this, and what operations happen.
Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which also exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!
cc #27784