Taking the first N bytes of a str
that still make up valid UTF-8
#2566
Comments
This kinda feels a bit too niche to be in |
Also, such a function may break the string on boundaries that are considered nonsensical, such as between diacritical marks. Would be really weird for the substring to be missing a diacritic on the last character, and the substring right after it to have an orphaned diacritic at the beginning. If the concern is transmission/storage, UTF-8 is already well-equipped for processing as the bytes come, as the encoding has, in each byte, metadata saying either that the byte continues an already-started codepoint, or that the byte starts a new codepoint, along with that codepoint's exact byte length. Once a codepoint arrives in full and is consumed, further bytes will never invalidate it. |
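To make the point about per-byte metadata concrete, here is a minimal sketch that classifies a single byte by its high bits. The names `ByteKind` and `classify` are made up for illustration and are not part of any proposal in this thread.

```rust
/// Role of a single byte inside a UTF-8 stream (illustrative type).
#[derive(Debug, PartialEq, Eq)]
enum ByteKind {
    /// Starts a code point that is `len` bytes long in total.
    Lead { len: usize },
    /// Continues a code point that an earlier byte started.
    Continuation,
    /// Never appears in well-formed UTF-8.
    Invalid,
}

/// Classifies one byte by looking only at that byte.
fn classify(byte: u8) -> ByteKind {
    match byte {
        0x00..=0x7F => ByteKind::Lead { len: 1 }, // ASCII
        0x80..=0xBF => ByteKind::Continuation,    // bit pattern 10xxxxxx
        0xC2..=0xDF => ByteKind::Lead { len: 2 }, // bit pattern 110xxxxx
        0xE0..=0xEF => ByteKind::Lead { len: 3 }, // bit pattern 1110xxxx
        0xF0..=0xF4 => ByteKind::Lead { len: 4 }, // bit pattern 11110xxx
        _ => ByteKind::Invalid,                   // 0xC0, 0xC1, 0xF5..=0xFF
    }
}
```

A receiver can use this to tell how many more bytes it needs before the code point it is currently reading is complete. Full validation needs a few more checks (for example, rejecting surrogates and overlong three- and four-byte forms), which this sketch leaves out.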
@cramertj Can you say more about how and why it’s used? And my other questions above. |
At the very least, it should split at grapheme cluster boundaries. Unicode code point is not the same as visible character. |
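As a small illustration of that difference, the sketch below splits a string at an index that is a valid char boundary but still falls inside one user-visible character. The constant and function names are hypothetical, and it uses only the standard library. Actual grapheme segmentation is not in std and would need an external crate such as `unicode-segmentation`.

```rust
/// "café" spelled with a combining acute accent: 'e' followed by U+0301.
const DECOMPOSED: &str = "cafe\u{301}";

/// Splits at a byte index that is a valid char boundary
/// but falls inside one user-visible character.
fn split_inside_grapheme() -> (&'static str, &'static str) {
    // Byte 4 sits between 'e' and the combining accent.
    assert!(DECOMPOSED.is_char_boundary(4));
    DECOMPOSED.split_at(4)
}
```

The head comes back as plain "cafe" and the tail is a lone combining accent, which is the orphaned-diacritic case described earlier in the thread.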
Presumably any time you need to fill a fixed size buffer with text, and either have no way to return the overflow, or simply don't want to have to preserve the encoder/decoder state when encoding/decoding across multiple buffers. |
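A rough sketch of that fixed-size buffer case, assuming the overflow is simply dropped. `fill_buffer` is a hypothetical helper name, not an existing API.

```rust
/// Copies the longest prefix of `text` that fits in `buf` without
/// splitting a code point. Returns the number of bytes written.
fn fill_buffer(text: &str, buf: &mut [u8]) -> usize {
    let mut end = text.len().min(buf.len());
    // Index 0 is always a char boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    buf[..end].copy_from_slice(&text.as_bytes()[..end]);
    end
}
```

The first `end` bytes of `buf` are then valid UTF-8 on their own, so no encoder or decoder state has to be carried across buffers.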
Sorry about vanishing there, had a very busy week at work and haven't had the time to elaborate until now.
For my case, yes. I don't think this should consider characters, since as you mention, there are multiple things one may want in terms of characters. Anyway, most (all?) functions that take indices on
I've hit it a couple times (although until the most recent time I didn't notice the unicode bug), usually when dealing with filling text into a buffer, or when truncating a string before performing a somewhat expensive operation on it (in this case, it's a match operation performed on very many strings, most of which are short, but the long ones might be very long and full of nonsense). More generally, I feel that the rationale behind having an
The benefit of being in std is that the issue is not obvious at first glance. If there were an idiomatic method on A possibly less niche function that might assist here instead would be something like a
// Note: I haven't tested this, and typed it directly into github
impl str {
// ...
pub fn prev_char_boundary(&self, mut index: usize) -> usize {
if index >= self.len() { return self.len(); } // Or maybe it should assert. Dunno.
while !self.is_char_boundary(index) {
index -= 1;
}
index
}
// ...
}
This wouldn't really help with encouraging correct code, but it would make fixing the issue easier when you do find it. |
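Since an inherent `impl str` block only compiles inside the standard library, trying the idea in user code needs an extension trait or a free function. Below is a sketch of the same logic as an extension trait. The trait name `PrevCharBoundary` is made up for illustration.

```rust
/// Hypothetical extension trait; the name is illustrative only.
trait PrevCharBoundary {
    /// Returns the largest char boundary that is `<= index`,
    /// clamped to the length of the string.
    fn prev_char_boundary(&self, index: usize) -> usize;
}

impl PrevCharBoundary for str {
    fn prev_char_boundary(&self, mut index: usize) -> usize {
        if index >= self.len() {
            return self.len();
        }
        // Index 0 is always a boundary, so this cannot underflow.
        while !self.is_char_boundary(index) {
            index -= 1;
        }
        index
    }
}
```

With the trait in scope, `&s[..s.prev_char_boundary(n)]` never panics. Whether an out-of-range index should clamp or assert is left open, as in the comment above.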
FWIW, |
@bluss that's exactly where I copied the code from to use in my project. @SimonSapin I mostly use this when I want to construct a fixed-capacity string on the stack (e.g. Edit: On second thought, I think my usage seems to be more closely related to |
Similar methods are being added in rust-lang/rust#86497 |
In the past I've done stuff like
&s[..max_len.min(s.len())]
to truncate strings, but it turns out this is subtly broken (and will panic) for strings where max_len happens to be in the middle of a multibyte UTF-8 sequence (e.g. for the case !s.is_char_boundary(max_len)).
I've made a utility function for this (below), but it would be nice if a method on str existed for this case. In particular, I think the fact that the naive solution is broken on non-ASCII text makes it worthwhile, since developers are less likely to test on such text.
I have no opinions on its name (I'm genuinely terrible at names), nor on further extensions / variations or anything like that.
Anyway, below is the source for my version of it, provided mostly to be completely clear on what I'm talking about. I think in practice this would go as a method on str and so would have a somewhat different implementation.
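The author's utility function is not reproduced here. As a stand-in, this is a minimal sketch of what such a truncation helper could look like. The name `truncate_to_boundary` and its exact behavior are assumptions for illustration only.

```rust
/// Returns the longest prefix of `s` that is at most `max_len` bytes
/// long and ends on a char boundary.
fn truncate_to_boundary(s: &str, max_len: usize) -> &str {
    if max_len >= s.len() {
        return s;
    }
    // Walk left from `max_len` to the nearest boundary; 0 always qualifies.
    let end = (0..=max_len)
        .rev()
        .find(|&i| s.is_char_boundary(i))
        .unwrap_or(0);
    &s[..end]
}
```

For comparison, the checked form of the naive slice, `s.get(..max_len.min(s.len()))`, returns `None` instead of panicking when `max_len` lands inside a multibyte sequence. That makes it a convenient way to show the failure in a test that uses non-ASCII input.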