Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Taking the first N bytes of a str that still make up valid UTF-8 #2566

Open
thomcc opened this issue Oct 16, 2018 · 10 comments
Open

Taking the first N bytes of a str that still make up valid UTF-8 #2566

thomcc opened this issue Oct 16, 2018 · 10 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@thomcc
Copy link
Member

thomcc commented Oct 16, 2018

In the past I've done stuff like &s[..max_len.min(s.len())] to truncate strings, but it turns out this is subtly broken (and will panic) for strings where max_len happens to be in the middle of a multibyte utf8-sequence (e.g. for the case !s.is_char_boundary(max_len)).

I've made a utility function for this (below), but it would be nice if a method on str existed for this case. In particular, I think the fact that the naive solution is broken on non-ASCII text makes it worthwhile, since developers are less likely to test on such text.

I have no opinions on its name (I'm genuinely terrible at names), nor on further extensions / variations or anything like that.

Anyway, below is the source for my version of it, provided mostly as to be completely clear on what I'm talking about. I think in practice this would go as a method on str and so would have a somewhat different implementation.

pub fn slice_up_to(s: &str, max_len: usize) -> &str {
    if max_len >= s.len() {
        return s;
    }
    let mut idx = max_len;
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    &s[..idx]
}
@SimonSapin
Copy link
Contributor

This kinda feels a bit too niche to be in std rather than crates.io. In particular, why do you need this operation in the first place? Where does max_len come from? Do you really want up to some number of bytes, rather than characters? If the latter, given things like Unicode combining code points, what does "character" really mean in this context?

@eaglgenes101
Copy link

eaglgenes101 commented Oct 16, 2018

Also, such a function may break the string on boundaries that are considered nonsensical, such as between diacritical marks. Would be really weird for the substring to be missing a diacritic on the last character, and the substring right after it to have an orphaned diacritic at the beginning.

If the concern is transmission/storage, UTF-8 is already well-equipped for processing as the bytes come, as the encoding has, in each byte, metadata saying either that the byte continues an already-started codepoint, or that the byte starts a new codepoint, along with that codepoint's exact byte length. Once a codepoint arrives in full and is consumed, further bytes will never invalidate it.

@cramertj
Copy link
Member

@SimonSapin
Copy link
Contributor

@cramertj Can you say more about how and why it’s used? And my other questions above.

@le-jzr
Copy link

le-jzr commented Oct 18, 2018

At the very least, it should split at grapheme cluster boundaries. Unicode code point is not the same as visible character.

@Diggsey
Copy link
Contributor

Diggsey commented Oct 18, 2018

Presumably any time you need to fill a fixed size buffer with text, and either have no way to return the overflow, or simply don't want to have to preserve the encoder/decoder state when encoding/decoding across multiple buffers.

@thomcc
Copy link
Member Author

thomcc commented Oct 19, 2018

Sorry about vanishing there, had a very busy week at work and haven't had the time to elaborate until now.

Do you really want up to some number of bytes, rather than characters?

For my case, yes. I don't think this should consider characters, since as you mention, there are multiple things one may want in terms of characters. Anyway, most (all?) functions that take indices on str take byte indices, which IMO is the correct call.

Why do you need this operation in the first place? Where does max_len come from?

I've hit it a couple times (although until the most recent time I didn't notice the unicode bug), usually when dealing with filling text into a buffer, or when truncating a string before performing a somewhat expensive operation on it (in this case, it's a match operation performed on very many strings, most of which are short, but the long ones might be very long and full of nonsense).

More generally, I feel that the rationale behind having an is_char_boundary function is similar, and also:

This kinda feels a bit too niche to be in std rather than crates.io

The benefit of being in std is that the issue is not obvious at first glance. If there were an idiomatic method on str for this, subtle unicode bugs in code could be prevented (e.g. it's existence would encourage more correct code).

A possibly less niche function that might assist here instead would be something like a str::prev_char_boundary(&self, index: usize) -> usize function which takes a byte index which may be in the middle of a char, and returns the previous valid byte index. E.g.

// Note: I haven't tested this, and typed it directly into github
impl str {
    // ...
    pub fn prev_char_boundary(&self, mut index: usize) -> usize {
        if index >= self.len() { return self.len(); } // Or maybe it should assert. Dunno.
        while !self.is_char_boundary(index) {
            index -= 1;
        }
        index
    }
    // ...
}

This wouldn't really help with encouraging correct code, but it would make fixing the issue easier when you do find it.

@scottmcm scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Nov 2, 2018
@bluss
Copy link
Member

bluss commented Dec 8, 2018

FWIW, str itself uses this to put some kind of simple limit on its panic message for slicing errors @ truncate_to_char_boundary.

@WiSaGaN
Copy link
Contributor

WiSaGaN commented Jan 18, 2019

@bluss that's exactly where I copy the code from to use in my project.

@SimonSapin I mostly use this when I want to construct a fixed capacity string in stack (e.g. ArrayString from bluss' arrayvec) from a maybe unlimited length string, and I am OK with truncating if the capacity is not enough. I won't say this usage is very frequent, but without this, the &str API feels somewhat incomplete.

Edit: On a second thought I think my usage seems to be more closely related to ArrayString as it is used only in place where we have both a fixed capacity and a string instead of bytes, so seems OK to me for it to be just added to ArrayString as a method (or for constructor and a fill in method).

@WiSaGaN
Copy link
Contributor

WiSaGaN commented Jun 25, 2021

Similar methods are being added in rust-lang/rust#86497

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests

9 participants