Taking the first N bytes of a `str` that still make up valid UTF-8 #2566

thomcc · 2018-10-16T18:18:15Z

In the past I've done stuff like &s[..max_len.min(s.len())] to truncate strings, but it turns out this is subtly broken (and will panic) for strings where max_len happens to be in the middle of a multibyte utf8-sequence (e.g. for the case !s.is_char_boundary(max_len)).

I've made a utility function for this (below), but it would be nice if a method on str existed for this case. In particular, I think the fact that the naive solution is broken on non-ASCII text makes it worthwhile, since developers are less likely to test on such text.

I have no opinions on its name (I'm genuinely terrible at names), nor on further extensions / variations or anything like that.

Anyway, below is the source for my version of it, provided mostly as to be completely clear on what I'm talking about. I think in practice this would go as a method on str and so would have a somewhat different implementation.

pub fn slice_up_to(s: &str, max_len: usize) -> &str {
    if max_len >= s.len() {
        return s;
    }
    let mut idx = max_len;
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    &s[..idx]
}

The text was updated successfully, but these errors were encountered:

SimonSapin · 2018-10-16T18:50:02Z

This kinda feels a bit too niche to be in std rather than crates.io. In particular, why do you need this operation in the first place? Where does max_len come from? Do you really want up to some number of bytes, rather than characters? If the latter, given things like Unicode combining code points, what does "character" really mean in this context?

eaglgenes101 · 2018-10-16T19:17:38Z

Also, such a function may break the string on boundaries that are considered nonsensical, such as between diacritical marks. Would be really weird for the substring to be missing a diacritic on the last character, and the substring right after it to have an orphaned diacritic at the beginning.

If the concern is transmission/storage, UTF-8 is already well-equipped for processing as the bytes come, as the encoding has, in each byte, metadata saying either that the byte continues an already-started codepoint, or that the byte starts a new codepoint, along with that codepoint's exact byte length. Once a codepoint arrives in full and is consumed, further bytes will never invalidate it.

cramertj · 2018-10-17T23:14:04Z

FWIW, i've also needed and written this function before.

SimonSapin · 2018-10-18T06:22:49Z

@cramertj Can you say more about how and why it’s used? And my other questions above.

le-jzr · 2018-10-18T13:13:08Z

At the very least, it should split at grapheme cluster boundaries. Unicode code point is not the same as visible character.

Diggsey · 2018-10-18T13:30:24Z

Presumably any time you need to fill a fixed size buffer with text, and either have no way to return the overflow, or simply don't want to have to preserve the encoder/decoder state when encoding/decoding across multiple buffers.

thomcc · 2018-10-19T19:26:03Z

Sorry about vanishing there, had a very busy week at work and haven't had the time to elaborate until now.

Do you really want up to some number of bytes, rather than characters?

For my case, yes. I don't think this should consider characters, since as you mention, there are multiple things one may want in terms of characters. Anyway, most (all?) functions that take indices on str take byte indices, which IMO is the correct call.

Why do you need this operation in the first place? Where does max_len come from?

I've hit it a couple times (although until the most recent time I didn't notice the unicode bug), usually when dealing with filling text into a buffer, or when truncating a string before performing a somewhat expensive operation on it (in this case, it's a match operation performed on very many strings, most of which are short, but the long ones might be very long and full of nonsense).

More generally, I feel that the rationale behind having an is_char_boundary function is similar, and also:

This kinda feels a bit too niche to be in std rather than crates.io

The benefit of being in std is that the issue is not obvious at first glance. If there were an idiomatic method on str for this, subtle unicode bugs in code could be prevented (e.g. it's existence would encourage more correct code).

A possibly less niche function that might assist here instead would be something like a str::prev_char_boundary(&self, index: usize) -> usize function which takes a byte index which may be in the middle of a char, and returns the previous valid byte index. E.g.

// Note: I haven't tested this, and typed it directly into github
impl str {
    // ...
    pub fn prev_char_boundary(&self, mut index: usize) -> usize {
        if index >= self.len() { return self.len(); } // Or maybe it should assert. Dunno.
        while !self.is_char_boundary(index) {
            index -= 1;
        }
        index
    }
    // ...
}

This wouldn't really help with encouraging correct code, but it would make fixing the issue easier when you do find it.

bluss · 2018-12-08T15:06:41Z

FWIW, str itself uses this to put some kind of simple limit on its panic message for slicing errors @ truncate_to_char_boundary.

WiSaGaN · 2019-01-18T05:23:15Z

@bluss that's exactly where I copy the code from to use in my project.

@SimonSapin I mostly use this when I want to construct a fixed capacity string in stack (e.g. ArrayString from bluss' arrayvec) from a maybe unlimited length string, and I am OK with truncating if the capacity is not enough. I won't say this usage is very frequent, but without this, the &str API feels somewhat incomplete.

Edit: On a second thought I think my usage seems to be more closely related to ArrayString as it is used only in place where we have both a fixed capacity and a string instead of bytes, so seems OK to me for it to be just added to ArrayString as a method (or for constructor and a fill in method).

WiSaGaN · 2021-06-25T20:58:13Z

Similar methods are being added in rust-lang/rust#86497

scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Nov 2, 2018

piodul mentioned this issue Sep 27, 2022

Add tracing subscriber to enable logging scylladb/cpp-rust-driver#78

Merged

4 tasks

konstin mentioned this issue Jul 7, 2023

Move file-level rule exemption to lexer-based approach astral-sh/ruff#5567

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Taking the first N bytes of a `str` that still make up valid UTF-8 #2566

Taking the first N bytes of a `str` that still make up valid UTF-8 #2566

thomcc commented Oct 16, 2018

SimonSapin commented Oct 16, 2018

eaglgenes101 commented Oct 16, 2018 •

edited

Loading

cramertj commented Oct 17, 2018

SimonSapin commented Oct 18, 2018

le-jzr commented Oct 18, 2018

Diggsey commented Oct 18, 2018

thomcc commented Oct 19, 2018

bluss commented Dec 8, 2018 •

edited

Loading

WiSaGaN commented Jan 18, 2019 •

edited

Loading

WiSaGaN commented Jun 25, 2021

Taking the first N bytes of a str that still make up valid UTF-8 #2566

Taking the first N bytes of a str that still make up valid UTF-8 #2566

Comments

thomcc commented Oct 16, 2018

SimonSapin commented Oct 16, 2018

eaglgenes101 commented Oct 16, 2018 • edited Loading

cramertj commented Oct 17, 2018

SimonSapin commented Oct 18, 2018

le-jzr commented Oct 18, 2018

Diggsey commented Oct 18, 2018

thomcc commented Oct 19, 2018

bluss commented Dec 8, 2018 • edited Loading

WiSaGaN commented Jan 18, 2019 • edited Loading

WiSaGaN commented Jun 25, 2021

Taking the first N bytes of a `str` that still make up valid UTF-8 #2566

Taking the first N bytes of a `str` that still make up valid UTF-8 #2566

eaglgenes101 commented Oct 16, 2018 •

edited

Loading

bluss commented Dec 8, 2018 •

edited

Loading

WiSaGaN commented Jan 18, 2019 •

edited

Loading