Add {floor,ceil}_char_boundary methods to str #86497

clarfonthey · 2021-06-20T20:32:39Z

This is technically already used internally by the standard library in the form of truncate_to_char_boundary.

Essentially these are two building blocks to allow for approximate string truncation, where you want to cut off the string at "approximately" a given length in bytes but don't know exactly where the character boundaries lie. It's also a good candidate for the standard library as it can easily be done naively, but would be difficult to properly optimise. Although the existing code that's done in error messages is done naively, this code will explicitly only check a window of 4 bytes since we know that a boundary must lie in that range, and because it will make it possible to vectorise.

Although this method doesn't take into account graphemes or other properties, this would still be a required building block for splitting that takes those into account. For example, if you wanted to split at a grapheme boundary, you could take your approximate splitting point and then determine the graphemes immediately following and preceeding the split. If you then notice that these two graphemes could be merged, you can decide to either include the whole grapheme or exclude it depending on whether you decide splitting should shrink or expand the string.

This takes the most conservative approach and just offers the raw indices to the user, and they can decide how to use them. That way, the methods are as useful as possible despite having as few methods as possible.

(Note: I'll add some tests and a tracking issue if it's decided that this is worth including.)

rust-highfive · 2021-06-20T20:32:41Z

r? @Mark-Simulacrum

(rust-highfive has picked a reviewer for you, use r? to override)

clarfonthey · 2021-06-21T15:21:56Z

(I know how to fix the error but my desktop is in literal pieces at the moment, so, I'll fix it this weekend.)

scottmcm · 2021-06-21T19:10:06Z

The functionality of these seems like a good thing to provide 👍

For the bikeshed, mixing "nearest" and "min/max" in the name is a bit confusing to me. Maybe char_boundary_begin and char_boundary_end or something? (And thus is_char_boundary == (char_boundary_begin == char_boundary_end).)

Or maybe, with subtly different semantics char_{begin|end}? So that s.char_begin(i) .. s.char_end(i) is the Range<usize> representing the USV that includes s.as_bytes()[i]?

That raises the question of endpoints, I guess. What's .nearest_char_boundary_min(usize::MAX) supposed to do?

clarfonthey · 2021-06-21T22:17:36Z

That raises the question of endpoints, I guess. What's .nearest_char_boundary_min(usize::MAX) supposed to do?

Forgetting the names (and I agree, it's hard to come up with something reasonable) the current behaviour here is to panic.

The one which is upper-bounded will currently not panic, which I think is good for truncating behaviour -- if you want your string to be under, say, 20 bytes, and it's empty, that's fine.

Lower-bounded ones must panic if the index is above len, since they can't go any lower.

If we were to use the concept of bounding the results to a specific character, then our issue is that we treat past the end as an "end of string" character, and this isn't C.

My initial terms for the methods were "largest char boundary not exceeding" and "smallest char boundary not below," and I figured those names sucked, but that's the intent that I wanted to preserve in the methods themselves.

library/core/src/str/mod.rs

Mark-Simulacrum · 2021-06-28T15:34:50Z

r? @dtolnay as T-libs-api to evaluate the addition of unstable API surface area. Implementations look OK to me modulo comments but I didn't evaluate the correctness of the search criteria (i.e., as casts and equality to -0x40).

clarfonthey · 2021-06-28T16:28:15Z

Note that the -0x40 was taken directly from is_char_boundary. I could factor out that bit into a separate private method (like byte_is_char_boundary) to avoid potential drift.

(Edit: I did that)

clarfonthey · 2021-06-30T16:03:57Z

That test passed for me locally, and I'm not 100% sure on how to fix it. I also know that u8::is_utf8_char_boundary should probably be inlined. Mostly want to wait for a more affirmative "yes, I like this API ± name bikeshedding" before I put more work into it.

library/core/src/str/mod.rs

clarfonthey · 2021-10-25T12:44:02Z

Rebased, would be nice if someone could take a look before I have to rebase again. 😉

falk-hueffner · 2021-11-06T10:20:04Z

library/core/src/str/mod.rs

@@ -213,7 +209,81 @@ impl str {
            None => index == self.len(),

            // This is bit magic equivalent to: b < 128 || b >= 192


This comment should probably move to is_utf8_char_boundary.

dtolnay

These method names and signatures look good to me. I was skeptical about the previous iterations but these are nice. Thanks for all the discussion above.

Please open a tracking issue and update the unstable attributes in the PR.

This is going to require more extensive test coverage prior to landing. See if you can find where the unit tests are for the other methods in this impl block.

library/core/src/str/mod.rs

crlf0710 · 2021-12-17T17:41:29Z

@clarfonthey Ping from triage, would you mind addressing the comments above? Thanks!

clarfonthey · 2021-12-29T01:42:50Z

Hey, sorry that I took so long getting to this. Will add some more tests and see what I can do about fixing the implementation.

clarfonthey · 2022-01-29T23:35:22Z

Okay, this is finally ready for a proper review. I added a (hopefully) thorough set of tests and fixed the issue with ceil_char_boundary, which also did result in the code being more readable (IMHO) anyway.

@rustbot ready

bors · 2022-02-06T13:07:03Z

☔ The latest upstream changes (presumably #90414) made this pull request unmergeable. Please resolve the merge conflicts.

scottmcm · 2022-02-06T17:44:15Z

This is looking pretty good to me, code-wise, now.

Can you please add the tracking issue (as mentioned in #86497 (review))? With that I think it'll be good to go.

clarfonthey · 2022-02-07T17:35:19Z

Tracking issue made, also fixed the test being called ceil_char_bonudary instead of ceil_char_boundary 😳

Should be ready for final CR

clarfonthey · 2022-02-07T18:34:47Z

(I had accidentally committed the wrong version of a submodule when rebasing, so, hopefully it doesn't fail this time.)

scottmcm · 2022-02-07T23:05:23Z

Thanks for your persistence here! These look good to go for nightly.

@bors r+

If anyone has further thoughts, please use the tracking issue #93743 or a new bug cc'ing it.

bors · 2022-02-07T23:05:25Z

📌 Commit edd318c has been approved by scottmcm

…askrgr Rollup of 9 pull requests Successful merges: - rust-lang#86497 (Add {floor,ceil}_char_boundary methods to str) - rust-lang#92695 (Add `#[no_coverage]` tests for nested functions) - rust-lang#93521 (Fix hover effects in sidebar) - rust-lang#93568 (Include all contents of first line of scraped item in Rustdoc) - rust-lang#93569 (rustdoc: correct unclosed HTML tags as generics) - rust-lang#93672 (update comment wrt const param defaults) - rust-lang#93715 (Fix horizontal trim for block doc comments) - rust-lang#93721 (rustdoc: Special-case macro lookups less) - rust-lang#93728 (Add in ValuePair::Term) Failed merges: r? `@ghost` `@rustbot` modify labels: rollup

rust-highfive assigned Mark-Simulacrum Jun 20, 2021

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 20, 2021

This comment has been minimized.

Sign in to view

clarfonthey force-pushed the nearest_char_boundary branch from 379b542 to 8699967 Compare June 20, 2021 20:37

This comment has been minimized.

Sign in to view

WiSaGaN mentioned this pull request Jun 25, 2021

Taking the first N bytes of a str that still make up valid UTF-8 rust-lang/rfcs#2566

Open

Mark-Simulacrum reviewed Jun 28, 2021

View reviewed changes

library/core/src/str/mod.rs Outdated Show resolved Hide resolved

library/core/src/str/mod.rs Outdated Show resolved Hide resolved

rust-highfive assigned dtolnay and unassigned Mark-Simulacrum Jun 28, 2021

clarfonthey force-pushed the nearest_char_boundary branch from 8699967 to d2b0e71 Compare June 29, 2021 01:53

This comment has been minimized.

Sign in to view

clarfonthey force-pushed the nearest_char_boundary branch from d2b0e71 to 9616038 Compare June 29, 2021 03:17

This comment has been minimized.

Sign in to view

clarfonthey force-pushed the nearest_char_boundary branch from 9616038 to 0b94cfd Compare June 29, 2021 03:40

This comment has been minimized.

Sign in to view

clarfonthey force-pushed the nearest_char_boundary branch from 0b94cfd to 77823d8 Compare June 29, 2021 03:54

This comment has been minimized.

Sign in to view

clarfonthey force-pushed the nearest_char_boundary branch from 77823d8 to c6f504b Compare June 29, 2021 16:26

This comment has been minimized.

Sign in to view

bluss reviewed Jul 1, 2021

View reviewed changes

library/core/src/str/mod.rs Show resolved Hide resolved

clarfonthey force-pushed the nearest_char_boundary branch 2 times, most recently from a814a25 to 8b6d128 Compare July 1, 2021 21:58

This comment has been minimized.

Sign in to view

falk-hueffner reviewed Nov 6, 2021

View reviewed changes

dtolnay requested changes Dec 1, 2021

View reviewed changes

library/core/src/str/mod.rs Outdated Show resolved Hide resolved

library/core/src/str/mod.rs Outdated Show resolved Hide resolved

dtolnay added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 1, 2021

clarfonthey force-pushed the nearest_char_boundary branch from b943ffa to 3cbbbc1 Compare January 29, 2022 23:34

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 29, 2022

clarfonthey force-pushed the nearest_char_boundary branch from 3cbbbc1 to 3681dd0 Compare February 7, 2022 17:30

clarfonthey mentioned this pull request Feb 7, 2022

Tracking Issue for str::{floor, ceil}_char_boundary #93743

Open

3 tasks

clarfonthey force-pushed the nearest_char_boundary branch from 3681dd0 to e700670 Compare February 7, 2022 17:35

This comment has been minimized.

Sign in to view

Add {floor,ceil}_char_boundary methods to str

edd318c

clarfonthey force-pushed the nearest_char_boundary branch from e700670 to edd318c Compare February 7, 2022 18:34

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 7, 2022

matthiaskrgr mentioned this pull request Feb 8, 2022

Rollup of 9 pull requests #93762

Merged

bors merged commit 1f841fc into rust-lang:master Feb 8, 2022

rustbot added this to the 1.60.0 milestone Feb 8, 2022

clarfonthey deleted the nearest_char_boundary branch February 10, 2022 12:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add {floor,ceil}_char_boundary methods to str #86497

Add {floor,ceil}_char_boundary methods to str #86497

clarfonthey commented Jun 20, 2021 •

edited

Loading

rust-highfive commented Jun 20, 2021

This comment has been minimized.

This comment has been minimized.

clarfonthey commented Jun 21, 2021

scottmcm commented Jun 21, 2021

clarfonthey commented Jun 21, 2021

Mark-Simulacrum commented Jun 28, 2021 •

edited

Loading

clarfonthey commented Jun 28, 2021 •

edited

Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

clarfonthey commented Jun 30, 2021 •

edited

Loading

This comment has been minimized.

clarfonthey commented Oct 25, 2021

falk-hueffner Nov 6, 2021

dtolnay left a comment

crlf0710 commented Dec 17, 2021

clarfonthey commented Dec 29, 2021

clarfonthey commented Jan 29, 2022

bors commented Feb 6, 2022

scottmcm commented Feb 6, 2022

clarfonthey commented Feb 7, 2022

This comment has been minimized.

clarfonthey commented Feb 7, 2022

scottmcm commented Feb 7, 2022

bors commented Feb 7, 2022

		@@ -213,7 +209,81 @@ impl str {
		None => index == self.len(),

		// This is bit magic equivalent to: b < 128 \|\| b >= 192

Add {floor,ceil}_char_boundary methods to str #86497

Add {floor,ceil}_char_boundary methods to str #86497

Conversation

clarfonthey commented Jun 20, 2021 • edited Loading

rust-highfive commented Jun 20, 2021

This comment has been minimized.

This comment has been minimized.

clarfonthey commented Jun 21, 2021

scottmcm commented Jun 21, 2021

clarfonthey commented Jun 21, 2021

Mark-Simulacrum commented Jun 28, 2021 • edited Loading

clarfonthey commented Jun 28, 2021 • edited Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

clarfonthey commented Jun 30, 2021 • edited Loading

This comment has been minimized.

clarfonthey commented Oct 25, 2021

falk-hueffner Nov 6, 2021

Choose a reason for hiding this comment

dtolnay left a comment

Choose a reason for hiding this comment

crlf0710 commented Dec 17, 2021

clarfonthey commented Dec 29, 2021

clarfonthey commented Jan 29, 2022

bors commented Feb 6, 2022

scottmcm commented Feb 6, 2022

clarfonthey commented Feb 7, 2022

This comment has been minimized.

clarfonthey commented Feb 7, 2022

scottmcm commented Feb 7, 2022

bors commented Feb 7, 2022

clarfonthey commented Jun 20, 2021 •

edited

Loading

Mark-Simulacrum commented Jun 28, 2021 •

edited

Loading

clarfonthey commented Jun 28, 2021 •

edited

Loading

clarfonthey commented Jun 30, 2021 •

edited

Loading