Add {floor,ceil}_char_boundary methods to str #86497
(rust-highfive has picked a reviewer for you, use r? to override)
379b542
to
8699967
Compare
(I know how to fix the error but my desktop is in literal pieces at the moment, so, I'll fix it this weekend.)
The functionality of these seems like a good thing to provide 👍 For the bikeshed, mixing "nearest" and "min/max" in the name is a bit confusing to me. Maybe Or maybe, with subtly different semantics That raises the question of endpoints, I guess. What's |
Forgetting the names (and I agree, it's hard to come up with something reasonable) the current behaviour here is to panic. The one which is upper-bounded will currently not panic, which I think is good for truncating behaviour -- if you want your string to be under, say, 20 bytes, and it's empty, that's fine. Lower-bounded ones must panic if the index is above len, since they can't go any lower. If we were to use the concept of bounding the results to a specific character, then our issue is that we treat past the end as an "end of string" character, and this isn't C. My initial terms for the methods were "largest char boundary not exceeding" and "smallest char boundary not below," and I figured those names sucked, but that's the intent that I wanted to preserve in the methods themselves.
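To make the endpoint behaviour described above concrete, here is a minimal sketch on stable Rust, built only on `str::is_char_boundary`. The names `floor_boundary` and `ceil_boundary` are hypothetical stand-ins, not the names or implementation from this PR; the sketch only illustrates the "upper-bounded never panics, lower-bounded panics past `len`" semantics from the comment.

```rust
/// Largest char boundary not exceeding `index` (hypothetical name).
/// Never panics: an index at or past the end yields `s.len()`.
fn floor_boundary(s: &str, index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    let mut i = index;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Smallest char boundary not below `index` (hypothetical name).
/// Panics if `index > s.len()`, since no boundary exists at or above it.
fn ceil_boundary(s: &str, index: usize) -> usize {
    assert!(index <= s.len(), "index {} is past the end of the string", index);
    let mut i = index;
    // `s.len()` is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3, so the string is 6 bytes long
    assert_eq!(floor_boundary(s, 2), 1); // rounds down out of the middle of 'é'
    assert_eq!(ceil_boundary(s, 2), 3); // rounds up out of the middle of 'é'
    assert_eq!(floor_boundary("", 20), 0); // truncating an empty string is fine
    assert_eq!(floor_boundary(s, 100), s.len());
}
```

Calling `ceil_boundary(s, 7)` on the 6-byte string above would panic, which is the asymmetry the comment argues for.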
r? @dtolnay as T-libs-api to evaluate the addition of unstable API surface area. Implementations look OK to me modulo comments but I didn't evaluate the correctness of the search criteria (i.e., as casts and equality to -0x40).
Note that the -0x40 was taken directly from (Edit: I did that) |
8699967
to
d2b0e71
Compare
d2b0e71
to
9616038
Compare
9616038
to
0b94cfd
Compare
0b94cfd
to
77823d8
Compare
77823d8
to
c6f504b
Compare
That test passed for me locally, and I'm not 100% sure on how to fix it. I also know that u8::is_utf8_char_boundary should probably be inlined. Mostly want to wait for a more affirmative "yes, I like this API ± name bikeshedding" before I put more work into it.
a814a25
to
8b6d128
Compare
Rebased, would be nice if someone could take a look before I have to rebase again. 😉
library/core/src/str/mod.rs
Outdated
@@ -213,7 +209,81 @@ impl str {
    None => index == self.len(),

    // This is bit magic equivalent to: b < 128 || b >= 192
This comment should probably move to is_utf8_char_boundary.
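For readers who want to check the "bit magic" mentioned in the diff comment, the following sketch compares the signed-cast test against the plain comparison for every possible byte value. The free function here is a hypothetical stand-in written for illustration; it is not copied from the PR.

```rust
/// Hypothetical free-function stand-in for the helper discussed in the review.
fn is_utf8_char_boundary(b: u8) -> bool {
    // UTF-8 continuation bytes are 0b10xx_xxxx, i.e. 128..=191.
    // Reinterpreted as i8 they become -128..=-65, so every other byte
    // (ASCII or a lead byte) is >= -64, which is -0x40.
    (b as i8) >= -0x40
}

fn main() {
    // Exhaustively compare the cast trick with the readable form.
    for b in 0..=u8::MAX {
        assert_eq!(is_utf8_char_boundary(b), b < 128 || b >= 192);
    }
    println!("bit trick agrees with the plain comparison for all byte values");
}
```

The cast turns two range checks into a single signed comparison, which is why the readable form only appears in the comment.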
These method names and signatures look good to me. I was skeptical about the previous iterations but these are nice. Thanks for all the discussion above.
Please open a tracking issue and update the unstable attributes in the PR.
This is going to require more extensive test coverage prior to landing. See if you can find where the unit tests are for the other methods in this impl block.
@clarfonthey Ping from triage, would you mind addressing the comments above? Thanks!
Hey, sorry that I took so long getting to this. Will add some more tests and see what I can do about fixing the implementation.
b943ffa
to
3cbbbc1
Compare
Okay, this is finally ready for a proper review. I added a (hopefully) thorough set of tests and fixed the issue with @rustbot ready |
☔ The latest upstream changes (presumably #90414) made this pull request unmergeable. Please resolve the merge conflicts.
This is looking pretty good to me, code-wise, now. Can you please add the tracking issue (as mentioned in #86497 (review))? With that I think it'll be good to go.
3cbbbc1
to
3681dd0
Compare
Tracking issue made, also fixed the test being called Should be ready for final CR |
3681dd0
to
e700670
Compare
e700670
to
edd318c
Compare
(I had accidentally committed the wrong version of a submodule when rebasing, so, hopefully it doesn't fail this time.)
📌 Commit edd318c has been approved by |
…askrgr
Rollup of 9 pull requests
Successful merges:
- rust-lang#86497 (Add {floor,ceil}_char_boundary methods to str)
- rust-lang#92695 (Add `#[no_coverage]` tests for nested functions)
- rust-lang#93521 (Fix hover effects in sidebar)
- rust-lang#93568 (Include all contents of first line of scraped item in Rustdoc)
- rust-lang#93569 (rustdoc: correct unclosed HTML tags as generics)
- rust-lang#93672 (update comment wrt const param defaults)
- rust-lang#93715 (Fix horizontal trim for block doc comments)
- rust-lang#93721 (rustdoc: Special-case macro lookups less)
- rust-lang#93728 (Add in ValuePair::Term)
Failed merges:
r? `@ghost`
`@rustbot` modify labels: rollup
This is technically already used internally by the standard library in the form of truncate_to_char_boundary.
Essentially these are two building blocks for approximate string truncation, where you want to cut off the string at approximately a given length in bytes but don't know exactly where the character boundaries lie. It's also a good candidate for the standard library because it is easy to implement naively but difficult to optimise properly. The existing code used for error messages takes the naive approach; this code explicitly checks only a window of 4 bytes, since a boundary must lie in that range, which should also make it possible to vectorise.
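As a rough illustration of the 4-byte-window idea on stable Rust, the sketch below truncates a string to at most a given number of bytes by inspecting only the cut position and the three bytes before it. `truncate_approx` is a hypothetical helper written for this example; it is not the PR's implementation and makes no attempt at vectorisation.

```rust
/// Hypothetical helper: the longest prefix of `s` that is at most
/// `max_bytes` long and ends on a char boundary.
fn truncate_approx(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let bytes = s.as_bytes();
    // A UTF-8 sequence is at most 4 bytes long, so a boundary must lie
    // among `max_bytes` and the 3 positions before it.
    let lower = max_bytes.saturating_sub(3);
    let cut = (lower..=max_bytes)
        .rev()
        // Same signed-cast test as the diff: true for any non-continuation byte.
        .find(|&i| (bytes[i] as i8) >= -0x40)
        .expect("valid UTF-8 has a boundary within any 4-byte window");
    &s[..cut]
}

fn main() {
    let s = "ab😀cd"; // the emoji occupies bytes 2..6; the string is 8 bytes long
    assert_eq!(truncate_approx(s, 5), "ab"); // byte 5 is inside the emoji
    assert_eq!(truncate_approx(s, 6), "ab😀"); // byte 6 is already a boundary
    assert_eq!(truncate_approx(s, 100), s); // nothing to cut
    assert_eq!(truncate_approx("", 20), "");
}
```

Because the search is bounded by the maximum UTF-8 sequence length, the work per call is constant regardless of the string's length.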
Although this method doesn't take into account graphemes or other properties, it would still be a required building block for splitting that takes those into account. For example, if you wanted to split at a grapheme boundary, you could take your approximate splitting point and then determine the graphemes immediately following and preceding the split. If you then notice that these two graphemes could be merged, you can decide to either include the whole grapheme or exclude it depending on whether you decide splitting should shrink or expand the string.
This takes the most conservative approach and just offers the raw indices to the user, who can decide how to use them. That way, the API stays as useful as possible while adding as few methods as possible.
(Note: I'll add some tests and a tracking issue if it's decided that this is worth including.)