Simplify CharIndices.next #61070

Closed

jridgewell
Contributor

char::len_utf8 has been stable since #49698, which makes this a little easier to follow.
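
As a rough illustration of why the two approaches are interchangeable, the sketch below checks that the number of bytes consumed by one `Chars::next` call (remaining length before minus remaining length after) equals `char::len_utf8` for the character returned. The helper names `consumed_by_subtraction` and `lengths_agree` are hypothetical and are not part of this PR or of the standard library.

```rust
use std::str::Chars;

/// Hypothetical helper: bytes consumed by one `Chars::next` call, measured
/// by subtracting the remaining length after the call from the length before.
pub fn consumed_by_subtraction(iter: &mut Chars<'_>) -> Option<(char, usize)> {
    let pre_len = iter.as_str().len();
    let ch = iter.next()?;
    Some((ch, pre_len - iter.as_str().len()))
}

/// Hypothetical helper: true if, for every char in `s`, the subtraction
/// above agrees with `char::len_utf8`.
pub fn lengths_agree(s: &str) -> bool {
    let mut iter = s.chars();
    while let Some((ch, consumed)) = consumed_by_subtraction(&mut iter) {
        if consumed != ch.len_utf8() {
            return false;
        }
    }
    true
}
```

This only shows that the results match; it says nothing about which form the optimizer handles better.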

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @rkruppe (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 23, 2019
@Centril
Contributor

Centril commented May 23, 2019

@bors rollup

@Centril
Contributor

Centril commented May 23, 2019

cc @SimonSapin

@hanna-kruppe
Contributor

Thanks, LGTM!

@bors r+

@bors
Contributor

bors commented May 23, 2019

📌 Commit 7af83dc has been approved by rkruppe

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 23, 2019
@SimonSapin
Contributor

char::len_utf8 is implemented with three branches, which seems potentially more costly than the subtraction that it replaces. Have you checked that this optimizes well? It’s possible that the optimizer realizes that this computation is redundant with next_code_point, but it’s not obvious.
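
For context, the three branches correspond to the UTF-8 length boundaries at 0x80, 0x800, and 0x10000. The following is a minimal sketch of that shape, not a copy of the libcore source; `len_utf8_sketch` and `sketch_matches_std` are hypothetical names used only for illustration.

```rust
/// Hypothetical restatement of the branch structure of `char::len_utf8`.
/// Assumption: this mirrors the standard UTF-8 length boundaries; it is
/// not copied from libcore.
pub fn len_utf8_sketch(c: char) -> usize {
    let code = c as u32;
    if code < 0x80 {
        1 // ASCII
    } else if code < 0x800 {
        2
    } else if code < 0x1_0000 {
        3
    } else {
        4 // supplementary planes
    }
}

/// Checks the sketch against `char::len_utf8` for every valid `char`.
/// Iterating a `char` range skips the surrogate gap automatically.
pub fn sketch_matches_std() -> bool {
    ('\0'..=char::MAX).all(|c| len_utf8_sketch(c) == c.len_utf8())
}
```

Whether the compiler can fold these comparisons into the decoding already done by `Chars::next` is exactly the open question here; the sketch only shows what work is being asked for.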

@hanna-kruppe
Contributor

@bors r- (let's wait with merging until this performance question is cleared up)

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 23, 2019
@SimonSapin
Contributor

I don’t read assembly fluently enough to conclude anything from this:

@jridgewell
Contributor Author

jridgewell commented May 24, 2019

I created a sample benchmark to compare the two implementations:

benchmark
#![feature(test)]
extern crate test;
use std::str::Chars;

pub struct CharIndices<'a> {
    front_offset: usize,
    iter: Chars<'a>,
}

pub fn before(self_: &mut CharIndices) -> Option<(usize, char)> {
    let pre_len = self_.iter.as_str().len();
    match self_.iter.next() {
        None => None,
        Some(ch) => {
            let index = self_.front_offset;                
            let len = self_.iter.as_str().len();
            self_.front_offset += pre_len - len;
            Some((index, ch))
        }
    }
}

pub fn after(self_: &mut CharIndices) -> Option<(usize, char)> {
    let ch = self_.iter.next()?;
    let index = self_.front_offset;
    self_.front_offset += ch.len_utf8();
    Some((index, ch))
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn before(b: &mut Bencher) {
        let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";
        let len = s.chars().count();

        b.iter(|| {
            let mut chars = CharIndices { front_offset: 0, iter: s.chars() };
            let mut i = 0;

            while let Some(_) = super::before(&mut chars) {
                i += 1;
            }
            assert_eq!(i, len);
        });
    }

    #[bench]
    fn after(b: &mut Bencher) {
        let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";
        let len = s.chars().count();

        b.iter(|| {
            let mut chars = CharIndices { front_offset: 0, iter: s.chars() };
            let mut i = 0;

            while let Some(_) = super::after(&mut chars) {
                i += 1;
            }
            assert_eq!(i, len);
        });
    }
}
    Finished release [optimized] target(s) in 0.00s
     Running target/release/deps/tmp-188dab753e97d367

running 2 tests
test tests::after  ... bench:          44 ns/iter (+/- 3)
test tests::before ... bench:          44 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

It appears they're roughly the same speed? It might be that I'm just doing something wrong; the assembly definitely looks more complex in the after output above.

Before: https://rust.godbolt.org/z/sIJTXu
After: https://rust.godbolt.org/z/7bNVDu

@jridgewell
Contributor Author

jridgewell commented May 24, 2019

Never mind: when I updated the benchmark to actually use the index, the after version is slower:

benchmark using index
#![feature(test)]
extern crate test;
use std::str::Chars;

pub struct CharIndices<'a> {
    front_offset: usize,
    iter: Chars<'a>,
}

pub fn before(self_: &mut CharIndices) -> Option<(usize, char)> {
    let pre_len = self_.iter.as_str().len();
    match self_.iter.next() {
        None => None,
        Some(ch) => {
            let index = self_.front_offset;                
            let len = self_.iter.as_str().len();
            self_.front_offset += pre_len - len;
            Some((index, ch))
        }
    }
}

pub fn after(self_: &mut CharIndices) -> Option<(usize, char)> {
    let ch = self_.iter.next()?;
    let index = self_.front_offset;
    self_.front_offset += ch.len_utf8();
    Some((index, ch))
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn before(b: &mut Bencher) {
        let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";
        let len = s.len();

        b.iter(|| {
            let mut chars = CharIndices { front_offset: 0, iter: s.chars() };
            let mut i = 0;

            while let Some((index, _)) = super::before(&mut chars) {
                i = index;
            }
            assert_eq!(i + 1, len);
        });
    }

    #[bench]
    fn after(b: &mut Bencher) {
        let s = "ศไทย中华Việt Nam; Mary had a little lamb, Little lamb";
        let len = s.len();

        b.iter(|| {
            let mut chars = CharIndices { front_offset: 0, iter: s.chars() };
            let mut i = 0;

            while let Some((index, _)) = super::after(&mut chars) {
                i = index;
            }
            assert_eq!(i + 1, len);
        });
    }
}
   Compiling tmp v0.1.0 (/Users/jridgewell/tmp)
    Finished release [optimized] target(s) in 0.45s
     Running target/release/deps/tmp-188dab753e97d367

running 2 tests
test tests::after  ... bench:          67 ns/iter (+/- 8)
test tests::before ... bench:          55 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
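
For anyone who wants to try a similar comparison on stable Rust, here is a minimal sketch that avoids the nightly-only `test` crate. It assumes Rust 1.66 or later for `std::hint::black_box`; `CharIndices::new`, `drain`, and `time_it` are hypothetical helpers added for illustration. It is a much cruder measurement than `#[bench]`, so treat any numbers it produces as indicative only.

```rust
use std::hint::black_box;
use std::str::Chars;
use std::time::{Duration, Instant};

pub struct CharIndices<'a> {
    front_offset: usize,
    iter: Chars<'a>,
}

impl<'a> CharIndices<'a> {
    /// Hypothetical constructor, added for convenience.
    pub fn new(s: &'a str) -> Self {
        CharIndices { front_offset: 0, iter: s.chars() }
    }
}

/// Existing approach: advance the offset by the change in remaining length.
pub fn before(self_: &mut CharIndices<'_>) -> Option<(usize, char)> {
    let pre_len = self_.iter.as_str().len();
    let ch = self_.iter.next()?;
    let index = self_.front_offset;
    self_.front_offset += pre_len - self_.iter.as_str().len();
    Some((index, ch))
}

/// Proposed approach: advance the offset by `char::len_utf8`.
pub fn after(self_: &mut CharIndices<'_>) -> Option<(usize, char)> {
    let ch = self_.iter.next()?;
    let index = self_.front_offset;
    self_.front_offset += ch.len_utf8();
    Some((index, ch))
}

/// Drains the iterator with `step` and returns the last index seen.
/// Each index goes through `black_box` so the offset bookkeeping is
/// not optimized away.
pub fn drain(
    s: &str,
    step: impl Fn(&mut CharIndices<'_>) -> Option<(usize, char)>,
) -> usize {
    let mut it = CharIndices::new(s);
    let mut last = 0;
    while let Some((index, _)) = step(&mut it) {
        last = black_box(index);
    }
    last
}

/// Times `rounds` full passes over `s` using `step`.
pub fn time_it(
    s: &str,
    rounds: u32,
    step: impl Fn(&mut CharIndices<'_>) -> Option<(usize, char)> + Copy,
) -> Duration {
    let start = Instant::now();
    for _ in 0..rounds {
        black_box(drain(black_box(s), step));
    }
    start.elapsed()
}
```

Calling `time_it(s, 1_000_000, before)` and `time_it(s, 1_000_000, after)` in a release build and comparing the two durations gives a rough equivalent of the benchmark above.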

@jridgewell jridgewell closed this May 27, 2019
@jridgewell jridgewell deleted the charindices-next branch March 3, 2020 06:03
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author.