Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase lookup speed and decrease binary size #13

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

PeterReid
Copy link

This patch changes the is_xid_start and is_xid_continue checks to use a list of bits instead of a binary search for some lower part of the unicode range. The code generating script chooses the cutoff for switching strategies to minimize binary size. For my benchmark on my machine, it reduces the runtime from 2.2 seconds to 1.3 seconds and decreases the binary size (release mode) by 4096 bytes. (I would like to get a more precise look at the effect on size, but cargo-bloat is not helping me for some reason.)

I used the following benchmark for this:

fn main() {
    let t0 = Instant::now();
    let mut starts = 0;
    for _ in 0..10000 {
        for i in 1..0x3000 {
            if let Some(c) = char::from_u32(i) {
                if UnicodeXID::is_xid_start(c) {
                    starts += 1;
                }
            }
        }
    }
    let tf = Instant::now();
    println!("{} {:?}", starts, tf-t0);
}

@matklad
Copy link
Contributor

matklad commented Sep 6, 2019

@PeterReid this sounds cool!

Have you seen https://github.com/BurntSushi/ucd-generate? I think it makes sense to replace our python generation script with that. So perhaps it’s best to start with a PR against that crate? Note that ucd-generate has three different strategies for generating the tables, so it makes sense to benchmark against those as well.

@matklad
Copy link
Contributor

matklad commented Sep 6, 2019

For performance, it also might make sense to fast-path ASCII at the call site: https://github.com/rust-lang/rust/blob/6b5f9b2e973e438fc1726a2d164d046acd80b170/src/librustc_lexer/src/lib.rs#L153

@PeterReid
Copy link
Author

I think you are right that the fast-path for ASCII would be another optimization. The rust compiler already has a fast path for that, so it would slow things down in that case unless the compiler was clever enough to see the redundancy across crates: https://github.com/rust-lang/rust/blob/618768492f0c731fcb770dc2d178abe840846419/src/librustc_lexer/src/lib.rs#L146

Redoing this pull request by rewriting ucd-generate is an interesting idea, but that is up to someone else.

epage pushed a commit to epage/unicode-xid that referenced this pull request Apr 28, 2021
See rust-lang/rust's `src/librustc_lexer/src/lib.rs`

Idea came from unicode-rs#13
epage pushed a commit to epage/unicode-xid that referenced this pull request Apr 28, 2021
This is based on a one-off benchmark from unicode-rs#13
@epage epage mentioned this pull request Apr 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants