Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement compression of DFA states. #226

Merged
merged 3 commits into from
May 6, 2016

Conversation

SeanRBurton
Copy link
Contributor

Implements the suggestion from issue #199.

@BurntSushi
Copy link
Member

Thanks! This was assigned to @killercup...

/// is a zero-width assertion.
flags: StateFlags,
}
struct State{data: Box<[u8]>}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use the same style of formatting as the rest of the code? Probably struct State(Box<[u8]>) would be nice here.

@BurntSushi
Copy link
Member

BurntSushi commented May 5, 2016

OK, I have a few nits but this otherwise looks good! Thank you.

In the future, could you please get in touch with me or make a comment on the issue tracker before working on a non-trivial PR?

@killercup
Copy link
Member

Oh, yeah, I was working on that. But that's okay, your code is basically what I have written as well and then some more.

Just FYI, I originally had the idea to represent the two different sizes (i8 and i32) with a postfixed bit on the first byte, to allow i7 in one byte and i31 in 4 bytes, but I'm sure that while maybe looking more clever it would've been more error prone and probably slower to read in practice.

Now that I think about it, InstPtr is an u32, right? Can we correctly represent all possible deltas with i32s? (Or maybe I'm just missing something, it is pretty early here :))

@BurntSushi
Copy link
Member

Now that I think about it, InstPtr is an u32, right? Can we correctly represent all possible deltas with i32s? (Or maybe I'm just missing something, it is pretty early here :))

Do you mean encode every delta as an i32? I think if you do that, you'll lose the compression benefits? I might be misunderstanding though.

@BurntSushi BurntSushi merged commit 44024df into rust-lang:master May 6, 2016
@BurntSushi
Copy link
Member

Thanks @SeanRBurton!

@BurntSushi
Copy link
Member

probably slower to read in practice

I think if the performance of reading these instruction pointers becomes a factor, then the DFA has already lost. :-) (One can only hope it will realize it, quit and fall back to the slow-but-faster-in-this-case Pike VM.)

@SeanRBurton
Copy link
Contributor Author

RE: #226 (comment)

I guess he means that if a delta > i32::MAX then we can't represent it in 4 bytes. Maybe we should just say that insts.len() has to be <= i32::MAX to use the dfa.

@killercup
Copy link
Member

I was only thinking about the upper bound, where this uses 4 u8's to build a i32. It's probably irrelevant in praxis, but since InstrPtrs are u32, there could in theory be a jump from u32::MIN to u32::MAX which the i32 delta can't represent. (Maybe just add a debug_assert? Are there overflow checks?)

Oh, and BTW, are there any benchmarks? Andrew mentioned in the original issue this could improve the class negation benchmark (IIRC).

Am 06.05.2016 um 12:56 schrieb Andrew Gallant notifications@github.com:

Now that I think about it, InstPtr is an u32, right? Can we correctly represent all possible deltas with i32s? (Or maybe I'm just missing something, it is pretty early here :))

Do you mean encode every delta as an i32? I think if you do that, you'll lose the compression benefits? I might be misunderstanding though.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

@BurntSushi
Copy link
Member

BurntSushi commented May 6, 2016

I guess he means that if a delta > i32::MAX then we can't represent it in 4 bytes. Maybe we should just say that insts.len() has to be <= i32::MAX to use the dfa.

Ah, yeah, we only use u32::MAX now: https://github.com/rust-lang-nursery/regex/blob/master/src/dfa.rs#L71 --- Changing that to i32::MAX seems reasonable.

Oh, and BTW, are there any benchmarks? Andrew mentioned in the original issue this could improve the class negation benchmark (IIRC).

I haven't run them yet, but I will before the next release to make sure there are no regressions.

It will only impact the class negation benchmark if it brings memory usage down from 4-5MB or so to under 2MB. (When I last thought about, I didn't think this optimization was capable to doing that much.)

BurntSushi added a commit that referenced this pull request May 7, 2016
This slightly modifies the implementation in #226 to use variable width
integers. This saves a touch more space with regexes with huge
alternations spanning larger-than-127 instructions (typically from Unicode
character classes).

This also adjusts `approximate_size`, which is the actual gatekeeper
behind deciding whether the DFA has filled its cache or not. The approximation
is now reduced slightly to account for the space savings.

The variable integer encoding used is from Protocol Buffers, documented
here: https://developers.google.com/protocol-buffers/docs/encoding#varints
SeanRBurton pushed a commit to SeanRBurton/regex that referenced this pull request May 20, 2016
This slightly modifies the implementation in rust-lang#226 to use variable width
integers. This saves a touch more space with regexes with huge
alternations spanning larger-than-127 instructions (typically from Unicode
character classes).

This also adjusts `approximate_size`, which is the actual gatekeeper
behind deciding whether the DFA has filled its cache or not. The approximation
is now reduced slightly to account for the space savings.

The variable integer encoding used is from Protocol Buffers, documented
here: https://developers.google.com/protocol-buffers/docs/encoding#varints
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants