Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in literal extraction from exact-count repetition patterns #1367

Closed
wants to merge 1 commit into from

Conversation

jakubadamw
Copy link
Contributor

Closes #1319.

@jakubadamw
Copy link
Contributor Author

jakubadamw commented Sep 6, 2019

I'll honestly say that I am not sure if this is the right fix. I just don't know the code. But this particular line looked suspicious.

I did try to verify this empirically by writing a fuzzer to look out for more bugs by fuzz-feeding the grep_matcher::RegexMatcher with inputs derived from a particular regular expression with the regex_generate crate, i.e. ones that by definition ought to match. As of this comment the fuzzer has not found any more failing cases (and it had found the one described in the bug). However, that still is no proof that this fix is correct.

@jakubadamw
Copy link
Contributor Author

jakubadamw commented Sep 6, 2019

One of the failing fuzz-generated cases of the sort related to this issue:

(lldb) p needle
(alloc::string::String) $0 = "TAACCCCGTTGTACAGTTGCAGCCGCA"
(lldb) p haystack
(alloc::string::String) $1 = "TTTTTTTTTTAACCCCGTTGTACAGTTGCAGCCGCATATTTTTTTTT"
(lldb) p regex_string
(alloc::string::String) $2 = "TAA[ATCG]{5,5}TTGTACAGTTGCAGCCGCA"

@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Feb 17, 2020
BurntSushi pushed a commit that referenced this pull request Feb 17, 2020
This appears to be another transcription bug from copying this code from
the prefix literal detection from inside the regex crate. Namely, when
it comes to inner literals, we only want to treat counted repetition as
two separate cases: the case when the minimum match is 0 and the case
when the minimum match is more than 0. In the former case, we treat
`e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as
just `e`.

We could definitely do better here. e.g., This means regexes like
`(foo){10}` will only have `foo` extracted as a literal, where searching
for the full literal would likely be faster.

The actual bug here was that we were not implementing this logic
correctly. Namely, we weren't always "cutting" the literals in the
second case to prevent them from being expanded.

Fixes #1319, Closes #1367
Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay, but thanks so much for this PR! I agree that this does fix the problem. I simplified this just a bit by just removing the if n < min { check above and uncondionally running lits.cut().

This code is completely inscrutable, but the essence of the problem is that the literal detector was continuing the extension of literals when it shouldn't have been.

I'll be merging this in #1486, where your commit message has grown a few details if you're curious.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

match bug
2 participants