Unable to use on queries that start with a single quote (', ASCII char 39) #493
Could you share |
Sure:
Also, I am not sure what the expected behaviour is if a word can start with a single quote, but if you start with a space you still get this error.
Wow, that was fast, thanks.
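@pindash No problem. That is an epic blog post! Nice work! There are definitely some details in there that are questionable though...
Firstly, I'm kind of curious to see so much focus put on memory maps. Did you happen to read through the sections on memory maps in my blog posts? They tend to provide a small speed boost on very large files, but that's about it.
Secondly, I'm impressed you were able to get so much mileage out of rg -f. I've never actually benchmarked ripgrep's -f flag against GNU grep's -f flag, so I don't think it's fair to say that I think it's faster. Namely, its current implementation is a quick hack. Note though that the regex engine may elect to use Aho-Corasick, so you can't really say whether it's being used or not. (Although, given the number of patterns, the regex engine is almost certainly avoiding all literal optimizations.) A better implementation would use Aho-Corasick explicitly inside ripgrep when the -F flag is used.
Thirdly, GNU grep doesn't actually use Aho-Corasick. It uses Commentz-Walter.
I don't think a blanket statement like this is accurate. Have you seen the benchmarks in my blog post? They might have fixed the performance problems when searching simple ASCII literals, but they certainly haven't fixed them when using actual regexes like \w. :-)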
Unfortunately, this analysis is wrong. The Teddy algorithm is only used when there is a small number of common literals identified. In a large list like the one you're using, it won't be used. (And even then, you didn't say how you install ripgrep, which means you might not even have SIMD turned on.)
That can be tweaked with --dfa-size-limit and --regex-size-limit.
I think this conclusion does turn out to be right though. :-) Or at least, part of the story.
This sounds like a bug. Nice find. :-) I created #497.
I am starting to think that I shouldn't say that ripgrep (or the regex engine) actually uses Boyer-Moore. The reason why is because it doesn't actually use a shift table, which is kind of a hallmark of Boyer-Moore, and therefore, there's no "skipping." Instead, it's fast because it picks "rare" bytes to feed to memchr, which maximizes the amount of time search stays in a highly optimized SIMD routine.
With that said, this certainly has no effect on the commands you're running. Boyer-Moore is for single substring search, not multiple substring search.
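The rare-byte idea described above can be sketched roughly as follows. This is only an illustration in Python, not ripgrep's or the regex crate's actual code: the COMMON frequency ordering is a made-up assumption, the function names are hypothetical, and bytes.find stands in for memchr.

```python
# Assumed ordering of bytes from most to least common in English-like text.
# This table is an illustration only, not the table any real engine uses.
COMMON = b" etaoinshrdlcumwfgypbvk"


def rank(byte: int) -> int:
    """Higher rank means the byte is assumed to be more common."""
    i = COMMON.find(bytes([byte]))
    return len(COMMON) - i if i >= 0 else 0


def rare_byte_offset(needle: bytes) -> int:
    """Offset of the byte in `needle` that is assumed to be rarest."""
    return min(range(len(needle)), key=lambda i: rank(needle[i]))


def rare_byte_find(haystack: bytes, needle: bytes) -> int:
    """Return the first index of `needle` in `haystack`, or -1."""
    if not needle:
        return 0
    off = rare_byte_offset(needle)
    rare = needle[off:off + 1]
    pos = off  # the rare byte cannot appear before this offset in a match
    while True:
        # Fast scan for the single rare byte (the memchr stand-in).
        hit = haystack.find(rare, pos)
        if hit < 0:
            return -1
        start = hit - off
        # Only now verify the full needle around the candidate position.
        if haystack.startswith(needle, start):
            return start
        pos = hit + 1
```

The point of the design is that the inner loop spends most of its time in the single-byte scan, and the full comparison only runs at the (hopefully few) positions where the rare byte occurs.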
Not sure why you're saying this. Are you actually seeing a huge difference between grep with memory maps and grep without? (Using the same version of grep!) You should be able to run rg with and without memory maps too, e.g., rg --mmap and rg --no-mmap.
I think because of those assumptions, you're fundamentally solving a different problem. :-) If you made those assumptions in a custom search tool, then I wonder what the results would be!
[Not sure if this should be continued here]
@BurntSushi
Thank you, I will try and go through all of your points and fix any of my
errors and assumptions. It probably deserves its own post with a Python
runnable [you inspired me] so that tests are repeatable.
I will verify with empirical evidence, but I expected memory maps to be
faster [and they were, but again I need to verify], especially if many
processes are reading the same large file, and especially since this file,
though large, fits comfortably in the system's main memory and therefore
in the page cache.
Also, Boyer-Moore has a weakness if you have a wildcard character at the end
of a string, whereas your rare byte optimization will work regardless, so
long as a rare byte gets chosen.
Also, if ripgrep is not using Teddy on 200-400 patterns, why does it
perform faster on a sorted pattern file, where the tokens all share a
common first letter, for example?
I think because of those assumptions, you're fundamentally solving a
different problem. :-)
I quite agree, the problem ripgrep is solving is really more general, and
technically an inverted file index would be the appropriate solution for
something like this. At the same time when doing a one off analysis, the
overhead of building the index is too high a penalty.
Your thoroughness and expertise are much appreciated.
On May 30, 2017 7:17 AM, "Andrew Gallant" <notifications@github.com> wrote:
@pindash <https://github.com/pindash> No problem. That is an epic blog
post! Nice work! There are definitely some details in there that are
questionable though...
Firstly, I'm kind of curious to see so much focus put on memory maps. Did
you happen to read through the sections on memory maps in my blog posts?
They tend to provide a small speed boost on very large files, but that's
about it.
Secondly, I'm impressed you were able to get so much mileage out of rg -f.
I've never actually benchmarked ripgrep's -f flag against GNU grep's -f
flag, so I don't think it's fair to say that I think it's faster. Namely,
its current implementation is a quick hack. Note though that the regex
engine may elect to use Aho-Corasick, so you can't really say whether it's
being used or not. (Although, given the number of patterns, the regex
engine is almost certainly avoiding all literal optimizations.) A better
implementation would use Aho-Corasick explicitly inside ripgrep when the -F
flag is used.
Thirdly, GNU grep doesn't actually use Aho-Corasick. It uses
Commentz-Walter.
Also, the newest grep had somehow increased its speed on non-ASCII
characters. So some of the traditional speedups like setting LC_ALL=C,
which forces grep to use ASCII characters, don’t actually increase the
speed.
I don't think a blanket statement like this is accurate. Have you seen the
benchmarks in my blog post? They might have fixed the performance problems
when searching simple ASCII literals, but they certainly haven't fixed them
when using actual regexes like \w. :-)
Ripgrep then goes one step further and tries to optimize this by searching
for a common literal among all those patterns so that it can skip using the
regex engine which is much slower. If it finds a common literal string
among the patterns it will submit it to its implementation of the Teddy
algorithm, which takes advantage of Intel’s more recent SIMD instructions
(single instruction multiple data, AKA vectorized processing). The Teddy
algorithm submits the literal to be checked against 16 bytes at a time and
if it finds a match it will go into the slower regex engine to verify. This
implementation forces our pattern files to be smaller since a pattern-file
that is too big will not have a common literal or will match too often.
Unfortunately, this analysis is wrong. The Teddy algorithm is only used
when there is a small number of common literals identified. In a large list
like the one you're using, it won't be used. (And even then, you didn't say
how you install ripgrep, which means you might not even have SIMD turned
on.)
In fact, the pattern-files have a limit that will throw an error if they
are too big because they overwhelm the regex engine.
That can be tweaked with --dfa-size-limit and --regex-size-limit.
Because of this limitation, ripgrep has to search the entire file many more
times than grep has to.
I think this conclusion does turn out to be right though. :-) Or at least,
part of the story.
Additionally, ripgrep will perform terribly if the original patternfile is
not sorted. I suspected that ripgrep was benefitting from our sorted tokens
file, and was able to find common literals, which was allowing it to be
much faster than a random tokens file.
This sounds like a bug. Nice find. :-) I created #497.
Also, because ripgrep uses a non-naive version of Boyer-Moore, it is able to
skip many more bytes.
I am starting to think that I shouldn't say that ripgrep (or the regex
engine) actually uses Boyer-Moore. The reason why is because it doesn't
actually use a shift table, which is kind of a hallmark of Boyer-Moore, and
therefore, there's no "skipping." Instead, it's fast because it picks
"rare" bytes to feed to memchr, which maximizes the amount of time search
stays in a highly optimized SIMD routine.
With that said, this certainly has no effect on the commands you're
running. Boyer-Moore is for single substring search, not multiple substring
search.
A disclaimer where I suggest using ripgrep until grep reintroduces mmap for
large files. Old versions of grep have bugs that are not documented all
that well.
Not sure why you're saying this. Are you actually seeing a huge difference
between grep with memory maps and grep without? (Using the same version of
grep!) You should be able to run rg with and without memory maps too, e.g., rg
--mmap and rg --no-mmap.
So if you know that everything is in ASCII and you know that the words in
the description are space-delimited, nothing beats KDBQ.
I think because of those assumptions, you're fundamentally solving a
different problem. :-) If you made those assumptions in a custom search
tool, then I wonder what the results would be!
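For a sense of what a custom tool built on those two assumptions (ASCII text, space-delimited words) might look like, here is a minimal sketch. It is not taken from the blog post or from KDB/Q; the function name and the use of a Python frozenset for the lookup are illustrative choices only.

```python
from typing import Iterable, Iterator, List, Tuple


def search_tokens(lines: Iterable[str],
                  tokens: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line number, matched words) for lines containing any token.

    Assumes words are separated by single spaces, so no regex engine or
    substring search is needed: each word is one hash lookup.
    """
    wanted = frozenset(tokens)
    for lineno, line in enumerate(lines, 1):
        hits = [word for word in line.rstrip("\n").split(" ") if word in wanted]
        if hits:
            yield lineno, hits
```

Usage might look like `list(search_tokens(open("input.txt"), ["cat", "dog"]))`, where input.txt is a placeholder file name. Because the work per word is a single set lookup, the cost barely depends on how many tokens are in the pattern list, which is a very different problem from general multi-pattern regex search.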
Right. You might get a small win. I've never seen memory maps blow away a well written incremental solution (as can be found in GNU grep and ripgrep). Here's an example on a ~10GB file (
So... maybe that is a bit bigger than a "small" win. :-)
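The two strategies being compared here, searching a memory-mapped file versus reading it incrementally through a fixed-size buffer, can be sketched as below. This is a simplified Python illustration with hypothetical function names; it counts matching lines for a literal and says nothing about how GNU grep or ripgrep actually implement either path.

```python
import mmap
import os


def count_lines_mmap(path: str, needle: bytes) -> int:
    """Count lines containing `needle` by scanning a memory-mapped view."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be memory mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(needle in line for line in iter(mm.readline, b""))


def count_lines_incremental(path: str, needle: bytes,
                            chunk_size: int = 1 << 16) -> int:
    """Count lines containing `needle` by reading fixed-size chunks."""
    count = 0
    tail = b""  # partial line carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            # Search only complete lines; keep the unfinished one for later.
            cut = buf.rfind(b"\n") + 1
            complete, tail = buf[:cut], buf[cut:]
            count += sum(needle in line for line in complete.split(b"\n"))
    if tail and needle in tail:
        count += 1  # final line without a trailing newline
    return count
```

Both functions should return the same count for a non-empty needle; timing them on the same file is one way to check a memory map claim empirically rather than assuming it.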
Right. This is kind of an artifact of the modern era, which is interesting. The issue is that most Boyer Moore implementations these days implement their skip loop with
I honestly don't know. But it surely has nothing to do with Teddy. Even if Teddy were running, the order of the literals shouldn't really matter. My current guess is that you're tripping over a performance bug in literal extraction. (Which is silly, because the number of literals is so high, we'll end up throwing away any extracted literals anyway.)
@pindash One other important thing to note: when I came up with the name "ripgrep," I was trying to think of short pertinent names that began with the letter R (R for Rust). I liked "rip" because of the connotation of "rip through your text." It wasn't until after I publicly mentioned the name that the alternative "rest in peace" meaning was made clear to me by others. I just hadn't realized it until someone actually said it. :-) If I had realized that meaning beforehand, I probably would have changed the name, since it's not particularly nice. On the other hand, the name "grep" isn't owned by any one group of people.
Steps to reproduce:
rg -F " 're " -o -w input.txt
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /checkout/src/libcore/option.rs:323
note: Run with `RUST_BACKTRACE=1` for a backtrace.
It appears that the combination of asking rg to return only matches (-o) and to match on word boundaries (-w) causes the return value to be empty.
Running:
rg -F "'re" -o input.txt
works well, and running with just the -w flag also works.
Also, if any other character begins the query, everything works as well.
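As a rough illustration of why a query that begins with a single quote or a space interacts oddly with word matching, the sketch below models -w as wrapping the escaped literal in \b word-boundary assertions. That model is an assumption, not something confirmed in this thread, and Python's re module stands in for the Rust regex engine. It does not reproduce the panic; it only shows where a match is or is not possible under that model.

```python
import re
from typing import List


def word_search(literal: str, text: str) -> List[str]:
    """Find `literal` in `text`, requiring a word boundary on both sides.

    Assumed model of a -w style search: word-boundary assertions are placed
    around the escaped literal. A boundary exists only between a word
    character and a non-word character (or the start/end of the text).
    """
    pattern = r"\b" + re.escape(literal) + r"\b"
    return re.findall(pattern, text)
```

Under this model, `word_search("'re", "they're here")` finds a match because the quote follows a word character, while `word_search("'re", "we 're here")` finds nothing: the quote follows a space, and two non-word characters in a row do not form a boundary.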