-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
should binary file detection be suppressed in aggregate views? #855
Comments
Could you please try to reproduce this bug on data that you can share? Otherwise, there's nothing I can do. |
replaced all the text with 'x'. you can still search for dates.
hope this helps |
This is a good one! It turns out your file contains The interesting bit here is that grep also does binary file detection (in basically the same way as ripgrep), so why the difference? Well, it looks like grep will disable its binary file skipping if it's showing counts instead of printing lines, where as ripgrep always stops searching if it finds a binary file. This becomes evident if you actually try to look at the results:
You can make this work by passing the I suspect that ripgrep should probably behave the same as grep in cases like this, however making the search inconsistent depending on whether @abhichandra21 I wonder if this was your issue too with respect to incorrect results. (You never reported the results of using the |
@minusf Your NUL bytes are on line 477 of the file you gave me:
There are |
yesss, sorry i was lazy to indeed counting in grep works even for binary files, and i had to pass it slowly but steadily the distinction between text files and binary files is washing away. if we have a 4MB file that is mostly text and emojis sprinkled with some 0 bytes here and there (cause whatapp's exporter does ...) is it a text or a binary file? is it a valid dychotomy between textness and binaryness? cause i would argue that this is a text file in the end. a file is a file is a file. we are all one 😸 |
Let's please not re-litigate binary file detection. It's not productive without simultaneously focusing on implementation details, because those details matter, a lot. Detecting Also, emojis have nothing to do with this. There is no emoji whose UTF-8 encoding contains a |
sorry for the harmless semi philosophical nosense. i'll be more than happy when |
To be clear, that bug is tracked by #306. What isn't clear to me is whether we should be matching the behavior of grep in the original case you reported with the |
as the heuristics are very simple, indeed a i don't think the while there is dedicated binary comparison utility, also my man pages says:
|
This commit attempts to surface binary filtering in a slightly more user friendly way. Namely, before, ripgrep would silently stop searching a file if it detected a NUL byte, even if it had previously printed a match. This can lead to the user quite reasonably assuming that there are no more matches, since a partial search is fairly unintuitive. (ripgrep has this behavior by default because it really wants to NOT search binary files at all, just like it doesn't search gitignored or hidden files.) With this commit, if a match has already been printed and ripgrep detects a NUL byte, then it will print a warning message indicating that the search stopped prematurely. Moreover, this commit adds a new flag, --binary, which causes ripgrep to stop filtering binary files, but in a way that still avoids dumping binary data into terminals. That is, the --binary flag makes ripgrep behave more like grep's default behavior. For files explicitly specified in a search, e.g., `rg foo some-file`, then no binary filtering is applied (just like no gitignore and no hidden file filtering is applied). Instead, ripgrep behaves as if you gave the --binary flag for all explicitly given files. This was a fairly invasive change, and potentially increases the UX complexity of ripgrep around binary files. (Before, there were two binary modes, where as now there are three.) However, ripgrep is now a bit louder with warning messages when binary file detection might otherwise be hiding potential matches, so hopefully this is a net improvement. Finally, the `-uuu` convenience now maps to `--no-ignore --hidden --binary`, since this is closer to the actualy intent of the `--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a consequence, `rg -uuu foo` should now search roughly the same number of bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the same number of bytes as `grep -ra foo`. (The "roughly" weasel word is used because grep's and ripgrep's binary file detection might differ somewhat---perhaps based on buffer sizes---which can impact exactly what is and isn't searched.) See the numerous tests in tests/binary.rs for intended behavior. Fixes #306, Fixes #855
What version of ripgrep are you using?
What operating system are you using ripgrep on?
macOS 13.13.3
Describe your question, feature request, or bug.
I have 4MB big whatsapp chat history, mostly text but lots of emojis.
Doing some manual stats, i wanted to count number of chats, etc.
The data spans 3 years, 2016-2018. Lines look like this:
Nothing special, really. Trying to count all chats:
Searching for any other pattern stops at the same line in the file
(checked with actual output instead of
-c
):For the personal nature of the corpus i am not able to include it...
But none of the surrounding lines where
rg
stops even have emojis...If this is a bug, what are the steps to reproduce the behavior?
see above
If this is a bug, what is the actual behavior?
see above
If this is a bug, what is the expected behavior?
see above the grep counts
The text was updated successfully, but these errors were encountered: