Consider supporting tar file #1112

upsuper · 2018-11-17T20:48:42Z

Opening this issue as suggested by @BurntSushi in #1111.

I think it would be helpful if ripgrep could support scanning archive files. Scanning tar files can roughly be done via the -a option, but there are downsides:

you need to extract all files from the archive and scan again to identify the files you want
the line number is unhelpfully large
it may be wasting time on scanning files that we don't actually need

It is unclear how scanning archive files should work. It should definitely be opt-in just like --search-zip. We can probably add --search-archive or --search-tar for this purpose. I guess ideally archive files should probably be handled like directory rather than single file in that mode.

That should lead to something like

path/to/archive.tar/file1
something matched

path/to/archive.tar/file2
something else matched
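
For illustration only, the per-member behaviour described above can be sketched outside ripgrep with Python's standard `tarfile` module. The function name `search_tar` and the `archive/member` display format are assumptions made for this sketch, not anything ripgrep provides. Each member is searched on its own, so line numbers restart at 1 for every file and each match carries the member's path.

```python
import re
import tarfile


def search_tar(archive_path, pattern, fileobj=None):
    """Yield (display_path, line_number, line) for every matching line.

    `search_tar` is a hypothetical helper. Members are searched one at a
    time, so line numbers are relative to each member, not to the archive.
    """
    regex = re.compile(pattern)
    with tarfile.open(name=archive_path, fileobj=fileobj) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, devices
            extracted = tar.extractfile(member)
            for number, raw in enumerate(extracted, start=1):
                line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                if regex.search(line):
                    # Report the member as if the archive were a directory.
                    yield f"{archive_path}/{member.name}", number, line
```

A caller would loop over `search_tar("path/to/archive.tar", "matched")` and print each tuple. Unlike `-a` on the raw tar bytes, nothing has to be extracted to disk to learn which member matched.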

@BurntSushi what do you think?


BurntSushi · 2018-11-17T20:59:28Z

Thanks for filing the ticket! I think this needs to be fleshed out quite a bit more though.

I guess ideally archive files should probably be handled like directory rather than single file in that mode.

ripgrep has fairly sophisticated filtering support. How does that interact with treating tar files as if they were a directory? e.g., If there is an applicable *.txt exclusion filter in play and a tar archive contains foo.txt, should that file also be ignored? If so, which filters are applied to it and how does that work? Personally, I suspect this will be a significant complication.

upsuper · 2018-11-17T21:16:21Z

For the case you mentioned, it should probably ignore foo.txt but still scan the other files in the archive, just as we do for a directory.

As for how that would work: it seems this is currently done via the ignore crate, but integrating archive support into that crate is probably not an option, because we would need decompression support there as well. Maybe we can change how ignore works, so that ripgrep walks the directories itself and feeds each path to ignore to ask whether a directory should be traversed further and whether a file should be scanned. This would effectively split ignore into two crates, one for walking the file system and another for checking file inclusion. It would be a non-trivial refactor. Alternatively, we could probably add some new API to ignore for querying this kind of information.
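
As a rough illustration of that split, here is a minimal sketch in Python rather than Rust. It does not reflect the ignore crate's real API or its gitignore semantics: `should_search` and `candidate_paths` are hypothetical names, and the filter is a plain basename glob. The point is only that one inclusion check can serve both real files and archive members once walking is separated from filtering.

```python
import fnmatch
import os
import tarfile


def should_search(path, exclude_globs):
    """Inclusion check (hypothetical): reject a path if its final
    component matches any exclusion glob."""
    name = path.rsplit("/", 1)[-1]
    return not any(fnmatch.fnmatchcase(name, glob) for glob in exclude_globs)


def candidate_paths(root, exclude_globs):
    """Walker (hypothetical): yield paths to search under `root`,
    descending into tar files as if they were directories."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename).replace(os.sep, "/")
            if not should_search(path, exclude_globs):
                continue
            if tarfile.is_tarfile(path):
                with tarfile.open(path) as tar:
                    for member in tar.getmembers():
                        virtual = f"{path}/{member.name}"
                        # The same check applies to members inside the archive.
                        if member.isfile() and should_search(virtual, exclude_globs):
                            yield virtual
            else:
                yield path
```

With `exclude_globs=["*.txt"]`, a `foo.txt` inside a tar file is skipped exactly like a `foo.txt` on disk, while the archive's other members are still yielded. Which real filters (per-directory ignore files, type filters, overrides) should apply inside an archive remains the open question from the comment above.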

BurntSushi · 2018-11-17T21:38:04Z

Right. This is probably blocked on refactoring and cleaning up the ignore crate in general. I don't expect that to happen any time soon, and I haven't had the time to write down enough of my thoughts on that for others to see.

I'll leave this open for now, but I don't think this is going to happen any time soon. Moreover, I am still not quite convinced that this feature belongs in ripgrep. It's definitely something that folks have requested before though.

mateon1 · 2019-07-18T08:54:20Z

I suggest adding a generic --search-archive flag, so this can support different types of archives (like zip files) in the future.

This would be really helpful for me, as I often need to grep through a massive corpus of text, hundreds of gigabytes in size. If I use ripgrep on the unpacked corpus, I get very useful information like filenames of the files that matched, and real line numbers. Unfortunately, because of spinning disk random access times this search usually takes over 6 hours to process the whole corpus.
If I instead search the compressed version of the corpus by piping the output of unzip -p '*'.zip, the search only takes about an hour, but I lose information like which file a specific matching line came from, which sometimes forces me to repeat the search (or do a more specific search) on the uncompressed corpus later.

eadmaster · 2021-05-05T08:20:50Z

I've just found this wrapper that adds some compressed archives support.

Older alternatives I know are zipgrep and zzfind.

Btw, it is not clear to me how the current --search-zip flag is supposed to work, can you provide some examples?

BurntSushi · 2021-05-05T11:35:45Z

@eadmaster It's pretty simple. You give ripgrep compressed files and it searches them:

$ echo 'foo bar quux' > /tmp/haystack
$ rg bar /tmp/haystack
1:foo bar quux
$ gzip /tmp/haystack
$ rg bar /tmp/haystack.gz
$ rg -z bar /tmp/haystack.gz
1:foo bar quux

It sounds like you're trying to give ripgrep compressed archive files. ripgrep doesn't support iterating over archives, even when they're uncompressed. It's orthogonal to --search-zip.
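
The distinction can be shown with Python's standard library (a sketch for illustration, unrelated to how ripgrep implements --search-zip; `build_tar` and `member_names` are helper names invented here). A compression format wraps a single byte stream, whereas an archive format is a container of named members, and a .tar.gz stacks one on top of the other.

```python
import gzip
import io
import tarfile


def build_tar(files):
    """Return the bytes of an uncompressed tar archive holding `files`
    (a dict of member name -> bytes). Hypothetical helper."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, body in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    return buf.getvalue()


def member_names(tar_bytes):
    """List the member paths stored in a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        return tar.getnames()


# Compression: one byte stream in, one byte stream out.
haystack = gzip.decompress(gzip.compress(b"foo bar quux\n"))

# Archiving: a container of named members, with no compression involved.
archive = build_tar({"a.txt": b"foo\n", "b.txt": b"bar\n"})

# A .tar.gz stacks the two layers. Undoing only the compression layer
# leaves raw tar bytes: headers and contents of every member in one stream.
raw = gzip.decompress(gzip.compress(archive))
```

Searching `raw` as a single stream is roughly what combining -z with -a amounts to, which is why matches cannot be attributed to a member. Attributing them requires the second step, iterating the members as `member_names(raw)` does.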

eadmaster · 2021-05-05T13:51:20Z

I see. Then the name search-zip is a bit ambiguous to me. Why not change it to search-gz or search-gzip?

BurntSushi · 2021-05-05T13:58:16Z

Because the flag name is used in other tools, changing it causes churn, and gzip is not the only supported compression format.

knutwannheden · 2022-03-17T16:08:05Z

Support for archives would be really nice!

Apart from being able to search the occasional downloaded .zip and .tar.gz archive, I also have another use case: I was surprised that ripgrep appeared to be so slow on a VPS hosted by a public cloud provider. It turns out that they use Ceph and (at least the way it is configured) I get good throughput but high latencies when reading files (ephemeral disks are not available). In my case I have over a million files I want to search and it just takes forever. So I created a compressed .tar.gz archive, which I can now search using the -a and -z options, which is a lot faster. The downside is that ripgrep can't tell me which file within the archive contains the match. This is admittedly a strange use case, but IMHO archive support would still be very nice to have.

jumarko · 2022-05-25T03:12:21Z

I would love to see this implemented.
I also agree with @eadmaster that search-zip option name is confusing (#1112 (comment))

BurntSushi · 2022-05-25T12:25:06Z

The flag name isn't changing.

More comments saying "I want this" aren't helpful. I have questions about how to implement this. What would be helpful is if folks could dive into answering those questions. See my previous comments.

mateon1 · 2022-05-25T22:58:03Z

My view of how the implementation should roughly look:

ripgrep, with an appropriate cli flag should be able to recurse into and open archive files as a pseudo-directory, and search each file within the archive as usual (semantically, almost exactly as if the archive was a subdirectory you entered). Implementation-level, I believe you could define an Archive trait which you could implement for tar files, zip files, and whatever else people want to implement (rar, 7z, zpaq, whatever you can find a crate for - but most behind a feature flag most likely).
The Archive trait would provide methods for traversing the directories and providing whatever file reading methods ripgrep needs. Some hint methods could be useful to indicate that e.g. tar files cannot be reasonably traversed in parallel, only sequentially, while most other archive formats let you do random access to select a file.
ripgrep can report the matching file as path/to/archive.zip//archived/path.txt (exact format open to bikeshedding, maybe :: as separator? I think I've seen antivirus software report files in archives this way)
ripgrep SHOULD be able to recurse into nested archives, possibly limited by some flag.

Note: I have very little understanding of how ripgrep is internally architected, maybe the Archive trait idea won't work, but that's the first thing that comes to mind for how to implement this.
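
To make the shape of that idea concrete, here is a minimal sketch using a Python abstract base class as a stand-in for the proposed Rust trait. Everything in it is an assumption for illustration: the class and method names, the `random_access` hint, and the `//` separator are not part of ripgrep, and the search is a plain substring test.

```python
import tarfile
import zipfile
from abc import ABC, abstractmethod


class Archive(ABC):
    """Hypothetical analogue of the proposed `Archive` trait."""

    # Hint: can members be opened independently (and so in parallel)?
    random_access = False

    @abstractmethod
    def entries(self):
        """Yield (member_path, content_bytes) for each regular file."""


class TarArchive(Archive):
    random_access = False  # tar has no index, so members are read in order

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def entries(self):
        with tarfile.open(fileobj=self._fileobj) as tar:
            for member in tar:
                if member.isfile():
                    yield member.name, tar.extractfile(member).read()


class ZipArchive(Archive):
    random_access = True  # the central directory locates each member

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def entries(self):
        with zipfile.ZipFile(self._fileobj) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)


def search_archive(display_path, archive, needle, separator="//"):
    """Return 'archive//member:line_number:line' for lines containing `needle`."""
    results = []
    for member_path, data in archive.entries():
        text = data.decode("utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                results.append(f"{display_path}{separator}{member_path}:{number}:{line}")
    return results
```

The searching code only depends on `entries()`, so supporting another format would mean adding one more subclass. A scheduler could consult `random_access` to decide whether members of an archive may be handed to several workers or must be read sequentially.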

ujay68 · 2023-11-30T11:07:04Z

I would also like to vote for such a feature. (Recursively searching through a directory structure that contains archive formats is really painful ATM.) Some additional ideas:

Could also support Zip-based file formats. Lots of these with different endings: Java (.jar, .war, .ear), Microsoft (.docx, .xlsx & Co.), LibreOffice (.odt & Co.) …
If a specific command-line option is set, I would suggest treating such archive files as if they were extracted at their location, i.e., a file ./d1/d2/a.zip would be treated, for globbing and filtering purposes, like a directory ./d1/d2/a.zip/ containing the contents of a.zip. (Java URIs use the special character ! for that, i.e., paths like d1/d2/a.zip!a.txt.)
Second mateon1's idea of allowing recursion into nested archives (which one has, eg, with .war and .ear).
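
A small sketch of recursion into nested ZIP-based archives, under stated assumptions: `walk_zip` is a hypothetical helper, the depth limit stands in for the "limited by some flag" idea, nested containers are detected by content via `zipfile.is_zipfile` rather than by extension (so .jar, .war, .docx and similar would qualify), and `!` is used as the separator following the Java convention mentioned above.

```python
import io
import zipfile


def walk_zip(data, prefix, max_depth, separator="!"):
    """Yield (virtual_path, content_bytes) for files in a ZIP archive,
    descending into nested ZIP archives.

    `max_depth` is how many further levels of nesting may be opened;
    at 0, a nested archive is yielded as an ordinary file.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = f"{prefix}{separator}{info.filename}"
            body = zf.read(info)
            # Sniff the content instead of trusting the file extension.
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(body)):
                yield from walk_zip(body, path, max_depth - 1, separator)
            else:
                yield path, body
```

For an .ear file containing an app.war, a depth of 1 would produce virtual paths such as `d1/d2/a.ear!app.war!WEB-INF/web.xml`, which could then be fed to the same glob and filter checks as ordinary paths. Reading each nested archive fully into memory is a simplification that a real implementation would likely want to avoid.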

upsuper mentioned this issue Nov 17, 2018

Attempt to add tar support #1111

Closed

BurntSushi added enhancement An enhancement to the functionality of the software. question An issue that is lacking clarity on one or more points. icebox A feature that is recognized as possibly desirable, but is unlikely to implemented any time soon. labels Nov 17, 2018

BurntSushi mentioned this issue Mar 21, 2019

add decompression support on Unix systems #539

Closed

BurntSushi mentioned this issue Feb 7, 2020

searching plaintext files inside zip archives #1479

Closed

BurntSushi mentioned this issue May 24, 2022

search-zip: include .jar files #2136

Closed

orf mentioned this issue Mar 9, 2023

Support searching structured data #2446

Closed
