how things are decompressed should be explained #15

Open
matu3ba opened this issue Jul 7, 2021 · 5 comments
Labels
documentation Improvements or additions to documentation

Comments

@matu3ba

matu3ba commented Jul 7, 2021

"Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to ripgrep)"

This is very vague for new users, and cross-checking how ripgrep does it is annoying. Please provide a more accurate description and/or link to a description of the functionality.

Likely related: BurntSushi/ripgrep#539

@sstadick
Owner

sstadick commented Jul 7, 2021

You are correct, I totally punted on writing the docs for that and enumerating what tools are expected with each file extension.

I want to add the same preprocessor logic that ripgrep has as well.

I'll add some better docs around that in the near future.

Link for future me to what formats and binaries are paired together:
https://github.com/BurntSushi/ripgrep/blob/9eddb71b8e86a04d7048b920b9b50a2e97068d03/crates/cli/src/decompress.rs#L468
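
For readers who want a concrete picture of the mechanism being discussed, here is a minimal Rust sketch of "pick a decompression binary by file extension, and fall back to reading the file as-is if that fails". The extension/command pairs and the function names `decompress_command` and `open_maybe_decompressed` are assumptions made for illustration; they are not copied from hck or ripgrep.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;
use std::process::{Command, Stdio};

// Flags that make the tool decompress to stdout (assumed table, see above).
const STDOUT_ARGS: &[&str] = &["-d", "-c"];
const ZSTD_ARGS: &[&str] = &["-q", "-d", "-c"];

/// Return the binary and arguments for `path`, or `None` if the extension
/// is not recognized.
pub fn decompress_command(path: &Path) -> Option<(&'static str, &'static [&'static str])> {
    let ext = path.extension()?.to_str()?;
    match ext {
        "gz" => Some(("gzip", STDOUT_ARGS)),
        "bz2" => Some(("bzip2", STDOUT_ARGS)),
        "xz" => Some(("xz", STDOUT_ARGS)),
        "lz4" => Some(("lz4", STDOUT_ARGS)),
        "zst" => Some(("zstd", ZSTD_ARGS)),
        _ => None,
    }
}

/// Open `path` for reading, piping it through a decompressor when one is
/// known and can be spawned; otherwise read the file directly.
pub fn open_maybe_decompressed(path: &Path) -> io::Result<Box<dyn Read>> {
    if let Some((bin, args)) = decompress_command(path) {
        let spawned = Command::new(bin)
            .args(args)
            .arg(path)
            .stdin(Stdio::null())
            .stdout(Stdio::piped())
            .stderr(Stdio::null())
            .spawn();
        // `spawn` returns an error when the binary is not installed; in that
        // case we fall through to the plain file below.
        if let Ok(mut child) = spawned {
            if let Some(stdout) = child.stdout.take() {
                // Simplification: the child process is never waited on.
                return Ok(Box::new(stdout));
            }
        }
    }
    Ok(Box::new(File::open(path)?))
}
```

With this sketch, `decompress_command(Path::new("data.tsv.gz"))` yields the `gzip` pair, while a path with an unrecognized extension yields `None` and is read directly.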

@sstadick sstadick added the documentation Improvements or additions to documentation label Jul 7, 2021
@sstadick
Owner

sstadick commented Jul 8, 2021

@matu3ba
Author

matu3ba commented Jul 8, 2021

@sstadick cool stuff and thanks for the quick reply. This application has a well deserved place + use cases.

Some feedback on the TODOs, since they are somewhat related to things that are undocumented or not yet planned:

  1. Personally I would advise against "Bake in grep / filtering somehow?", since that sounds like ending up with a lot of the use cases of xsv, which is more general (and ideally would also have mmapped/in-place decompression as a feature).
  2. "Add preprocessor / pigz support" => change to "parallel decompression". What do you mean by "preprocessor"? Reuse as a library?
  3. "Implement parallel parser" => As far as I understand it, you read a file chunk up to an EOL and can create per-thread work lists from the file offsets: 1 file offset to the start of a line, 1 to an EOL. The stuff in between can be split and processed with SIMD on the respective thread. Why do you believe that keeping arrays and indexes and working with 1 thread and SIMD is not more efficient? You only need to work on indexes and do the lookup (you don't modify data), or what am I missing?
  4. "we don't care about escaping quotes and such" => Does this mean that escaping can have an effect? You might want to be explicit that users must not feed stupid regexes that break the EOL (\r or \r\n or \n) in any way.
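
The work-list idea in point 3 (byte offsets that start at a line and end at an EOL) could look roughly like the Rust sketch below. The names `line_aligned_chunks` and `count_delims_parallel`, and the one-thread-per-chunk layout, are assumptions made for illustration; this is not hck's code and it does not use SIMD.

```rust
use std::thread;

/// Split `buf` into byte ranges of at least `target` bytes that each end
/// just after a `\n` (the final range may end at the end of the buffer).
pub fn line_aligned_chunks(buf: &[u8], target: usize) -> Vec<(usize, usize)> {
    let target = target.max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < buf.len() {
        // Tentative end of the chunk, then extend forward to the next newline.
        let tentative = (start + target).min(buf.len());
        let end = match buf[tentative - 1..].iter().position(|&b| b == b'\n') {
            Some(i) => tentative + i,
            None => buf.len(),
        };
        chunks.push((start, end));
        start = end;
    }
    chunks
}

/// Count delimiter bytes using one scoped thread per chunk. The data is only
/// read, never modified, so the threads can share `buf` without copying.
pub fn count_delims_parallel(buf: &[u8], delim: u8, target: usize) -> usize {
    let chunks = line_aligned_chunks(buf, target);
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|&(start, end)| {
                let chunk = &buf[start..end];
                s.spawn(move || chunk.iter().filter(|&&b| b == delim).count())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<usize>()
    })
}
```

For the 12-byte input `a\tb\nc\td\ne\tf\n` and a target of 5 bytes, the sketch produces the ranges `(0, 8)` and `(8, 12)`, so no line is ever split between two workers.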

@sstadick
Owner

sstadick commented Jul 8, 2021

Thanks for the reply and interest!

  1. I tend to agree and am still thinking about that one. It starts to get into the territory of awk as well (btw you should check out frawk). It's not infrequent that I am parsing a huge file and want to filter based on a single column but also do some reordering etc. There would be a lot of work to make this feature both fast and ergonomic. It's not in danger of being added any time soon.

  2. A pre-processor like this. Admittedly not terribly useful for hck, but I do want to find a way to make use of pigz if it is found instead of gzip.

  3. Good question, it will be finicky and possibly not worth it. Running frawk with parallelize and llvm enabled, on macOS at least, is still faster than hck. It implements the parallel algorithm in the linked paper (fig4). In an earlier version of hck I did exactly what you described: read a chunk up to a guaranteed line ending on one thread, send it to another thread to be parsed, then send it to a third thread to be written. It wasn't ideal and a lot of time was wasted allocating buffers because I couldn't work out a nice way to share them. Long story short, based on what I've tried so far it is unlikely that parallel will be faster, I just find that paper interesting.

  4. That statement is in regard to the paper / frawk, which care a lot about figuring out quoting for valid csv / tsv records, whereas hck takes the same line as cut and just splits on the delimiter no matter what. As of now hck actually handles the user giving \n line endings as delimiters gracefully.
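
To make point 2 above concrete, "use pigz if it is found instead of gzip" could be sketched in Rust as a lookup over the directories of a `PATH`-style variable. The helper names `find_in` and `pick_gzip_binary` are hypothetical, and the check is simplified: it only tests that a file with that name exists, not that it is executable, and it assumes Unix-style names without an `.exe` suffix.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Look for a file named `name` in the directories listed in `path_var`
/// (same format as the `PATH` environment variable).
pub fn find_in(path_var: &OsStr, name: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

/// Prefer `pigz` when it is available, otherwise fall back to `gzip`.
/// Both tools accept `-d -c` to decompress to stdout.
pub fn pick_gzip_binary(path_var: &OsStr) -> Option<PathBuf> {
    ["pigz", "gzip"]
        .iter()
        .find_map(|name| find_in(path_var, name))
}
```

A caller would typically pass the real search path, for example `env::var_os("PATH").and_then(|p| pick_gzip_binary(&p))`, and treat `None` as "no gzip-compatible binary installed".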

@matu3ba
Author

matu3ba commented Jul 8, 2021

> It wasn't ideal and a lot of time was wasted allocating buffers because I couldn't work out a nice way to share them.

Usually, guarding sufficiently sized chunks with atomics for access, plus an atomic access-state enum, should be enough.
With a ring buffer where only the main thread "reserves and frees memory", you can also easily restrict maximum memory and, if the threads are not done with their work yet, wait and poll the last one again.
The main thread can "allocate" in chunks of the same size (the buffers are user-configured and allocated at program startup anyway). Since each worker thread owns its output buffer (it just takes the stuff to have work), the main thread may only need to check the latest elements "it allocated", creating no memory reorderings for the other threads.

The only things this does not solve are necessary priority increases (via the scheduler) for CPU cores/threads that are not able to run, and threads finishing earlier than others.

Did you try 2 static buffers (ringbuffers) [1 for input, 1 for output] for the memory allocation (by number of threads) yet?
I hope my explanation is not too chaotic. The idea is loosely related to this, but with a "statically allocated" ringbuffer and detaching all threads after spawn.
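
One way to picture the "allocate everything at startup and only pass buffers around" idea is the Rust sketch below. Note the swapped-in technique: it uses two `std::sync::mpsc` channels as a free list and a work queue instead of the atomics-guarded ring buffer described above, and it uses a single worker thread. The function name `count_delims_with_pool` and its parameters are assumptions made for illustration.

```rust
use std::io::{self, BufRead};
use std::sync::mpsc;
use std::thread;

/// Count `delim` bytes in `input` with a fixed pool of `n_bufs` buffers.
/// Buffers are allocated once and then only circulate between the reader
/// (this thread) and the worker, which bounds memory use.
pub fn count_delims_with_pool<R: BufRead>(
    mut input: R,
    delim: u8,
    n_bufs: usize,
    buf_cap: usize,
) -> io::Result<usize> {
    let buf_cap = buf_cap.max(1);
    let (free_tx, free_rx) = mpsc::channel::<Vec<u8>>();
    let (full_tx, full_rx) = mpsc::channel::<Vec<u8>>();
    for _ in 0..n_bufs.max(1) {
        free_tx
            .send(Vec::with_capacity(buf_cap))
            .expect("receiver is alive");
    }

    let worker = thread::spawn(move || {
        let mut total = 0usize;
        for mut buf in full_rx {
            total += buf.iter().filter(|&&b| b == delim).count();
            buf.clear(); // keeps the allocation for reuse
            if free_tx.send(buf).is_err() {
                break;
            }
        }
        total
    });

    loop {
        // Blocks until the worker hands a buffer back.
        let mut buf = match free_rx.recv() {
            Ok(buf) => buf,
            Err(_) => break,
        };
        let mut eof = false;
        // Fill with whole lines, so a chunk always ends at a line ending
        // (a very long line may grow the buffer past `buf_cap`).
        while buf.len() < buf_cap {
            if input.read_until(b'\n', &mut buf)? == 0 {
                eof = true;
                break;
            }
        }
        if !buf.is_empty() && full_tx.send(buf).is_err() {
            break;
        }
        if eof {
            break;
        }
    }
    drop(full_tx); // lets the worker's loop finish
    Ok(worker.join().expect("worker panicked"))
}
```

Because the reader blocks on `free_rx.recv()` whenever all buffers are in flight, memory stays capped at roughly `n_bufs * buf_cap` bytes without any allocation in the hot loop, which is the property the ring-buffer proposal is after.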

> That statement is in regard to the paper / frawk, which care a lot about figuring out quoting for valid csv / tsv records, whereas hck takes the same line as cut and just splits on the delimiter no matter what. As of now hck actually handles the user giving \n line endings as delimiters gracefully.

You might want to specify this property in the Features then. Ideally as Non-goals (performance costs, complexity, usability/generality for non-csv data) or as "Currently stuff works like this, but this might be changed to support xyz."
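
To illustrate the property in question, "just splits on the delimiter no matter what" behaves like the Rust sketch below. The function `cut_fields` is a hypothetical stand-in rather than hck's implementation; it assumes a non-empty literal delimiter and 0-based field indices.

```rust
/// Select fields from `line` by splitting on `delim` with no awareness of
/// quoting or escaping. Fields are returned in the order requested, and
/// indices past the end of the line are skipped.
pub fn cut_fields<'a>(line: &'a str, delim: &str, fields: &[usize]) -> Vec<&'a str> {
    let parts: Vec<&'a str> = line.split(delim).collect();
    fields
        .iter()
        .filter_map(|&i| parts.get(i).copied())
        .collect()
}
```

For the CSV-looking line `a,"b,c",d`, asking for fields 1 and 2 returns `"b` and `c"`: the comma inside the quotes is treated as a delimiter like any other, which is exactly the behavior worth documenting as a non-goal.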
