rhuffle is a random shuffler for large files with many lines, including files that exceed available RAM.
rhuffle supports:
- shuffling huge files that do not fit in memory
- skipping header lines that should be excluded from shuffling (e.g. csv/tsv)
- multiple input files and flexible input formats
- fast execution (see the benchmark results)
See lib.rs.
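For readers who want an intuition for shuffling a file larger than memory, here is a minimal Python sketch of one common approach, a two-pass bucket shuffle. It is not taken from rhuffle's source and may differ from what lib.rs does; the function `external_shuffle` and its parameters are hypothetical names chosen for illustration.

```python
import os
import random
import tempfile


def external_shuffle(src_path, dst_path, buf_bytes, rng=None):
    """Shuffle the lines of src_path into dst_path via temporary bucket files.

    Illustrative only: each bucket is *expected* (not guaranteed) to be
    about buf_bytes in size, so it can be shuffled in memory.
    """
    if rng is None:
        rng = random.Random()
    size = os.path.getsize(src_path)
    n_buckets = max(1, -(-size // buf_bytes))  # ceiling division

    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket{i}") for i in range(n_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            # Pass 1: send every line to a uniformly random bucket.
            with open(src_path, "rb") as src:
                for line in src:
                    if not line.endswith(b"\n"):
                        line += b"\n"  # normalise a missing final newline
                    buckets[rng.randrange(n_buckets)].write(line)
        finally:
            for bucket in buckets:
                bucket.close()

        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(dst_path, "wb") as dst:
            for p in paths:
                with open(p, "rb") as f:
                    lines = f.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)
```

In this sketch, `buf_bytes` plays a role similar to a buffer-size limit: a smaller value means more, smaller buckets on disk and less memory used per in-memory shuffle.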
USAGE:
rhuffle [OPTIONS]
FLAGS:
--help Prints help information
-V, --version Prints version information
OPTIONS:
-b, --buf <NUMBER>
Sets buffer size which is smaller than available RAM with bytes (default: 4294967296).
--dst <PATH>
Sets destination file path. If not set, destination sets to stdout. (default: None)
--feed <LF|LF_CRLF> Sets acceptable line feed as EOL (default: LF_CRLF).
-h, --head <NUMBER>
Sets first `n` lines without shuffling (default: 0). For multiple input sources, take README a look.
--log <off|error|warn|info|debug|trace> Sets log level. (default: off)
--src <[PATH]>
Sets source file paths (space separated). If not set, source sets to stdin. (default: None)
- For multiple input sources, the first `n` lines of the first input source are forwarded to the output without shuffling.
- For the second and later input sources, the first `n` lines of each source are skipped.
- Here is an example:
in1.txt
head1-1
head2-1
line1-1
line2-1
in2.txt
head1-2
head2-2
line1-2
line2-2
$ rhuffle --src in1.txt in2.txt --dst out.txt --head 2
out.txt
head1-1 // L1-L2: fixed
head2-1
line2-1 // L3-L6: shuffled globally
line1-2
line2-2
line1-1
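The `--head` behaviour in the example above can be modelled in a few lines of Python. This is only a sketch of the semantics on in-memory lists, not rhuffle's implementation; `shuffle_with_head` is a hypothetical helper name.

```python
import random


def shuffle_with_head(sources, head, rng=None):
    """Model of the --head semantics on in-memory data.

    sources: a list of line lists, one per input source.
    The first `head` lines of the first source stay at the top; the first
    `head` lines of every later source are dropped; the rest is shuffled.
    """
    if rng is None:
        rng = random.Random()
    fixed = sources[0][:head] if sources else []
    body = [line for src in sources for line in src[head:]]
    rng.shuffle(body)
    return fixed + body


in1 = ["head1-1", "head2-1", "line1-1", "line2-1"]
in2 = ["head1-2", "head2-2", "line1-2", "line2-2"]
out = shuffle_with_head([in1, in2], head=2, rng=random.Random(0))
print("\n".join(out))
```

The first two output lines are always `head1-1` and `head2-1`; the remaining four lines appear in a random order, and `head1-2` and `head2-2` never appear.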
- LF_CRLF (default): accepts LF or CRLF as newline
- LF: accepts only LF as newline
- There is no option for CR alone
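To make the difference between the two `--feed` modes concrete, here is a small Python sketch. It assumes that in `LF` mode a `\r` before `\n` is kept as part of the line, and that a lone `\r` never ends a line in either mode; how rhuffle treats these cases internally is an assumption, and `split_lines` is a hypothetical name.

```python
def split_lines(data: bytes, feed: str = "LF_CRLF"):
    """Split raw bytes into lines under the given line-feed mode."""
    if feed not in ("LF", "LF_CRLF"):
        raise ValueError(f"unsupported feed: {feed}")
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # a trailing newline does not start a new line
    if feed == "LF_CRLF":
        # Treat CRLF as one terminator by dropping the CR before each LF.
        lines = [ln[:-1] if ln.endswith(b"\r") else ln for ln in lines]
    return lines


sample = b"a\r\nb\nc\rd\n"
print(split_lines(sample, "LF_CRLF"))  # [b'a', b'b', b'c\rd']
print(split_lines(sample, "LF"))       # [b'a\r', b'b', b'c\rd']
```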
The results below focus on execution time under limited memory. Two datasets are used for testing.
- Kaggle competition dataset (New York City Taxi Fare Prediction)
- (self-owned) custom dataset
Three programs are compared:
- GNU shuf
  - command: `shuf {src} -o {dst}`
- terashuf
  - command: `terashuf < {src} > {dst}`
- rhuffle
  - command: `rhuffle --src {src} --dst {dst}`
Benchmarks were run on a MacBook Pro 2017 (Core i7 3.1GHz, 16GB RAM).
Execution time is measured with `time`.
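If you want to collect the same three numbers (real, user, sys) programmatically, the following Python sketch is one way to do it on Unix-like systems. It is not the script used for the results in this README; `time_command` is a hypothetical helper, and the file names in the commented usage line are placeholders.

```python
import resource
import subprocess
import time


def time_command(command):
    """Run a shell command and return (real, user, sys) in seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    # shell=True so that redirections such as `< {src} > {dst}` work.
    subprocess.run(command, shell=True, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return real, user, system


# Example (placeholder paths):
# real, user, system = time_command("rhuffle --src in.txt --dst out.txt")
```

The user and sys values come from the accumulated resource usage of child processes, so they are only meaningful if no other child process finishes during the measurement.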
5.3GB size, 55423856 lines
Software | real | user | sys |
---|---|---|---|
GNU shuf | 0m59s | 0m34s | 0m14s |
terashuf | 5m06s | 4m43s | 0m14s |
rhuffle | 1m56s | 1m06s | 0m40s |
9.0GB size, 21550072 lines
Software | real | user | sys |
---|---|---|---|
GNU shuf | x | x | x |
terashuf | 8m12s | 7m16s | 0m31s |
rhuffle | 1m47s | 0m39s | 0m51s |
GNU shuf could not be measured on this dataset because it was too slow.
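To compare the tools at a glance, the real times from the two tables can be converted to seconds and expressed relative to rhuffle. The Python sketch below only re-uses the numbers reported above; `to_seconds` is a hypothetical helper.

```python
import re


def to_seconds(text):
    """Convert a duration such as '5m06s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+)s", text)
    if m is None:
        raise ValueError(f"unexpected duration: {text}")
    return int(m.group(1)) * 60 + int(m.group(2))


# Real (wall-clock) times copied from the tables above.
real_times = {
    "5.3GB": {"GNU shuf": "0m59s", "terashuf": "5m06s", "rhuffle": "1m56s"},
    "9.0GB": {"terashuf": "8m12s", "rhuffle": "1m47s"},
}

for dataset, times in real_times.items():
    base = to_seconds(times["rhuffle"])
    for tool, t in times.items():
        ratio = to_seconds(t) / base
        print(f"{dataset} {tool}: {ratio:.2f}x rhuffle's real time")
```

On the 5.3GB dataset, GNU shuf took about half of rhuffle's real time and terashuf about 2.6 times as long; on the 9.0GB dataset, terashuf took about 4.6 times as long as rhuffle.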