This is a fast, concurrent deduplication tool that removes duplicate lines from a text file. It uses multiple CPU cores while keeping the memory footprint low.
Note: Go 1.9+ is required because of sync.Map.
go get github.com/OneOfOne/xxhash
- Unique lines will be written to disk
- Optional: Duplicate lines will be written to disk
- A non-cryptographic hash (xxhash) is used, for speed close to memory bandwidth
- Low memory usage, because only a hash of each line is kept in the hashmap for lookup
- Uses all cores of a system and is optimized for 16 cores and more
- Performance and RAM usage scale linearly with input size
- Super fast (when you come from Python or JavaScript)
- Producer-Consumer pattern used to implement concurrency
Memory usage is always 4 bytes per unique line in the file, that is, len(set(lines)) * 4 bytes.
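
As a quick worked example of the formula above, the snippet below multiplies a line count by the 4-byte figure. The function name `estimateHashBytes` and the input of 100 million unique lines are hypothetical, and the result covers only the stored hashes as stated above; any bookkeeping overhead of the map itself is not included.

```go
package main

import "fmt"

// bytesPerHash is the size of one stored hash, taken from the
// 4-bytes-per-unique-line figure stated above.
const bytesPerHash = 4

// estimateHashBytes (hypothetical helper) returns the memory needed for
// the hashes alone; map overhead is deliberately not modelled.
func estimateHashBytes(uniqueLines int64) int64 {
	return uniqueLines * bytesPerHash
}

func main() {
	n := int64(100000000) // hypothetical input: 100 million unique lines
	b := estimateHashBytes(n)
	// Prints: 100000000 unique lines -> 400000000 bytes (400.0 MB)
	fmt.Printf("%d unique lines -> %d bytes (%.1f MB)\n", n, b, float64(b)/1e6)
}
```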