# GoPar3: File Armor for the Paranoid, Beta

[Go Reference](https://pkg.go.dev/github.com/dkotik/gopar3) · [Tests](https://github.com/dkotik/gopar3/actions?query=workflow:test) · [Coverage](https://coveralls.io/github/dkotik/gopar3) · [Go Report Card](https://goreportcard.com/report/github.com/dkotik/gopar3) · ![Zero AI](https://img.shields.io/badge/Zero-AI-0099ff?style=flat-square&logo=rootme&logoColor=white&labelColor=004018)

Protect files from partial loss or corruption.

## Usage

```sh
# Installation:
go install github.com/dkotik/gopar3/cmd/gopar3@latest

# Split a file into multiple shards:
gopar3 split --redundancy=80% myfile

# Inflate a file, keeping all the shards together:
gopar3 inflate --redundancy=80% myfile

# Restore the original file from shards in the current folder:
gopar3 restore *.gopar3
```

## Why do you need GoPar3?

Around the year 2000, the Pararchive program became popular for protecting files from data degradation in storage or during network transfer, particularly when downloading files from Usenet. It increased file robustness by adding parity data to the file and splitting it into shards. As long as you could gather a certain number of shards, you could restore the original file.

As network and storage reliability increased, Pararchive fell out of favor. Almost every common device, network protocol, or backup utility includes error detection and correction in some form today.

You need GoPar3 if you are paranoid about preserving certain files through the worst possible external damage and hardware failures. GoPar3 adds two more layers of protection that increase the chances of file recovery:

1. Telomeres: repeated-byte padding sequences that surround data shards and protect the boundaries between them.
2. Polynomial Reed-Solomon recovery codes: additional data that allows restoration of the original file as long as a quorum of data shards is available (see the sketch below).
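
For intuition, here is a minimal sketch of the quorum idea using the github.com/klauspost/reedsolomon package. It only illustrates the recovery principle; the shard counts are arbitrary, and it does not mirror GoPar3's internal encoder:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 4 data shards form the quorum; 3 parity shards add redundancy,
	// so any 4 of the 7 shards can rebuild the original payload.
	enc, err := reedsolomon.New(4, 3)
	if err != nil {
		log.Fatal(err)
	}

	original := []byte("protect this payload from partial loss or corruption")

	// Split pads the payload, cuts it into 4 data shards,
	// and allocates 3 empty parity shards.
	shards, err := enc.Split(original)
	if err != nil {
		log.Fatal(err)
	}
	if err = enc.Encode(shards); err != nil { // fill the parity shards
		log.Fatal(err)
	}

	// Simulate losing three shards: a nil shard marks it as missing.
	shards[0], shards[2], shards[5] = nil, nil, nil

	// Rebuild the missing shards from the surviving quorum.
	if err = enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	var restored bytes.Buffer
	if err = enc.Join(&restored, shards, len(original)); err != nil {
		log.Fatal(err)
	}
	fmt.Println(restored.String())
}
```

Any four of the seven shards are enough to rebuild the payload here; GoPar3 applies the same principle per batch, recording the quorum in each shard's metadata (see Shard Format below).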

## Development Roadmap

- Allow os.Stdin streaming. Capture data content type and checksum while streaming to a temporary file. Then, run the split, inflate, or scatter command as normal.
- Restore should recover all possible files rather than the first one, unless an index is provided.
- Make sure small shard sizes can accommodate the maximum possible number of blocks for large source files before starting encoding.
- Add "repair shard" command that will flip bytes until checksum passes?
- Reap ideas from https://jacobfilipp.com/arvid-vhs/
  - As best as I could figure out, data was stored to tape using Non-Return-to-Zero encoding (according to some Fido7 comments). On the tape itself, files were recorded in sections separated by 5-second blank intervals.
  - Home-use VHS tapes had worse quality than commercial-grade magnetic tape, so the makers took extra measures to detect and fix errors on tape. (already doing some of this)
  - ArVid read and wrote data using an error correction algorithm called “Reed-Solomon with Interleaving” (I also came across mentions of a Galois algorithm). They claimed that this let the ArVid software correct up to 3 defective bytes in a code group, and a loss of up to 450 consecutive bytes could be corrected. After reading data from tape, the software performed a CRC32 check for errors, operating on every 512-byte block.

## Shard Format

GoPar3 uses a telomere encoder to guard block boundaries. Telomeres are runs of the ":" padding character, eleven bytes long by default, that surround each data block; occurrences of ":" and "\" within block data are escaped with "\". The telomere encoder helps preserve block boundaries in severely damaged files: even if some blocks are thrown out of alignment by shortening, they can be isolated from healthy blocks and partially recovered. Each shard is laid out as follows (a sketch of the escaping scheme appears after the table):

| Telomeres | Checksum 1 | Checksum 2 | Size    | Quorum | Order  | Batch   | Data... | Telomeres |
| --------- | ---------- | ---------- | ------- | ------ | ------ | ------- | ------- | --------- |
| 11 bytes  | 4 bytes    | 4 bytes    | 8 bytes | 1 byte | 1 byte | 2 bytes | ...     | 11 bytes  |
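
For illustration, here is a sketch of the escaping scheme described above; wrapTelomere is a hypothetical helper, not GoPar3's actual encoder API:

```go
package main

import (
	"fmt"
	"strings"
)

// wrapTelomere escapes ":" and "\" inside the payload with "\" and
// surrounds the result with telomere padding, so that block boundaries
// can still be located in a damaged stream.
func wrapTelomere(payload []byte, telomereLength int) string {
	telomere := strings.Repeat(":", telomereLength)
	escaped := make([]byte, 0, len(payload)*2)
	for _, b := range payload {
		if b == ':' || b == '\\' {
			escaped = append(escaped, '\\')
		}
		escaped = append(escaped, b)
	}
	return telomere + string(escaped) + telomere
}

func main() {
	fmt.Println(wrapTelomere([]byte(`size 1:2 in C:\data`), 11))
	// Output: :::::::::::size 1\:2 in C\:\\data:::::::::::
}
```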

The checksums are Castagnoli 32-bit cyclic redundancy checks (CRC-32C). The first one validates all the shard data that follows it until the next telomere boundary. The second validates the original source file; it is the same for all shards created from the same source. Likewise, size is the size in bytes of the original file.
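
In Go, a Castagnoli CRC-32 can be computed with the standard hash/crc32 package; this is a minimal sketch of the checksum itself, not GoPar3's exact wiring:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

func main() {
	// The Castagnoli polynomial is provided by the standard library.
	table := crc32.MakeTable(crc32.Castagnoli)
	sum := crc32.Checksum([]byte("shard payload bytes"), table)
	fmt.Printf("%08x\n", sum) // a 4-byte checksum, as in the layout above
}
```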

Quorum is the number of whole shards required to restore any missing ones. Order is the shard's position within its batch. Batch is the serial number of the batch. The total number of batches is:

```
math.Ceil(fileSize / (shardSize * quorum))
```
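
For example, with hypothetical numbers: a 1 GiB source file encoded with 1 MiB shards and a quorum of 10 yields ceil(1 GiB / 10 MiB) = 103 batches, each holding 10 data shards plus their parity shards.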

All the shard metadata fits in about 20 bytes per shard, yet it allows file restoration from a shuffled set of shards pulled from multiple files, as long as the quorum of each batch is intact. The size field is necessary for calculating the padding bytes that will be dropped from the final batch.

## Shard Inspection

GoPar3 can produce a list of all data shards in the given sources. The index information may be used to attempt manual restoration of damaged shards.

```sh
gopar3 inspect encoded-0-0.gopar3 encoded-0-1.gopar3 > index.json
```

## Similar Projects

- akalin/gopar: Go implementation of the original Par1 and Par2 formats.
- Par3 Specification: specification for the next generation of Pararchive; doomed by over-engineering, imho.