DEX #57

Merged
merged 3 commits into from
Feb 13, 2017

Conversation

daviddias
Member

@whyrusleeping let's fill the implementations chapter with pointers to where the current chunker and layout are implemented, and then align on what the interfaces should be for these importers.

@whyrusleeping
Member

For importers, we have the default chunker, which splits the input stream into blocks of 256k, and one that uses Rabin fingerprints for the chunking. The interface for our chunker looks like:

type Chunker interface {
    NextBytes() ([]byte, error)
}

NextBytes() gets called on the chunker until no bytes remain, at which point it returns an EOF sentinel error (implementation details).
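To make the calling convention concrete, here is a minimal sketch of a fixed-size chunker that satisfies the interface above. The names `sizeChunker` and `chunkString` are hypothetical and are not taken from go-ipfs, and using `io.EOF` as the sentinel is an assumption based on the description above.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Chunker is the interface quoted in the comment above.
type Chunker interface {
	NextBytes() ([]byte, error)
}

// sizeChunker is a hypothetical fixed-size chunker (not the go-ipfs type).
// size must be positive.
type sizeChunker struct {
	r    io.Reader
	size int
}

// NextBytes returns the next block of up to size bytes.
// It returns io.EOF once no bytes remain (assumed sentinel).
func (c *sizeChunker) NextBytes() ([]byte, error) {
	buf := make([]byte, c.size)
	n, err := io.ReadFull(c.r, buf)
	switch err {
	case nil:
		return buf, nil
	case io.ErrUnexpectedEOF:
		// The stream ended mid-block: return the short final block.
		return buf[:n], nil
	default:
		// io.EOF when nothing was read, or a real read error.
		return nil, err
	}
}

// chunkString drains a chunker over s, calling NextBytes until io.EOF.
func chunkString(s string, size int) ([]string, error) {
	var c Chunker = &sizeChunker{r: strings.NewReader(s), size: size}
	var out []string
	for {
		b, err := c.NextBytes()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, string(b))
	}
}

func main() {
	chunks, err := chunkString("abcdefghij", 4)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", chunks)
}
```

The loop in `chunkString` is the part a layout engine would own: it keeps pulling blocks and treats the sentinel as a normal end of input rather than a failure.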

The second part of our importing code is the layout engine; we have two of those as well. The default is the balanced tree. The tree has a width at each layer of 256k/sizeof(link).
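As a rough illustration of what that width implies, the arithmetic can be sketched as follows. The link size of 64 bytes is purely an assumption for the example (the thread does not give sizeof(link)), as is reading 256k as 256 * 1024 bytes and counting a single block as depth 1. `layerWidth` and `leafCapacity` are hypothetical helper names.

```go
package main

import "fmt"

// layerWidth returns how many links fit in one block.
func layerWidth(blockSize, linkSize int) int {
	return blockSize / linkSize
}

// leafCapacity returns the maximum number of leaf blocks in a tree of the
// given depth, where a single block counts as depth 1.
func leafCapacity(width, depth int) int {
	capacity := 1
	for d := 1; d < depth; d++ {
		capacity *= width
	}
	return capacity
}

func main() {
	const blockSize = 256 * 1024 // assumed meaning of "256k"
	const assumedLinkSize = 64   // assumption, not a value from go-ipfs

	w := layerWidth(blockSize, assumedLinkSize)
	for depth := 1; depth <= 3; depth++ {
		fmt.Printf("depth %d: up to %d leaf blocks\n", depth, leafCapacity(w, depth))
	}
}
```

Under these assumed numbers the width is 4096, so capacity grows by a factor of 4096 per extra layer; the real figure depends on the actual encoded link size.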

The general algorithm for building the balanced tree is this:

If only one block exists, it is its own depth=1 tree. From depth=1 trees, we can generate depth=2 trees by generating up to MAXWIDTH depth=1 trees and adding them as children of a new node. Using the same recursive logic, we can generate a tree of any depth dynamically as more and more data comes in from the chunker (eliminating the need to know the data size beforehand to select a depth). Note that data is ONLY stored in the leaf nodes; storing data in intermediate nodes is complicated and fragile, and not conducive to effective deduplication (although some significant latency gains might be made by doing so).
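The growth rule above can be sketched in a few lines. This is a simplified illustration and not the go-ipfs code: `node`, `source`, `fill`, `buildBalanced`, and `leaves` are hypothetical names, blocks are plain strings, and `maxWidth` is set to 3 only so the shape is easy to inspect.

```go
package main

import "fmt"

const maxWidth = 3 // stand-in for MAXWIDTH; tiny value chosen for illustration

// node holds data only when it is a leaf (no children).
type node struct {
	data     string
	children []*node
}

// source hands out blocks one at a time, like a drained chunker.
type source struct {
	blocks []string
	pos    int
}

func (s *source) done() bool { return s.pos >= len(s.blocks) }

func (s *source) next() string {
	b := s.blocks[s.pos]
	s.pos++
	return b
}

// fill builds a tree of at most the given depth from the remaining blocks.
// It must only be called while the source still has blocks.
func fill(s *source, depth int) *node {
	if depth == 1 {
		return &node{data: s.next()}
	}
	n := &node{}
	for len(n.children) < maxWidth && !s.done() {
		n.children = append(n.children, fill(s, depth-1))
	}
	return n
}

// buildBalanced grows the tree one layer at a time as blocks arrive,
// so the total size never has to be known up front.
func buildBalanced(blocks []string) (*node, int) {
	s := &source{blocks: blocks}
	if s.done() {
		return nil, 0
	}
	root, depth := fill(s, 1), 1
	for !s.done() {
		// The current tree becomes the first child of a new, deeper root.
		newRoot := &node{children: []*node{root}}
		depth++
		for len(newRoot.children) < maxWidth && !s.done() {
			newRoot.children = append(newRoot.children, fill(s, depth-1))
		}
		root = newRoot
	}
	return root, depth
}

// leaves returns leaf data in left-to-right order.
func leaves(n *node) []string {
	if n == nil {
		return nil
	}
	if len(n.children) == 0 {
		return []string{n.data}
	}
	var out []string
	for _, c := range n.children {
		out = append(out, leaves(c)...)
	}
	return out
}

func main() {
	blocks := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
	root, depth := buildBalanced(blocks)
	fmt.Println("depth:", depth, "leaves:", leaves(root))
}
```

Reading the leaves left to right reproduces the input order, and interior nodes carry only links, which matches the leaf-only data rule described above.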

The second layout algorithm is called the trickledag, and is something I wrote to provide a data format better suited to sequential streaming of content. The basic idea is that a depth N tree has X leaf nodes as children, and Y trees of each depth up to N-1. It's essentially an expanded binomial heap construction. The advantage is that at each point in traversing the tree (at least sequentially) you can make a single request and get real data. That code is here: https://github.com/ipfs/go-ipfs/blob/master/importer/trickle/trickledag.go
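One way to read that description is sketched below. It is an illustration under stated assumptions and not a copy of the linked file: `maxLeaves` stands for X and is set to 2 only for readability, `layerRepeat` stands for Y and uses the repetition constant of 4 mentioned in this thread, a depth-1 subtree is taken to be a node with only leaves, and all type and function names are hypothetical.

```go
package main

import "fmt"

const (
	maxLeaves   = 2 // X: direct leaf children per node (assumed, small for illustration)
	layerRepeat = 4 // Y: subtrees added per depth
)

type tnode struct {
	isLeaf   bool
	data     string
	children []*tnode
}

type blockSource struct {
	blocks []string
	pos    int
}

func (s *blockSource) done() bool { return s.pos >= len(s.blocks) }

func (s *blockSource) next() string {
	b := s.blocks[s.pos]
	s.pos++
	return b
}

// fillLeaves attaches up to maxLeaves data blocks directly to n.
func fillLeaves(s *blockSource, n *tnode) {
	for i := 0; i < maxLeaves && !s.done(); i++ {
		n.children = append(n.children, &tnode{isLeaf: true, data: s.next()})
	}
}

// fillTrickle gives n its leaves first, then layerRepeat subtrees of each
// depth from 1 up to depth-1.
func fillTrickle(s *blockSource, n *tnode, depth int) {
	fillLeaves(s, n)
	for d := 1; d < depth && !s.done(); d++ {
		for i := 0; i < layerRepeat && !s.done(); i++ {
			child := &tnode{}
			fillTrickle(s, child, d)
			n.children = append(n.children, child)
		}
	}
}

// buildTrickle applies the same rule to the root with no depth limit,
// so the tree keeps growing as long as blocks keep arriving.
func buildTrickle(blocks []string) *tnode {
	s := &blockSource{blocks: blocks}
	root := &tnode{}
	fillLeaves(s, root)
	for d := 1; !s.done(); d++ {
		for i := 0; i < layerRepeat && !s.done(); i++ {
			child := &tnode{}
			fillTrickle(s, child, d)
			root.children = append(root.children, child)
		}
	}
	return root
}

// walk returns leaf data in depth-first, left-to-right order.
func walk(n *tnode) []string {
	if n.isLeaf {
		return []string{n.data}
	}
	var out []string
	for _, c := range n.children {
		out = append(out, walk(c)...)
	}
	return out
}

// makeBlocks returns n placeholder blocks named b0, b1, ...
func makeBlocks(n int) []string {
	out := make([]string, n)
	for i := range out {
		out[i] = fmt.Sprintf("b%d", i)
	}
	return out
}

func main() {
	root := buildTrickle(makeBlocks(20))
	fmt.Println("root children:", len(root.children))
	fmt.Println("stream order:", walk(root))
}
```

Because every node lists its own leaves before any subtree, a sequential reader that fetches a node can immediately fetch data from it, which is the streaming property described above.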

@daviddias
Member Author

daviddias commented Dec 6, 2016

For context and reference material

Trickledag

The TrickleDAG layout is a binomial heap with dynamic width and multiple repetitions at each layer, defined by a constant (so far it is 4 and has never changed).


The trickledag was added to go-ipfs in ipfs/kubo#713.

@daviddias daviddias changed the title WIP: Data Importing Spec DEX Feb 13, 2017
@daviddias daviddias merged commit 827769e into master Feb 13, 2017
@daviddias daviddias deleted the ipfs/data-importing branch February 13, 2017 16:25
3 participants