Add minimal Kademlia DHT spec #108

Merged · 40 commits · Jun 29, 2021

Commits:

* 5378e61 WIP Initial Kademlia DHT spec. (raulk, Nov 7, 2018)
* ee9734b format kad-dht spec. (raulk, Nov 8, 2018)
* 8b89dc2 Update Kademlia DHT spec (jhiesey, Dec 21, 2018)
* cb971aa dht: fix distance function (Stebalien, Aug 8, 2019)
* aee103e Merge branch 'libp2p/master' into kad-dht-spec (mxinden, May 7, 2021)
* 4d43958 kad-dht: Add document header (mxinden, May 7, 2021)
* 55d32ff kad-dht: Wrap lines at 80 and fix typos (mxinden, May 7, 2021)
* 3de65ae kad-dht: Document to use one stream per request (mxinden, May 7, 2021)
* 6909f8a kad-dht: Reword validation description (mxinden, May 10, 2021)
* b3693d0 kad-dht: Document FIND_NODE (mxinden, May 12, 2021)
* f05ee26 kad-dht: Restructure by DHT operations (mxinden, May 12, 2021)
* 6ae2551 kad-dht: Allow stream reuse and document length prefix (mxinden, May 14, 2021)
* 0126af1 kad-dht: Require closer peers even with value (mxinden, May 14, 2021)
* 008f1a8 kad-dht: Reword algorithm sections (mxinden, May 14, 2021)
* c21ca3a kad-dht: Rework references (mxinden, May 14, 2021)
* f240a22 kad-dht: Remove appendix detailing difference of implementations (mxinden, May 14, 2021)
* 6c1c224 kad-dht: Document early GET_VALUE termination (mxinden, May 14, 2021)
* defe08d kad-dht: Do not specify k-bucket split strategy (for now) (mxinden, May 14, 2021)
* e1442fa kad-dht: Deprecate PING message type (mxinden, May 14, 2021)
* 26dd2f3 kad-dht: Document timeReceived formatted with RFC3339 (mxinden, May 14, 2021)
* a9ec523 kad-dht: Use peer ID instead of node ID in bootstrap (mxinden, May 14, 2021)
* 77168f9 kad-dht: Fix peer routing link (mxinden, May 14, 2021)
* dbe1ff7 README: Add kademlia to protocols index (mxinden, May 14, 2021)
* aa7e8fc kad-dht: Fix protobuf indentation (mxinden, May 14, 2021)
* d742e2e kad-dht/README.md: Fix typo (mxinden, May 27, 2021)
* 9355a8f kad-dht/README: Remove requirement on kbucket data structure (mxinden, Jun 3, 2021)
* 072360f kad-dht/README: Restructure and reword DHT operations section (mxinden, Jun 3, 2021)
* c4d4b53 kad-dht/README: Seed with k instead of alpha peers (mxinden, Jun 3, 2021)
* b074091 kad-dht/README: Require algorithms to make progress towards target key (mxinden, Jun 9, 2021)
* a065aac kad-dht/README: Remove `/pk` special namespace (mxinden, Jun 25, 2021)
* 20b3b73 kad-dht/README: Replicate record to closest peers without it (mxinden, Jun 25, 2021)
* 3e13846 kad-dht/README: Demote validate purity to `should` (mxinden, Jun 25, 2021)
* 6ec65b5 kad-dht/README: Do not require storing provider addresses (mxinden, Jun 25, 2021)
* 1dcb218 kad-dht/README: Remove periodic record pruning section (mxinden, Jun 25, 2021)
* c755a41 kad-dht/README: Include bootstrap lookup for oneself (mxinden, Jun 25, 2021)
* e9c18bd kad-dht/README: Make k value recommended instead of default (mxinden, Jun 25, 2021)
* dab4549 kad-dht/README: Always return k closest peers with FIND_NODE (mxinden, Jun 25, 2021)
* 324f915 kad-dht/README: Extend on reasoning for quorums (mxinden, Jun 25, 2021)
* dbd17a2 kad-dht/README: Stress republishing to close nodes once more (mxinden, Jun 25, 2021)
* 3e6f8f5 kad-dht/README: Add disclaimer for bootstrap process (mxinden, Jun 28, 2021)

395 changes: 395 additions & 0 deletions kad-dht/README.md
# libp2p Kademlia DHT specification

The Kademlia Distributed Hash Table (DHT) subsystem in libp2p is a DHT
implementation largely based on the Kademlia [0] whitepaper, augmented with
notions from S/Kademlia [1], Coral [2] and mainlineDHT [3].

This specification assumes the reader has prior knowledge of those systems. So
rather than explaining DHT mechanics from scratch, we focus on differential
areas:

1. Specialisations and peculiarities of the libp2p implementation.
2. Actual wire messages.
3. Other algorithmic or non-standard behaviours worth pointing out.

For everything else that isn't explicitly stated herein, it is safe to assume
behaviour similar to Kademlia-based libraries.

Code snippets use a Go-like syntax.

## Authors

* Protocol Labs.

## Editors

* [Raúl Kripalani](https://github.com/raulk)
* [John Hiesey](https://github.com/jhiesey)

## Distance function (dXOR)

The libp2p Kad DHT uses the **XOR distance metric** as defined in the original
Kademlia paper [0]. Peer IDs are normalised through the SHA256 hash function.

For recap, `dXOR(sha256(id1), sha256(id2))` is the XOR of the SHA256 hashes of
the two peer IDs, interpreted as an unsigned integer. The number of leading
bits shared between the SHA256 hashes of our peer ID and that of a peer X
(i.e. the number of leading zero bits of the `dXOR`) designates the bucket
index that peer X will take up in the Kademlia routing table.
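
For illustration, here is a minimal sketch (not normative, and not the
go-libp2p-kad-dht implementation) of the distance and bucket index
computation over SHA256 digests:

```go
package main

import (
    "crypto/sha256"
    "fmt"
    "math/bits"
)

// xorDistance returns the byte-wise XOR of two hashed peer IDs. Read as a
// 256-bit big-endian integer, this is the Kademlia distance dXOR.
func xorDistance(a, b [32]byte) (d [32]byte) {
    for i := range a {
        d[i] = a[i] ^ b[i]
    }
    return d
}

// commonPrefixLen counts the leading zero bits of the distance, i.e. the
// length of the common bit prefix of the two hashes, which designates the
// bucket index.
func commonPrefixLen(d [32]byte) int {
    for i, x := range d {
        if x != 0 {
            return i*8 + bits.LeadingZeros8(x)
        }
    }
    return 256 // identical hashes
}

func main() {
    h1 := sha256.Sum256([]byte("QmPeerOne")) // stand-in peer IDs
    h2 := sha256.Sum256([]byte("QmPeerTwo"))
    fmt.Println(commonPrefixLen(xorDistance(h1, h2)))
}
```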

## Kademlia routing table

The data structure backing this system is a k-bucket routing table, closely
following the design outlined in the Kademlia paper [0]. The default value for
`k` is 20, and the maximum bucket count matches the size of the SHA256 function,
i.e. 256 buckets.

The routing table is unfolded lazily, starting with a single bucket at position
0 (representing the most distant peers), which is subsequently split as closer
peers are found and the capacity of the nearmost bucket is exceeded.
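
The lazy unfolding can be sketched as follows. This is purely illustrative
(the exact split strategy is implementation-specific): entries carry a
precomputed common prefix length `cpl` with the local hashed ID, and eviction
from full, non-splittable buckets is out of scope.

```go
package rtable

type entry struct {
    id  string // peer ID
    cpl int    // common prefix length with our own hashed ID
}

type table struct {
    k       int       // bucket capacity, e.g. 20
    buckets [][]entry // buckets[i] holds peers with cpl == i; the last
                      // bucket also absorbs everything closer (cpl >= i)
}

func newTable(k int) *table {
    return &table{k: k, buckets: make([][]entry, 1)} // start with one bucket
}

func (t *table) insert(e entry) {
    last := len(t.buckets) - 1
    i := e.cpl
    if i > last {
        i = last
    }
    if len(t.buckets[i]) < t.k {
        t.buckets[i] = append(t.buckets[i], e)
        return
    }
    // Only the nearmost bucket splits, and only up to 256 buckets.
    if i == last && last < 255 {
        t.split()
        t.insert(e)
    }
    // Otherwise: bucket full; an eviction policy would apply here.
}

// split moves entries closer than the old range into a new nearmost bucket.
func (t *table) split() {
    last := len(t.buckets) - 1
    var stay, move []entry
    for _, e := range t.buckets[last] {
        if e.cpl == last {
            stay = append(stay, e)
        } else {
            move = append(move, e)
        }
    }
    t.buckets[last] = stay
    t.buckets = append(t.buckets, move)
}
```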

## Alpha concurrency factor (α)

The concurrency of node and value lookups is limited by the parameter `α`, with
a default value of 3. This implies that each lookup process can perform no more
than 3 inflight requests at any given time.
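
One common way to enforce such a cap, shown here purely as a sketch rather
than the actual implementation, is a buffered channel used as a semaphore:

```go
package main

import "sync"

const alpha = 3 // concurrency factor α

// queryAll issues query(p) for every peer, keeping at most α requests in
// flight at any given time.
func queryAll(peers []string, query func(peer string)) {
    sem := make(chan struct{}, alpha)
    var wg sync.WaitGroup
    for _, p := range peers {
        wg.Add(1)
        sem <- struct{}{} // blocks while α requests are in flight
        go func(p string) {
            defer wg.Done()
            defer func() { <-sem }()
            query(p)
        }(p)
    }
    wg.Wait()
}
```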

## Record keys

Records in the DHT are keyed by CID [4], roughly speaking. There are intentions
to move to multihash [5] keys in the future, as certain CID components like the
multicodec are redundant. This will be an incompatible change.

The format of `key` varies depending on message type; however, in all cases
`dXOR(sha256(key1), sha256(key2))` (see [Distance
function](#distance-function-dxor)) is used as the distance between two keys.

* For `GET_VALUE` and `PUT_VALUE`, `key` is an unstructured array of bytes, except
if it is being used to look up a public key for a `PeerId`, in which case it is
the ASCII string '/pk/' concatenated with the binary `PeerId`.
* For `ADD_PROVIDER` and `GET_PROVIDERS`, `key` is interpreted and validated as
a CID.
* For `FIND_NODE`, `key` is a binary `PeerId`.

## Interfaces

The libp2p Kad DHT implementation satisfies the routing interfaces:

```go
type Routing interface {
    ContentRouting
    PeerRouting
    ValueStore

    // Bootstrap kicks off the bootstrap process.
    Bootstrap(context.Context) error
}

// ContentRouting is used to find information about who has what content.
type ContentRouting interface {
    // Provide adds the given CID to the content routing system. If 'true' is
    // passed, it also announces it, otherwise it is just kept in the local
    // accounting of which objects are being provided.
    Provide(context.Context, cid.Cid, bool) error

    // FindProvidersAsync searches for peers who are able to provide a given key.
    FindProvidersAsync(context.Context, cid.Cid, int) <-chan pstore.PeerInfo
}

// PeerRouting is a way to find information about certain peers.
//
// This can be implemented by a simple lookup table, a tracking server,
// or even a DHT (like herein).
type PeerRouting interface {
    // FindPeer searches for a peer with the given ID, returning a
    // pstore.PeerInfo with relevant addresses.
    FindPeer(context.Context, peer.ID) (pstore.PeerInfo, error)
}

// ValueStore is a basic Put/Get interface.
type ValueStore interface {
    // PutValue adds the value corresponding to the given key.
    PutValue(context.Context, string, []byte, ...ropts.Option) error

    // GetValue searches for the value corresponding to the given key.
    GetValue(context.Context, string, ...ropts.Option) ([]byte, error)
}
```

## Value lookups

When looking up an entry in the DHT, the implementor should collect at least `Q`
(quorum) responses from distinct nodes to check for consistency before returning
an answer.

Should the responses be different, the `Validator.Select()` function is used to
resolve the conflict and select the _best_ result.

**Entry correction.** Nodes that returned _worse_ records are updated via a
direct `PUT_VALUE` RPC call when the lookup completes. Thus the DHT network
eventually converges to the best value for each record, as a result of nodes
collaborating with one another.

### Algorithm

Let's assume we're looking for key `K`. We first try to fetch the value from the local store. If found, and `Q` is 0 or 1, the search is complete.

Otherwise, the local result counts for one towards the search of `Q` values. We then enter an iterative network search.

We keep track of:

* the number of values we've fetched (`cnt`).
* the best value we've found (`best`), and which peers returned it (`Pb`)
* the set of peers we've already queried (`Pq`) and the set of next query candidates sorted by distance from `K` in ascending order (`Pn`).
* the set of peers with outdated values (`Po`).

**Initialization**: seed `Pn` with the `α` peers from our routing table we know are closest to `K`, based on the XOR distance function.

**Then we loop:**

*WIP (raulk): lookup timeout.*

1. If we have collected `Q` or more answers, we cancel outstanding requests, return `best`, and we notify the peers holding an outdated value (`Po`) of the best value we discovered, by sending `PUT_VALUE(K, best)` messages. _Return._
2. Pick as many peers from the candidate peers (`Pn`) as the `α` concurrency factor allows. Send each a `GET_VALUE(K)` request, and mark it as _queried_ in `Pq`.
3. Upon a response:
1. If successful, and we receive a value:
1. If this is the first value we've seen, we store it in `best`, along
with the peer who sent it in `Pb`.
2. Otherwise, we resolve the conflict by calling `Validator.Select(best,
new)`:
1. If the new value wins, store it in `best`, and mark all formerly
“best" peers (`Pb`) as _outdated peers_ (`Po`). The current peer
becomes the new best peer (`Pb`).
2. If the new value loses, we add the current peer to `Po`.
2. If successful without a value, the response will contain the closest
nodes the peer knows to the key `K`. Add them to the candidate list `Pn`,
except for those that have already been queried.
3. If an error or timeout occurs, discard it.
4. Go to 1.
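
A compressed, single-threaded sketch of the loop above follows. All names
(`Peer`, `getValue`, `putValue`, `sortByDistance`) are illustrative stand-ins
rather than the actual go-libp2p-kad-dht API, and the `α`-way parallelism and
timeouts are elided:

```go
package lookup

type Peer = string // illustrative peer handle

// Validator mirrors the interface described in the next section.
type Validator interface {
    Select(key string, values [][]byte) (int, error)
}

// Assumed helpers (hypothetical): getValue/putValue issue the GET_VALUE and
// PUT_VALUE RPCs; sortByDistance orders peers by dXOR from the key.
func getValue(p Peer, key []byte) (value []byte, closer []Peer, err error) { return }
func putValue(p Peer, key []byte, value []byte)                            {}
func sortByDistance(key []byte, ps []Peer) []Peer                          { return ps }

func lookupValue(k []byte, q int, seed []Peer, v Validator) []byte {
    var (
        best []byte            // best value found so far
        pb   []Peer            // peers that returned `best`
        po   []Peer            // peers holding an outdated value
        pq   = map[Peer]bool{} // peers already queried
        pn   = sortByDistance(k, seed)
        cnt  int // values collected so far
    )
    for len(pn) > 0 && cnt < q {
        p := pn[0]
        pn = pn[1:]
        if pq[p] {
            continue
        }
        pq[p] = true
        val, closer, err := getValue(p, k)
        if err != nil {
            continue // 3.3: discard errors and timeouts
        }
        if val == nil {
            // 3.2: no value; consider the returned closer peers as candidates.
            pn = sortByDistance(k, append(pn, closer...))
            continue
        }
        cnt++
        if best == nil { // 3.1.1: first value seen
            best, pb = val, []Peer{p}
        } else if i, _ := v.Select(string(k), [][]byte{best, val}); i == 1 {
            po = append(po, pb...) // 3.1.2.1: former best peers are outdated
            best, pb = val, []Peer{p}
        } else {
            po = append(po, p) // 3.1.2.2: new value loses
        }
    }
    // 1: entry correction for peers that returned worse records.
    for _, p := range po {
        putValue(p, k, best)
    }
    return best
}
```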

## Entry validation

When constructing a DHT node, it is possible to supply a record `Validator`
object conforming to this interface:

```go
// Validator is an interface that should be implemented by record
// validators.
type Validator interface {
    // Validate validates the given record, returning an error if it's
    // invalid (e.g., expired, signed by the wrong key, etc.).
    Validate(key string, value []byte) error

    // Select selects the best record from the set of records (e.g., the
    // newest).
    //
    // Decisions made by select should be stable.
    Select(key string, values [][]byte) (int, error)
}
```

> **Review comment (Contributor):** nit (feel free to ignore): perhaps
> "should"? For a while go-ipfs had a non-pure function for IPNS records since
> it relied on finding the corresponding public key. As mentioned above those
> keys are now embedded. I totally agree that having non-pure functions here is
> a bit of a disaster, so maybe leaving it as "is" makes the most sense.
>
> **Review comment (Member):** Demoted to should with 3e13846. Thanks for the
> background @aschmahmann. To be honest, I don't think there is much value in
> including this sample interface in the first place, given that such an
> interface is likely only idiomatic in its own language. E.g. rust-libp2p is
> actor / message-passing based, thus a synchronous interface would not fit.
> That said, given that this has been discussed in this pull request at length
> in the past, I don't mind including it here.
>
> **Review comment (Contributor):** Ya, it's probably fine. Once we've got
> something we can start to clean it up, including removing language-specific
> details that seem extraneous in a spec.

`Validate()` should be a pure function that reports the validity of a record.
It may validate a cryptographic signature, or something else. It is called on
two occasions:

1. To validate incoming values in response to `GET_VALUE` calls.
2. To validate outgoing values before storing them in the network via
`PUT_VALUE` calls.

Similarly, `Select()` should be a pure function that returns the best record
out of two or more candidates. It may use a sequence number, a timestamp, or
another heuristic to make the decision.
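
As a purely hypothetical example, a validator for records whose values are
fixed-width big-endian sequence numbers could look like this (real validators,
e.g. for IPNS, decode a structured record instead):

```go
package validator

import (
    "encoding/binary"
    "errors"
)

// seqnumValidator treats each value as an 8-byte big-endian sequence
// number and selects the highest one. Illustrative only.
type seqnumValidator struct{}

func (seqnumValidator) Validate(key string, value []byte) error {
    if len(value) != 8 {
        return errors.New("value is not an 8-byte sequence number")
    }
    return nil
}

func (seqnumValidator) Select(key string, values [][]byte) (int, error) {
    best, bestSeq := -1, uint64(0)
    for i, v := range values {
        if len(v) != 8 {
            continue // skip malformed records
        }
        if seq := binary.BigEndian.Uint64(v); best == -1 || seq > bestSeq {
            best, bestSeq = i, seq
        }
    }
    if best == -1 {
        return 0, errors.New("no valid values")
    }
    return best, nil // stable: always the first highest sequence number
}
```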

## Public key records

Apart from storing arbitrary values, the libp2p Kad DHT stores node public keys
in records under the `/pk` namespace. That is, the entry `/pk/<peerID>` will
store the public key of peer `peerID`.

DHT implementations may optimise public key lookups by providing a
`GetPublicKey(peer.ID) (ci.PubKey)` method that, for example, first checks if
the key exists in the local peerstore.

The lookup for public key entries is identical to a standard entry lookup,
except that a custom `Validator` strategy is applied. It checks that the
equality `SHA256(value) == peerID` holds when:

1. Receiving a response from a `GET_VALUE` lookup.
2. Storing a public key in the DHT via `PUT_VALUE`.

The record is rejected if the validation fails.
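
Sketched in code, and simplifying by treating `peerID` as the raw SHA256
digest embedded in the key (real peer IDs are multihashes, which an
implementation would decode first), the check looks roughly like:

```go
package pkvalidator

import (
    "bytes"
    "crypto/sha256"
    "errors"
    "strings"
)

// validatePublicKey enforces SHA256(value) == peerID for /pk/<peerID>
// records. Simplified sketch: peerID is assumed to be a raw SHA256 digest.
func validatePublicKey(key string, value []byte) error {
    peerID, ok := strings.CutPrefix(key, "/pk/")
    if !ok {
        return errors.New("not a public key record")
    }
    digest := sha256.Sum256(value)
    if !bytes.Equal(digest[:], []byte(peerID)) {
        return errors.New("public key does not match peer ID")
    }
    return nil
}
```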

## Provider records

Nodes must keep track of which nodes advertise that they provide a given key
(CID). These provider advertisements should expire, by default, after 24 hours.
These records are managed through the `ADD_PROVIDER` and `GET_PROVIDERS`
messages.

When `Provide(key)` is called, the DHT finds the closest peers to `key` using
the `FIND_NODE` RPC, and then sends an `ADD_PROVIDER` RPC with its own
`PeerInfo` to each of these peers.

Each peer that receives the `ADD_PROVIDER` RPC should validate that the
received `PeerInfo` matches the sender's `peerID`, and if it does, that peer
must store the received `PeerInfo` record in its datastore.

When a node receives a `GET_PROVIDERS` RPC, it must look up the requested
key in its datastore, and respond with any corresponding records in its
datastore, plus a list of closer peers in its routing table.
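
An illustrative in-memory provider store honouring the 24-hour expiry might
look as follows (names are hypothetical, not the go-libp2p-kad-dht API):

```go
package providers

import (
    "sync"
    "time"
)

const providerTTL = 24 * time.Hour // default advertisement lifetime

type providerStore struct {
    mu      sync.Mutex
    entries map[string]map[string]time.Time // key (CID bytes) -> peer ID -> time added
}

func newProviderStore() *providerStore {
    return &providerStore{entries: map[string]map[string]time.Time{}}
}

func (s *providerStore) addProvider(key, peer string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.entries[key] == nil {
        s.entries[key] = map[string]time.Time{}
    }
    s.entries[key][peer] = time.Now()
}

func (s *providerStore) getProviders(key string) []string {
    s.mu.Lock()
    defer s.mu.Unlock()
    var out []string
    for p, added := range s.entries[key] {
        if time.Since(added) < providerTTL { // skip expired advertisements
            out = append(out, p)
        }
    }
    return out
}
```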

For performance reasons, a node may prune expired advertisements only
periodically, e.g. every hour.
> **Review comment (Contributor):** This is generic to all Puts/Gets. Your
> implementation can do whatever as long as you only return valid data. Your
> network may have implied and/or explicit expiration times per record type.
>
> **Review comment (Member):** This is an artifact of past versions of this
> pull request. Thinking about it some more, I am in favor of removing it. To
> me it is implementation specific. 1dcb218 removes it. @aschmahmann let me
> know if you disagree.

## Node lookups

_WIP (raulk)._

## Bootstrap process

The bootstrap process is responsible for keeping the routing table filled and
healthy throughout time. It runs once on startup, then periodically with a
configurable frequency (default: 5 minutes).

On every run, we generate a random peer ID and we look it up via the process
defined in *Node lookups*. Peers encountered throughout the search are inserted
in the routing table, as per usual business.

This process is repeated as many times per run as configuration parameter
`QueryCount` (default: 1). Every repetition is subject to a `QueryTimeout`
(default: 10 seconds), which upon firing, aborts the run.
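
A sketch of a bootstrap run, assuming hypothetical helpers `randomPeerID()`
and `lookupNode()` that perform the node lookup described above:

```go
package bootstrap

import (
    "context"
    "time"
)

// Assumed helpers (hypothetical).
func randomPeerID() []byte                          { return nil }
func lookupNode(ctx context.Context, target []byte) {}

// run performs queryCount lookups for random peer IDs, each bounded by
// queryTimeout. Peers discovered along the way refresh the routing table.
func run(ctx context.Context, queryCount int, queryTimeout time.Duration) {
    for i := 0; i < queryCount; i++ {
        runCtx, cancel := context.WithTimeout(ctx, queryTimeout)
        lookupNode(runCtx, randomPeerID())
        cancel()
    }
}

// schedule runs bootstrap once on startup and then periodically
// (default period: 5 minutes).
func schedule(ctx context.Context, period time.Duration) {
    run(ctx, 1, 10*time.Second)
    ticker := time.NewTicker(period)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            run(ctx, 1, 10*time.Second)
        case <-ctx.Done():
            return
        }
    }
}
```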

## RPC messages

See the [protobuf
definition](https://github.com/libp2p/go-libp2p-kad-dht/blob/master/pb/dht.proto).

On any error, the entire stream is reset. This is probably not the behavior we
want.

All RPC messages conform to the following protobuf:
```protobuf
// Record represents a dht record that contains a value
// for a key value pair
message Record {
  // The key that references this record
  bytes key = 1;

  // The actual value this record is storing
  bytes value = 2;

  // Note: These fields were removed from the Record message
  // hash of the author's public key
  // optional string author = 3;
  // A PKI signature for the key+value+author
  // optional bytes signature = 4;

  // Time the record was received, set by receiver
  string timeReceived = 5;
};

message Message {
  enum MessageType {
    PUT_VALUE = 0;
    GET_VALUE = 1;
    ADD_PROVIDER = 2;
    GET_PROVIDERS = 3;
    FIND_NODE = 4;
    PING = 5;
  }

  enum ConnectionType {
    // sender does not have a connection to peer, and no extra information (default)
    NOT_CONNECTED = 0;

    // sender has a live connection to peer
    CONNECTED = 1;

    // sender recently connected to peer
    CAN_CONNECT = 2;

    // sender recently tried to connect to peer repeatedly but failed to connect
    // ("try" here is loose, but this should signal "made strong effort, failed")
    CANNOT_CONNECT = 3;
  }

  message Peer {
    // ID of a given peer.
    bytes id = 1;

    // multiaddrs for a given peer
    repeated bytes addrs = 2;

    // used to signal the sender's connection capabilities to the peer
    ConnectionType connection = 3;
  }

  // defines what type of message it is.
  MessageType type = 1;

  // defines what coral cluster level this query/response belongs to.
  // in case we want to implement coral's cluster rings in the future.
  int32 clusterLevelRaw = 10; // NOT USED

  // Used to specify the key associated with this message.
  // PUT_VALUE, GET_VALUE, ADD_PROVIDER, GET_PROVIDERS
  bytes key = 2;

  // Used to return a value
  // PUT_VALUE, GET_VALUE
  Record record = 3;

  // Used to return peers closer to a key in a query
  // GET_VALUE, GET_PROVIDERS, FIND_NODE
  repeated Peer closerPeers = 8;

  // Used to return Providers
  // ADD_PROVIDER, GET_PROVIDERS
  repeated Peer providerPeers = 9;
}
```

Any time a relevant `Peer` record is encountered, the associated multiaddrs
are stored in the node's peerbook.

These are the requirements for each `MessageType`:
* `FIND_NODE`: `key` must be set in the request. `closerPeers` is set in the
response; for an exact match exactly one `Peer` is returned; otherwise `ncp`
(default: 6) closest `Peer`s are returned.

* `GET_VALUE`: `key` must be set in the request. If `key` is a public key
(begins with `/pk/`) and the key is known, the response has `record` set to
that key. Otherwise, `record` is set to the value for the given key (if found
in the datastore) and `closerPeers` is set to indicate closer peers.

* `PUT_VALUE`: `key` and `record` must be set in the request. The target
node validates `record`, and if it is valid, it stores it in the datastore.

* `GET_PROVIDERS`: `key` must be set in the request. The target node returns
the closest known `providerPeers` (if any) and the closest known `closerPeers`.

* `ADD_PROVIDER`: `key` and `providerPeers` must be set in the request. The
target node verifies that `key` is a valid CID; all `providerPeers` that
match the RPC sender's PeerID are recorded as providers.

* `PING`: Target node responds with `PING`. Nodes should respond to this
message but it is currently never sent.

# Appendix A: differences in implementations

The `addProvider` handler behaves differently across implementations:
* in js-libp2p-kad-dht, the sender is added as a provider unconditionally.
* in go-libp2p-kad-dht, it is added once per instance of that peer in the
`providerPeers` array.

---

# References

[0]: Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In P. Druschel, F. Kaashoek, & A. Rowstron (Eds.), Peer-to-Peer Systems (pp. 53–65). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-45748-8_5

[1]: Baumgart, I., & Mies, S. (2007). S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing. In Proceedings of the 2007 International Conference on Parallel and Distributed Systems (ICPADS). https://doi.org/10.1109/ICPADS.2007.4447808

[2]: Freedman, M. J., & Mazières, D. (2003). Sloppy Hashing and Self-Organizing Clusters. In IPTPS. Springer Berlin / Heidelberg. Retrieved from www.coralcdn.org/docs/coral-iptps03.ps

[3]: BEP 5: DHT Protocol. http://bittorrent.org/beps/bep_0005.html

[4]: ipld/cid: Self-describing content-addressed identifiers for distributed systems. https://github.com/ipld/cid

[5]: multiformats/multihash: Self describing hashes - for future proofing. https://github.com/multiformats/multihash