
Analysis: k-bucket underutilisation and excessive thrashing #194

Closed
raulk opened this issue Sep 6, 2018 · 17 comments
Labels: need/community-input · research · status/deferred

Comments

raulk (Member) commented Sep 6, 2018

This ticket summarises my findings around bucket utilisation and peer thrashing in the IPFS DHT, driven by go-libp2p-kad-dht.


Problem statement

As peers are discovered in the network, Kademlia stores them in a routing table logically composed of as many buckets as bits in the peer ID (or rather, its hash). I say logically because implementations can choose to physically store the table differently (e.g. unfolding lazily, like libp2p Kad DHT).

Each bucket represents a distance from our node, calculated by XORing the hash of the peer's ID with the hash of ours (key from now on) and counting the leading zeros – i.e. it boils down to how many prefix bits our keys share. This is called distance (or rather, closeness): the higher the value, the closer the peer is to us.

As a result, buckets follow a log2n distribution. The furthest bucket represents 1/2 of the addressable network, and each step closer halves that.
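For illustration, here is a minimal, standalone sketch of that calculation (not the actual go-libp2p-kad-dht/go-libp2p-kbucket code) mapping two peer IDs to the bucket index they would share:

    package main

    import (
        "crypto/sha256"
        "fmt"
        "math/bits"
    )

    // commonPrefixLen counts the leading bits shared by the SHA-256 hashes of two
    // peer IDs, i.e. the number of leading zeros of their XOR distance. That count
    // is the bucket index described above.
    func commonPrefixLen(a, b string) int {
        ha, hb := sha256.Sum256([]byte(a)), sha256.Sum256([]byte(b))
        cpl := 0
        for i := range ha {
            x := ha[i] ^ hb[i] // XOR distance, one byte at a time
            if x == 0 {
                cpl += 8
                continue
            }
            return cpl + bits.LeadingZeros8(x)
        }
        return cpl
    }

    func main() {
        // Bucket i is expected to hold roughly a 1/2^(i+1) fraction of the network,
        // which is why high bucket indices are almost never reached in practice.
        fmt.Println(commonPrefixLen("QmLocalPeerID", "QmRemotePeerID"))
    }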

Soon enough, the subtree represented by a bucket becomes so small that it is highly improbable to find a peer there. In fact, the Ethereum network bonks out pretty soon: at bucket 8 (if 0 is the furthest bucket), or bucket 248 (if 0 is the closest, i.e. us).

This phenomenon is discussed in section II(d) of the "Low-Resource Eclipse Attacks on Ethereum’s Peer-to-Peer Network" paper [1].

The below chart is extracted from that paper. It shows a 27-day run in which the first 6-7 buckets were filled very quickly, the next buckets took hours, and the remaining buckets stayed permanently vacant. Only ~4% of the routing table was ever utilised.

[Figure: bucket utilisation over the 27-day run, reproduced from the paper]

Assessing libp2p Kad DHT

I set out to analyse if this behaviour is present in the IPFS DHT. I instrumented go-libp2p-kad-dht and go-libp2p-kbucket with two types of log statements:

  • dump the routing table every 10 seconds; code.
  • log whenever we encounter a peer for a bucket that's full – which forces eviction; code.

I performed two runs of the IPFS daemon: a short run of ~3 minutes and a longer run of ~70 minutes.

Observations:

  • the table remains almost empty for the first minute. I suspect we kick off the bootstrapping/seeding process 1 minute from start, but I have to look for evidence.
  • we begin dropping peers just as the bootstrapping begins.
  • we discarded 841 peers in the 3-minute run, only retained ~140, and filled 6 buckets out of 256 + a partial bucket.
  • we discarded 8436 peers in the 70-minute run, only retained ~160, and filled 7 buckets out of 256 + a partial bucket.
  • the number of buckets and their fullness settles at the 1-2 minute mark with 7 buckets, and remains constant until minute 30, when a new bucket is unfolded. After that, no more changes.
  • we stabilise on a table structure very soon, but the bucket contents are thrashing constantly.

The libp2p Kad DHT implementation does a good job of unfolding buckets lazily as they are utilised, versus eagerly initialising all buckets at startup – although I'd argue the memory impact of one vs. the other is negligible.

Impact

It's clear to me that the linear k-bucket routing table is an inefficient data structure in this context. On the bright side, it does provide a heuristic to arrange the network (based on distance), although arguably a mediocre one.

The problem emerges when you combine these factors:

  • a distance formula yielding a log2n distribution.
  • a constant size limitation per bucket (k).
  • a policy to evict peers on full buckets.

Now you have a routing table imposing the same k limitation on the buckets holding 50% of the network and on the buckets holding 0.0001% of the network. This makes little sense.
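A back-of-the-envelope sketch of that mismatch (the network size here is an assumption, purely for illustration):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        const n = 1_000_000.0 // assumed network size, illustrative only
        const k = 20          // fixed bucket capacity

        // Bucket i is expected to contain roughly n / 2^(i+1) eligible peers.
        for _, i := range []int{0, 1, 7, 19} {
            expected := n / math.Pow(2, float64(i+1))
            fmt.Printf("bucket %2d: ~%.1f eligible peers for %d slots\n", i, expected, k)
        }
        // bucket 0 has ~500,000 candidates competing for k slots (constant churn),
        // while bucket 19 expects fewer than one peer (permanently vacant).
    }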

Furthermore, every entry beyond k is an opportunity to evict a peer from the bucket. Kademlia proposes to test the liveness of the least recently seen peer, although in libp2p Kad DHT we take a different approach (replacing the least recently seen peer immediately; see the sketch after the list below). This has several negative trade-offs:

  • you are testing peers too often (every time a new peer is found), and remember that 50% of peers fall in the furthest bucket, so that bucket receives a lot of testing pressure.
  • you are thrashing buckets too often, replacing the least recently seen peer immediately with the incoming one (libp2p Kad DHT's behaviour).
  • you are forgetting which peers you've seen / evaluated for inclusion before, so if you encounter an old peer again, you are going to repeat the same process, over and over again.
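For contrast, here is a minimal self-contained sketch of the policy Kademlia prescribes – ping the least recently seen peer and keep it if it answers. All names are illustrative; this is not the go-libp2p-kbucket API:

    package main

    import "fmt"

    const k = 20 // bucket capacity

    // bucket keeps peers ordered from most to least recently seen.
    type bucket struct {
        peers []string
    }

    // addPeer follows the Kademlia-prescribed policy: when the bucket is full,
    // ping the least recently seen peer and only evict it if it fails to answer.
    // libp2p Kad DHT instead replaces it immediately, which causes the thrashing
    // described above.
    func (b *bucket) addPeer(p string, alive func(string) bool) {
        if len(b.peers) < k {
            b.peers = append([]string{p}, b.peers...)
            return
        }
        oldest := b.peers[len(b.peers)-1]
        if alive(oldest) {
            return // the long-lived peer stays; the newcomer is dropped
        }
        // the stale peer is evicted and the newcomer takes its slot
        b.peers = append([]string{p}, b.peers[:len(b.peers)-1]...)
    }

    func main() {
        b := &bucket{}
        b.addPeer("QmSomePeer", func(string) bool { return true })
        fmt.Println(b.peers)
    }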

Wasted discoveries

During DHT lookups and recurring table filling operations, we encounter lots of peers. Peers that could be useful. But we discard them.

Peer entries probably occupy less than 100 bytes, so storing the ~8000 peers we discarded would have cost only about 800 KB (8000 × 100 bytes). A worthwhile investment if it gives us fast-track direct access (teleport) to arbitrary regions of the DHT.

Actually, it is worth noting that the first 7 buckets cover over 99% of the network (1 - 1/2^7 ≈ 99.2%). That's where the meat of the matter is. We should avoid on-demand iterative searches as much as possible, as they are inefficient and synchronous. Rather, we should populate our picture of the DHT in the background, in a way that allows us to teleport when lookups are demanded.

Some ideas:

  1. Further segmenting buckets into sub-buckets or trees that store sub-distances, e.g. a node with XOR distance 11010000 could be stored in bucket 1 (furthest from us), segment 13 (1101), and each bucket could have an invertible bloom filter to check membership quickly (a small sketch of the bucket/segment split follows below).
  2. Keeping the single-tier structure, but using adaptable bucket lengths according to some heuristic.
  3. Replacing the k-bucket table data structure with something totally different that still preserves the distance-based access on some level, i.e. keep the outward Kademlia API but change the mechanics. Some kind of tree.

To me, (1) and (3) are good paths forward here.
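To make idea (1) a bit more concrete, here is a small sketch of how the bucket/segment pair from that example could be derived from the XOR distance. The 4-bit segment width and the 1-based bucket numbering are assumptions taken from the example:

    package main

    import (
        "fmt"
        "math/bits"
    )

    // bucketAndSegment splits an 8-bit XOR distance into a coarse bucket
    // (1 = furthest, based on the leading-zero count) and a finer segment
    // (here simply the top 4 bits of the distance).
    func bucketAndSegment(distance uint8) (bucket, segment int) {
        bucket = bits.LeadingZeros8(distance) + 1
        segment = int(distance >> 4)
        return bucket, segment
    }

    func main() {
        // The example from idea (1): distance 11010000 -> bucket 1, segment 13.
        b, s := bucketAndSegment(0b11010000)
        fmt.Println(b, s)
    }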

References

[1] https://eprint.iacr.org/2018/236.pdf
[2] raulk/go-libp2p-kbucket@4acf71e
[3] raulk@f5c1563
[4] https://gist.github.com/raulk/95b269072e75a71f84a1bcf2fa237247
[5] https://gist.github.com/raulk/17aba7f263bba54c83cda1269e6bf5a7

raulk (Member, Author) commented Sep 6, 2018

Calling out a few people to get the discussion started here: @jhiesey @bigs @Stebalien @whyrusleeping @mgoelzer @carospiegly @diasdavid @jbenet.

bigs (Contributor) commented Sep 6, 2018

would be interested in dropping in a trie or an ordered map (supporting lookup LTE or GTE operations)

carospiegly commented:

As a side note (we totally might have talked about this yesterday): do we choose not to have a preference for established nodes because of security? I thought that Kademlia keeps established nodes around over new nodes?
@bigs would the ordered map have distance buckets as the keys?

raulk (Member, Author) commented Sep 6, 2018

@carospiegly I think our current protocol doesn’t use ping and pong messages – I vaguely recall hearing that (afk at the moment so I can’t check). Without that ability, we can’t test for node liveness (unless we issue a query). According to the Kademlia spec, when a bucket is full you pick an existing node and test if it’s still alive. If it is, you reject the incoming one, thus giving preference to elder nodes.

In Ethereum this is problematic because queries are only answered if the requester is in your table, so new nodes find it relatively hard to join the network IIRC.

raulk (Member, Author) commented Sep 6, 2018

@carospiegly Another thought I’m having is that the Kademlia RPC protocol never refers to distances. So as long as we can honour the behaviour of returning nodes close to a target, I think we’re good.

In other words, the arrangement by distance is purely local!

Additional data structures to consider include: skip lists, trees. We’ll probably need an invertible bloom filter for fast membership tests.

bigs (Contributor) commented Sep 6, 2018

it would have IDs as keys and could provide nlogn lookups for closest peers

bigs (Contributor) commented Sep 6, 2018

@raulk exactly. all we need is to find nearest ids!
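For illustration, here is a minimal binary-trie sketch of that nearest-ID lookup. It is hypothetical code, not an existing libp2p structure; keys are assumed to be fixed-length hashes of peer IDs:

    package main

    import "fmt"

    type trieNode struct {
        children [2]*trieNode
        key      []byte // set on leaf nodes only
    }

    // bit returns the i-th most significant bit of key.
    func bit(key []byte, i int) int {
        return int(key[i/8]>>uint(7-i%8)) & 1
    }

    func (t *trieNode) insert(key []byte) {
        n := t
        for i := 0; i < len(key)*8; i++ {
            b := bit(key, i)
            if n.children[b] == nil {
                n.children[b] = &trieNode{}
            }
            n = n.children[b]
        }
        n.key = key
    }

    // nearest walks the trie following the target's bits, falling back to the
    // sibling branch when the preferred one is missing. Preferring the matching
    // bit at the highest level minimises the XOR distance, because a differing
    // high bit outweighs all lower bits.
    func (t *trieNode) nearest(target []byte) []byte {
        n := t
        for i := 0; i < len(target)*8; i++ {
            b := bit(target, i)
            if n.children[b] == nil {
                b = 1 - b
            }
            if n.children[b] == nil {
                return nil // empty trie
            }
            n = n.children[b]
        }
        return n.key
    }

    func main() {
        root := &trieNode{}
        for _, k := range [][]byte{{0x10}, {0x80}, {0xf0}} {
            root.insert(k)
        }
        fmt.Printf("%x\n", root.nearest([]byte{0x90})) // prints 80, the XOR-closest key
    }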

ghost commented Sep 9, 2018

@Stebalien

bigs added the need/community-input, research, and status/deferred labels on Sep 18, 2018
bennofs commented Sep 19, 2018

The paper Sub-Second Lookups on a Large-Scale Kademlia-Based Overlay mentions a technique to counteract the log-distribution of buckets: make the buckets for the largest distances hold more entries (they changed the bucket sizes of the 4 farthest-away buckets from 8 to 64, 32, 16 and 8). This resulted in a significant decrease in lookup latency in their tests on the Kademlia-based Mainline DHT and might be an idea worth looking at.
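A tiny sketch of that scheme, assuming bucket 0 is the farthest bucket and every other bucket keeps the default capacity of 8:

    package main

    import "fmt"

    // bucketSize returns the capacity of bucket i under the scheme from the paper:
    // the four farthest buckets get 64, 32, 16 and 8 slots instead of a flat 8.
    func bucketSize(i int) int {
        sizes := []int{64, 32, 16, 8}
        if i < len(sizes) {
            return sizes[i]
        }
        return 8
    }

    func main() {
        for i := 0; i < 6; i++ {
            fmt.Printf("bucket %d: capacity %d\n", i, bucketSize(i))
        }
    }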

raulk (Member, Author) commented Sep 20, 2018

@bennofs thanks a lot for the reference! Their observations are consistent with my findings.

Personally I'm thinking of a complete redesign of the underlying data structure towards a segmented heap of some kind. I am not entirely convinced the concept of "distance" is as helpful as it claims to be.

I'm thinking of a data structure that:

  • Indexes peers by the subtree of the network they give us access to.
  • Maintains a balance of # of peers across all subtrees, to whatever depth (which can be adaptive). So this structure "rolls off" gradually up to a maximum depth.
  • Scores peers according to some heuristic (latency, failure rate, answer quality, etc.), or a mix of them.
  • Contains a "probing region" and a "quarantine region".
    • Newly discovered peers go into the probing region, where they are evaluated against the above heuristics. If they prove better than existing peers in their subtree, they replace those peers in the main segment for that subtree.
    • Displaced peers are placed into "quarantine", where they are given a chance to prove they can do better than the peers that replaced them.
  • We maintain an "eden region" for new peers – this gives new nodes a chance to get up to speed with the network and prove their usefulness before the standard heuristics kick in.
  • A periodic process probes peers in the main region that have been inactive for some time, to confirm their availability. If they have to be replaced, we query other nodes in that subtree to fill the vacant spot.

This is just off the top of my head, happy to hear thoughts.
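Purely to make that shape concrete, here is a rough skeleton of such a structure. Every name below is hypothetical; nothing like this exists in go-libp2p-kbucket today:

    package main

    import "time"

    type peerID string

    // peerScore aggregates the heuristics mentioned above.
    type peerScore struct {
        latency     time.Duration
        failureRate float64
        // ... answer quality, uptime, etc.
    }

    // subtree indexes the peers that give us access to one region of the keyspace.
    type subtree struct {
        prefix     []byte               // the keyspace region this subtree covers
        main       map[peerID]peerScore // established peers serving this subtree
        probing    map[peerID]peerScore // newly discovered peers under evaluation
        quarantine map[peerID]peerScore // displaced peers that may earn their way back
    }

    // segmentedTable balances the number of peers across subtrees, up to an
    // adaptive maximum depth, and gives brand-new peers a grace period in "eden".
    type segmentedTable struct {
        eden     map[peerID]time.Time
        subtrees []subtree
    }

    func main() {}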

bennofs commented Sep 20, 2018

@raulk I cannot comment on your proposal (I don't fully grok Kademlia yet :) but I found another explanation for the bucket thrashing:

If we look at the implementation of RoutingTable.Update, there is the following code:

	// New peer, add to bucket
	bucket.PushFront(p)
	rt.PeerAdded(p)

	// Are we past the max bucket size?
	if bucket.Len() > rt.bucketsize {
		// If this bucket is the rightmost bucket, and its full
		// we need to split it and create a new bucket
		if bucketID == len(rt.Buckets)-1 {
			rt.nextBucket()
		} else {
			// If the bucket cant split kick out least active node
			rt.PeerRemoved(bucket.PopBack())
		}
	}

The only path in which the new peer is not added to the bucket is if either it already is in the bucket or its latency is too high.

From this code, it appears to me that we are not at all implementing the "prefer older peers" behaviour from Kademlia: every new node is added to a bucket (we remove the least active node, but least active != not connected anymore). So if you are flooded with nodes that come online and then go offline again quickly (which for new nodes is likely – consider someone just trying IPFS for a short time), then the bucket is thrashed.

raulk (Member, Author) commented Sep 20, 2018

@bennofs Indeed. This is what I explained – albeit maybe not clearly – in my point above:

Kademlia proposes to test the liveness of the least recently seen peer, although in libp2p Kad DHT we take a different approach (replacing the least recently seen immediately). This has several negative trade-offs:
[...]

  • you are thrashing buckets too often, replacing the least recently seen peer immediately with the incoming one (libp2p Kad DHT's behaviour).

Looking to get involved with libp2p? Would be happy to collaborate on this issue, or on others ;-)

anacrolix (Contributor) commented:

My instinct is precisely what @bennofs described. Thrashing occurs because evictions are too easy. An existing implementation avoids this by only evicting bad nodes when the bucket is full. See https://github.com/anacrolix/dht/blob/b09db78595aaba14cd992537fbe992a4c5e0a141/server.go#L441-L450 and https://github.com/anacrolix/dht/blob/master/server.go#L457-L478 for examples of how this is done in other implementations.

anacrolix (Contributor) commented:

I've been experimenting with this; the biggest contributor is disconnects causing evictions from the routing table, followed by a bad bootstrapping process – particularly the random walks, which mainly flood the first bucket.

anacrolix (Contributor) commented:

Can we reduce this issue to something actionable? Specifically, I propose #45, an issue in kbucket for a better data structure based on research (per @bennofs' suggestion), and tangible items to improve the bootstrap cycles.

raulk (Member, Author) commented Jun 26, 2019

@aschmahmann - you might be interested in this, particularly #194 (comment)

Stebalien (Member) commented:

Fixed.
