
Feat: Track Session Peer Latency More Accurately #149

Merged
merged 4 commits on Jul 15, 2019

Conversation

hannahhoward
Contributor

Goals

Track the speeds between session peers more accurately

Implementation

The current SessionPeerManager uses a very simple algorithm for sorting peers by optimization -- it simply orders them by last block received.

This change tracks how long each peer actually takes to respond to requests, and sorts peers not only by which is fastest but also provides useful information (an optimization rating from 0 to 1, with 1 being the fastest peer) about how they compare to each other.

The steps are as follows:

  1. For a broadcast request, track all responses (not just the first one) until a preset timeout period (5 seconds for now), and use that to establish optimization ratings between peers.
  2. For targeted requests, individually measure the time for each peer requested, up to a timeout period. If a response is received, record the total time. If no response is received but a cancel was sent (usually because another peer responded first), ignore it. If no response is received before the timeout and no cancel was sent, record the full timeout period as that peer's latency.
  3. Weight the latency of the most recent response most heavily (0.5 * last response + 0.5 * previous latency rating); a sketch of this update follows the list.
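A minimal sketch of the weighting in step 3, with illustrative names (this is not the PR's exact code):

import "time"

// newWeight is the weight given to the most recent sample -- the 0.5 "fall off"
// raised for discussion below.
const newWeight = 0.5

// updateLatency blends a freshly measured response time into a peer's running
// latency estimate: 0.5 * latest sample + 0.5 * previous estimate.
func updateLatency(previous, measured time.Duration) time.Duration {
	if previous == 0 {
		return measured // first sample, nothing to blend with
	}
	return time.Duration(newWeight*float64(measured) + (1-newWeight)*float64(previous))
}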

For Discussion

  • Is this too complicated and specific? (It doesn't feel that way to me -- it feels like we should produce the best information possible.)
  • Is the fall-off right (0.5)?
  • Is the timeout right (5 seconds)?
  • Does the logic around which timeouts matter make sense?

Commits

  • Return optimized peers in real latency order, weighted toward recent requests
  • When fetching optimized peers from the peer manager, return an optimization rating, and pass it on to the request splitter
    BREAKING CHANGE: interface change to the GetOptimizedPeers and SplitRequests public package methods
  • Better estimate latency per peer by tracking cancellations
  • Send duplicate responses to the session peer manager to track latencies
@hannahhoward
Contributor Author

hannahhoward commented Jul 4, 2019

Note the marked improvement in time for some benchmarks:

benchmark                                                                                           old ns/op       new ns/op      delta
BenchmarkDups2Nodes/AllToAll-OneAtATime-2                                                           2071401035      2072688572     +0.06%
BenchmarkDups2Nodes/AllToAll-BigBatch-2                                                             88909019        90046606       +1.28%
BenchmarkDups2Nodes/Overlap1-OneAtATime-2                                                           2632222013      2632567139     +0.01%
BenchmarkDups2Nodes/Overlap2-BatchBy10-2                                                            820683679       820798869      +0.01%
BenchmarkDups2Nodes/Overlap3-OneAtATime-2                                                           2627422739      2067928550     -21.29%
BenchmarkDups2Nodes/Overlap3-BatchBy10-2                                                            822213067       818108666      -0.50%
BenchmarkDups2Nodes/Overlap3-AllConcurrent-2                                                        707189322       701354193      -0.83%
BenchmarkDups2Nodes/Overlap3-BigBatch-2                                                             701004548       693238080      -1.11%
BenchmarkDups2Nodes/Overlap3-UnixfsFetch-2                                                          692404913       215097237      -68.93%
BenchmarkDups2Nodes/10Nodes-AllToAll-OneAtATime-2                                                   2069193746      2075425311     +0.30%
BenchmarkDups2Nodes/10Nodes-AllToAll-BatchFetchBy10-2                                               241809647       243263661      +0.60%
BenchmarkDups2Nodes/10Nodes-AllToAll-BigBatch-2                                                     98872270        96828694       -2.07%
BenchmarkDups2Nodes/10Nodes-AllToAll-AllConcurrent-2                                                95828461        95103353       -0.76%
BenchmarkDups2Nodes/10Nodes-AllToAll-UnixfsFetch-2                                                  115383212       114733473      -0.56%
BenchmarkDups2Nodes/10Nodes-OnePeerPerBlock-OneAtATime-2                                            6552511357      6558910244     +0.10%
BenchmarkDups2Nodes/10Nodes-OnePeerPerBlock-BigBatch-2                                              1281881927      1309517705     +2.16%
BenchmarkDups2Nodes/10Nodes-OnePeerPerBlock-UnixfsFetch-2                                           1110855308      1108554936     -0.21%
BenchmarkDups2Nodes/200Nodes-AllToAll-BigBatch-2                                                    907350546       957346823      +5.51%
BenchmarkDupsManyNodesRealWorldNetwork/200Nodes-AllToAll-BigBatch-FastNetwork-2                     2642276485      2375770917     -10.09%
BenchmarkDupsManyNodesRealWorldNetwork/200Nodes-AllToAll-BigBatch-AverageVariableSpeedNetwork-2     4176594592      3007236195     -28.00%
BenchmarkDupsManyNodesRealWorldNetwork/200Nodes-AllToAll-BigBatch-SlowVariableSpeedNetwork-2        13381514550     7773090900     -41.91%

Member

@Stebalien left a comment


Overall, this looks awesome! My main comments are:

  1. Let's document the interfaces (when certain functions should be called).
  2. Can we write a benchmark that issues many parallel requests for unavailable content? This adds some complicated per-CID logic so I'm a bit worried about issues like #154.

// OptimizedPeer describes a peer and its level of optimization from 0 to 1.
type OptimizedPeer struct {
	Peer               peer.ID
	OptimizationRating float64
}
Member


Can we comment on what this rating means?
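Going by the PR description, the comment could read something like this (wording is only a suggestion):

	// OptimizationRating is a relative score from 0 to 1 describing how quickly
	// this peer has responded compared to the session's other peers, with 1
	// being the fastest peer.
	OptimizationRating float64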

request, ok := lt.requests[key]
var latency time.Duration
if ok {
	latency = time.Now().Sub(request.startedAt)
Member


nit: time.Since(request.startedAt)
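Applied, the suggestion is equivalent to the current code but reads a little cleaner:

if ok {
	latency = time.Since(request.startedAt)
}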


func (ptm *peerTimeoutMessage) handle(spm *SessionPeerManager) {
	data, ok := spm.activePeers[ptm.p]
	if !ok || !data.lt.WasCancelled(ptm.k) {
Member


Should this be ok && !data.lt.....? That is, do we want to record timeouts for inactive peers?

Member


(that is, won't this add these peers to the active set?)
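If timeouts should only be recorded for peers already being tracked, the flipped guard would look roughly like this (recordTimeout here is a stand-in for whatever bookkeeping the handler actually performs):

func (ptm *peerTimeoutMessage) handle(spm *SessionPeerManager) {
	data, ok := spm.activePeers[ptm.p]
	// Only act on peers already in the active set whose request was not cancelled.
	if ok && !data.lt.WasCancelled(ptm.k) {
		data.lt.recordTimeout(ptm.k) // stand-in for the actual timeout bookkeeping
	}
}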

@Stebalien
Member

IMO, the constants are fine. Ideally, we'd somehow "learn" them, but I can't think of a simple way to do so.

@Stebalien
Member

BenchmarkDupsManyNodesRealWorldNetwork/200Nodes-AllToAll-BigBatch-FastNetwork-2                     2642276485      2375770917     -10.09%
BenchmarkDupsManyNodesRealWorldNetwork/200Nodes-AllToAll-BigBatch-AverageVariableSpeedNetwork-2     4176594592      3007236195     -28.00%
BenchmarkDupsManyNodesRealWorldNetwork/200Nodes-AllToAll-BigBatch-SlowVariableSpeedNetwork-2        13381514550     7773090900     -41.91%

This ^^ really shows that it's working. That's exactly what I'd expect from latency tracking. This is going to be a really nice boost. ❤️

@Kubuxu
Member

Kubuxu commented Jul 6, 2019

Prioritize latency of last response (0.5 * last response + 0.5 * previous latency rating)

From my experience with digital signal processing and networking systems, the alpha parameter of 0.5 in the exponential moving average seems high.

Could you set up logging (probably a separate logger) with the raw data so we can try tweaking these parameters? As an example, one lost packet over TCP will incur 2-3 RTTs of jitter on top of the normal latency.


From my simulations in MATLAB, with the network connection modelled by a Rayleigh distribution, a packet drop probability of 0.5%, and an EMA for latency tracking, 0.5 seems too high. Take a look:
[Figure: EMA simulation plot]

Especially since we only track the latency of blocks that are actually transferred, not latency in general.

Matlab script for those interested: https://gist.github.com/Kubuxu/5c58022d7af6b1f3dfb66f0eae5a730c
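To make the sensitivity concrete, here is the same EMA update run on a single latency spike with two different alphas (a toy example, not the PR's code):

package main

import (
	"fmt"
	"time"
)

// ewma blends a new sample into the running estimate; alpha is the weight of
// the newest sample (the PR effectively uses 0.5).
func ewma(prev, sample time.Duration, alpha float64) time.Duration {
	return time.Duration(alpha*float64(sample) + (1-alpha)*float64(prev))
}

func main() {
	prev := 100 * time.Millisecond  // steady-state estimate
	spike := 400 * time.Millisecond // one retransmit-sized outlier

	// With alpha = 0.5 a single outlier drags the estimate to 250ms;
	// with a smaller alpha such as 0.1 it only moves to 130ms.
	fmt.Println(ewma(prev, spike, 0.5)) // 250ms
	fmt.Println(ewma(prev, spike, 0.1)) // 130ms
}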

@Stebalien
Member

I'm going to merge this as strictly better than what we have. Given our new release process, I'm confident that we'll catch any regressions (if any) before they hit users.

@Stebalien merged commit 8f0e4c6 into master Jul 15, 2019
Jorropo pushed a commit to Jorropo/go-libipfs that referenced this pull request Jan 26, 2023

Feat: Track Session Peer Latency More Accurately

This commit was moved from ipfs/go-bitswap@8f0e4c6