Add option to random load balance to >1 server nodes #631
base: dev
Conversation
Thanks for the contribution @kamran-m.
Some changes I'd like to see:
- Benchmarking to show the performance before and after this change, since currently there are some perf regressions.
- Tests that cover all the different cases (peerConnectionCount = 0, peerConnectionCount = 1, peerConnectionCount < available peers, peerConnectionCount > available peers, etc.)
peer.go
Outdated
@@ -175,8 +186,8 @@ func (l *PeerList) Remove(hostPort string) error {
	return nil
}
func (l *PeerList) choosePeer(prevSelected map[string]struct{}, avoidHost bool) *Peer {
	var psPopList []*peerScore
	var ps *peerScore
	var chosenPSList []*peerScore
presize chosenPSList to avoid a bunch of allocations on each selection, especially when it's > 1
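As a minimal sketch of that presizing suggestion, assuming the configured count is available on the PeerList (the field name here is illustrative):

```go
// Allocate the slice once with the capacity we already know we need,
// so the appends while popping peers never have to grow it.
chosenPSList := make([]*peerScore, 0, l.peerConnectionCount)
```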
peer.go
Outdated
	ps.chosenCount.Inc()
	return ps.Peer
}

func randomSampling(psList []*peerScore) *peerScore {
	peerRand := trand.NewSeeded()
please do not create a new rand for every single peer selection request.
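A sketch of the suggested fix, assuming the seeded source is created once (for example when the PeerList is constructed) and reused for every selection. The field and constructor names are illustrative, and since *rand.Rand is not safe for concurrent use the call would have to stay under the list's existing lock:

```go
import (
	"math/rand"
	"time"
)

type PeerList struct {
	// ... existing fields elided ...
	peerRand *rand.Rand // seeded once, reused for every selection
}

func newPeerList() *PeerList {
	return &PeerList{
		peerRand: rand.New(rand.NewSource(time.Now().UnixNano())),
	}
}

// randomSampling picks one of the candidate peer scores uniformly at random,
// using the list's shared rand instead of creating a new one per call.
func (l *PeerList) randomSampling(psList []*peerScore) *peerScore {
	if len(psList) == 0 {
		return nil
	}
	return psList[l.peerRand.Intn(len(psList))]
}
```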
peer.go
Outdated
	}

	ps := randomSampling(chosenPSList)
	if ps == nil {
why would we need this check?
Thanks for the review @prashantv, I addressed your comments. I also added some benchmark tests with these results:
and also used the same test in the old code to compare with.
Please let me know if the
Sorry for the delay in reviewing @kamran-m, I was OOO and was catching up after I got back.
I've been thinking a little bit about whether this is the right way to do it -- I don't know that we should put every strategy into the TChannel core library. For example, see another strategy that we'd like: #640
If, instead of this change, we just ensured that TChannel had 2 connected peers, then:
- No changes are needed to TChannel
- You get the standard peer selection (least pending, with round-robin fallback if the scores are the same).
That would likely have even better load distribution.
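For illustration only, a rough sketch of the "least pending, round-robin fallback" idea described above -- this is not TChannel's actual selection code, just the general shape of the policy:

```go
// peerInfo is a stand-in type for this sketch.
type peerInfo struct {
	hostPort     string
	pendingCalls int   // calls currently in flight to this peer
	lastChosen   int64 // monotonically increasing selection counter
}

// pickPeer prefers the peer with the fewest pending calls; among ties it picks
// the peer chosen least recently, which degrades to round-robin when scores tie.
func pickPeer(peers []*peerInfo) *peerInfo {
	var best *peerInfo
	for _, p := range peers {
		if best == nil ||
			p.pendingCalls < best.pendingCalls ||
			(p.pendingCalls == best.pendingCalls && p.lastChosen < best.lastChosen) {
			best = p
		}
	}
	return best
}
```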
peer.go
Outdated
@@ -48,6 +47,9 @@ var (
	// ErrNoNewPeers indicates that no previously unselected peer is available.
	ErrNoNewPeers = errors.New("no new peer available")

	// ErrZeroPeerConnectionCount indicates that the peer connection count is set to zero.
	ErrZeroPeerConnectionCount = errors.New("peer connection count must be greater than 0")
No reason to export this error -- once you export an error, it's part of the public API and we can never change it. There shouldn't be a good reason for a user to compare a returned error against this, so it shouldn't be exported.
Makes sense!
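A quick sketch of the unexported version being suggested (same message, just no longer part of the public API):

```go
import "errors"

// Unexported: callers have no good reason to compare against this error,
// so keeping it private leaves us free to change it later.
var errZeroPeerConnectionCount = errors.New("peer connection count must be greater than 0")
```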
	for i := 0; i < b.N; i++ {
		peer, _ := ch.Peers().Get(nil)
		if peer == nil {
should this ever happen? If not, maybe we should do a t.Fatal instead of a println.
It shouldn't; I just added it to guard against compiler optimizations artificially lowering the runtime of the benchmark. Changed it to Fatal!
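The pattern being discussed looks roughly like the sketch below; the sink variable name is illustrative, and the channel setup from the existing benchmark is assumed:

```go
var benchPeerSink *Peer // package-level sink keeps the compiler from eliding the call

func getPeerLoop(b *testing.B, ch *Channel) {
	for i := 0; i < b.N; i++ {
		peer, _ := ch.Peers().Get(nil)
		if peer == nil {
			b.Fatal("Get returned no peer") // should never happen; fail loudly
		}
		benchPeerSink = peer
	}
}
```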
peer_test.go
Outdated

func BenchmarkGetPeerWithPeerConnectionCount10(b *testing.B) {
	numPeers := 10
	peerConnectionCount := uint32(10)
Instead of copying + pasting the whole benchmark, abstract common logic out into a function, and then call it with different args for the peerConnectionCount, and maybe even numPeers.
done!
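The refactor being asked for would look something like the sketch below; the setup helper and the extra benchmark names are illustrative (only BenchmarkGetPeerWithPeerConnectionCount10 appears in the diff above):

```go
func benchmarkGetPeer(b *testing.B, numPeers int, peerConnectionCount uint32) {
	// setupBenchChannel is a hypothetical helper that creates a channel with
	// peerConnectionCount configured and numPeers peers added.
	ch := setupBenchChannel(b, numPeers, peerConnectionCount)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if peer, _ := ch.Peers().Get(nil); peer == nil {
			b.Fatal("no peer available")
		}
	}
}

func BenchmarkGetPeerWithPeerConnectionCount1(b *testing.B)  { benchmarkGetPeer(b, 10, 1) }
func BenchmarkGetPeerWithPeerConnectionCount5(b *testing.B)  { benchmarkGetPeer(b, 10, 5) }
func BenchmarkGetPeerWithPeerConnectionCount10(b *testing.B) { benchmarkGetPeer(b, 10, 10) }
```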
	}{
		// the higher `peerConnectionCount` is, the smoother the impact of uneven scores
		// becomes, as we are randomly sampling among `peerConnectionCount` peers
		{numPeers: 10, peerConnectionCount: 1, distMin: 1000, distMax: 1000},
can you add peerConnectionCount: 2? I imagine this will be a pretty small value normally.
Indeed, the impact of it gets smaller with every extra connection. Added the case for 2!
	for _, tc := range testCases {
		// Selected is a map from rank -> [peer, count]
		// It tracks how often a peer gets selected at a specific rank.
		selected := make([]map[string]int, tc.numPeers)
why do you make a slice here? Why can't we just create a single map in the loop for numIterations, and do the testDistribution call right there too? Makes 3 loops 1 loop, removes a slice, and simplifies the test a little.
I am not sure if I follow. This is a similar test to TestPeerSelectionRanking, where we are checking the distribution of the ranking after numIterations, which means we have to do the checking outside of the loop. Or am I missing something here?
peer_test.go
Outdated
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		peer, _ := ch.Peers().Get(nil)
we should think about testing the impact of passing a non-nil blacklist with peerConnectionCount > 1
You mean if prevSelected actually has some selected peers? I am assuming that if a client uses peerConnectionCount > 1, it should take into account that the prevSelected list is going to be bigger. Or do you have another specific issue in mind?
Adding the possibility to set `peerConnectionCount` so that a single client node can do random load balancing among the top `peerConnectionCount` nodes, using the score calculator and peer heap.
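In outline, the selection this describes works like the sketch below (not the PR's actual code): pop up to `peerConnectionCount` best-scored peers from the heap, pick one of them uniformly at random, and push the others back. The heap method names used here are assumptions for the sketch.

```go
// chooseAmongTop pops up to n best-scored peers, picks one uniformly at
// random, and pushes the untaken ones back so their scores stay in the heap.
// The chosen peer would be re-added by the caller once its score is updated.
func chooseAmongTop(h *peerHeap, n int, rng *rand.Rand) *peerScore {
	popped := make([]*peerScore, 0, n)
	for i := 0; i < n && h.Len() > 0; i++ {
		popped = append(popped, h.popPeer())
	}
	if len(popped) == 0 {
		return nil
	}
	chosen := popped[rng.Intn(len(popped))]
	for _, ps := range popped {
		if ps != chosen {
			h.pushPeer(ps)
		}
	}
	return chosen
}
```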
I agree with the idea that the core should not be strategy-aware. As I commented on #630, I think all the logic regarding peer connection count, score strategy, peer heap, etc. should move to something like PeerSelectionStrategy. I am not sure why you think that with 2 connected peers we won't need this core change and would also get a better load distribution. This change basically lets the client decide how many connected peers they want. So in cases like ours, where we have a very low number of server/client nodes, it might actually be helpful to use 5 connected peers, as you can verify with the stats I posted above. Let me know your thoughts.
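A sketch of the kind of extension point this alludes to; the interface name and method signature are hypothetical, not an existing TChannel API:

```go
// PeerSelectionStrategy is a hypothetical pluggable policy: given the candidate
// peers and any peers to avoid, it decides which peer the call should use.
type PeerSelectionStrategy interface {
	Choose(candidates []*Peer, prevSelected map[string]struct{}) *Peer
}
```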
This closes #630.
I have also verified the change for a client => server setup with a low number of client/server nodes, where the imbalance was more severe (note that here I used `peerConnectionCount >= numPeers = 5`, which is the extreme case, but it's enough to verify the solution):