-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor routing logic to better handle failures #259
Conversation
// first remove from the load balancer, to prevent concurrent threads from making connections to them | ||
readers.remove( address ); | ||
writers.remove( address ); | ||
routers.remove( address ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if removing the address from routers
really is a good idea.
forget(...)
is called when a connection fails, and during routing we don't remove the address from the set of routers, perhaps we shouldn't in the general case of connection failure either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thobe If cluster told you about these fellows, why are you so keen to forget them?
If they are not doing their job, send them to the back of the queue.
But they just might come in handy later.
And cluster will update you with a full view some time in future, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I. e. s/forget/rank
The current "forget" behaviour grew out of a number of sessions working with the core-edge team. We shouldn't revert this without talking with them again and revisiting the reasons for making it this way. |
{ | ||
return clusterView.routingServers(); | ||
// TODO: the tests that use this should be testing for effect instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably just remove these methods, or do you want them in order to remember to update the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test in question is RoutingDriverStubTest
. Given the changes I've made, that test seems more interesting than the current RoutingDriverTest
, but:
- It is ignored.
- It makes assertions on internal state rather than behaviour.
Given that state, I didn't update that test. Since I didn't touch that test as all, I left these methods so that the code still compiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is ignored just because of current TC limitations. We still want all those tests to run locally, Zhen enabled them just the other day but had to revert because of boltkit issues. The plan is still that those tests should be unignored asap, but I guess we can fix that in a separate PR.
Nice improvements! |
{ | ||
connections.purge( address ); | ||
clusterView.remove(address); | ||
this.loadBalancer = new LoadBalancer( new RoutingSettings( 10, 5_000 ), clock, log, connections, seedAddress ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make these (10
and 5000
) configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
||
import org.neo4j.driver.internal.net.BoltServerAddress; | ||
|
||
class RoundRobinAddressSet extends AtomicInteger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't it just have an AtomicInteger
instead of extending, composition-over-inheritance and all that. A RoundRobinAddressSet
isn't really an integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I know. Java just sucks so badly though. It extends to avoid indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I wanted to do was:
abstract class RoundRobinAddressSet {
public static RoundRobinAddressSet newRoundRobinAddressSet() {
return new Implementation();
}
public abstract void BoltServerAddress next();
public abstract void update( Set<BoltServerAddress> addresses, Set<BoltServerAddress> removed );
public abstract void remove( BoltServerAddress address );
}
abstract class PaddingA extends RoundRobinAddressSet {
public long a0, a1, a2, a3, a4, a5, a6, a7;
}
abstract class TheArray extends PaddingA {
volatile BoltServerAddress[] addresses = NONE;
}
abstract class PaddingB extends TheArray {
public long b0, b1, b2, b3, b4, b5, b6, b7;
}
abstract class TheOffset extends PaddingB {
volatile int offset;
}
abstract class PaddingC extends TheOffset {
public long c0, c1, c2, c3, c4, c5, c6, c7;
}
final class Implementation extends PaddingC {
public void BoltServerAddress next() {
...
}
public synchronized void update( Set<BoltServerAddress> addresses, Set<BoltServerAddress> removed ) {
...
}
public synchronized void remove( BoltServerAddress address ) {
...
}
}
I opted for extending AtomicInteger instead, in the hope that future JDKs will have that class properly padded already, allowing me to avoid all that boilerplate.
Perhaps I should just punch out that code instead. Or maybe this doesn't matter at all, since this isn't the hot part of our code. I just reached for the standard stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it.
return addresses.length; | ||
} | ||
|
||
public BoltServerAddress next() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of liked that it used to have a name that wasn't next
- although hop
maybe wasn't ideal - the behaviour of the method is quite different from Iterator.next
so having it named differently makes that more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hop
was just confusing. But I'm open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could be something like iterate
, step
or something but come to think of it I think next
is probably better.
935b03d
to
a980fb5
Compare
Separate out different parts of the logic into different classes: * When to update the view of the cluster, and the maintenance of the set of servers in the cluster: - LoadBalancer * The ability to retrieve a list of servers with roles (ClusterComposition) from a server: - ClusterComposition.Provider * Round Robin cycling through a set of addresses (maintaining the order/position on addition/removal of members in the set): - RoundRobinAddressSet The new logic distinguishes being unable to connect to the routing servers and maintains the server in the list in this case, under the assumption that the server is only temporarily unavailable. The attempt to contact routing servers only fail after a number of failed attempts with an exponential back off in between. These changes introduce event based testing for some of the components, this makes it easier to test for the behaviour of units, rather than asserting on their state, while the explicitness of it makes it easier to follow than using mocks.
a980fb5
to
28a3c3f
Compare
@nigelsmall I changed the |
Separate out different parts of the logic into different classes:
The new logic distinguishes being unable to connect to the routing servers and maintains the server in the list in this case, under the assumption that the server is only temporarily unavailable. The attempt to contact routing servers only fail after a number of failed attempts with an exponential back off in between.
These changes introduce event based testing for some of the components, this makes it easier to test for the behaviour of units, rather than asserting on their state, while the explicitness of it makes it easier to follow than using mocks.