Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor routing logic to better handle failures #259

Merged
merged 1 commit into from
Oct 26, 2016

Conversation

thobe
Copy link
Contributor

@thobe thobe commented Oct 24, 2016

Separate out different parts of the logic into different classes:

  • When to update the view of the cluster, and the maintenance of the set of servers in the cluster:
    • LoadBalancer
  • The ability to retrieve a list of servers with roles (ClusterComposition) from a server:
    • ClusterComposition.Provider
  • Round Robin cycling through a set of addresses (maintaining the order/position on addition/removal of members in the set):
    • RoundRobinAddressSet

The new logic distinguishes being unable to connect to the routing servers and maintains the server in the list in this case, under the assumption that the server is only temporarily unavailable. The attempt to contact routing servers only fail after a number of failed attempts with an exponential back off in between.

These changes introduce event based testing for some of the components, this makes it easier to test for the behaviour of units, rather than asserting on their state, while the explicitness of it makes it easier to follow than using mocks.

// first remove from the load balancer, to prevent concurrent threads from making connections to them
readers.remove( address );
writers.remove( address );
routers.remove( address );
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if removing the address from routers really is a good idea.
forget(...) is called when a connection fails, and during routing we don't remove the address from the set of routers, perhaps we shouldn't in the general case of connection failure either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thobe If cluster told you about these fellows, why are you so keen to forget them?

If they are not doing their job, send them to the back of the queue.

But they just might come in handy later.

And cluster will update you with a full view some time in future, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I. e. s/forget/rank

@technige
Copy link
Contributor

technige commented Oct 24, 2016

The current "forget" behaviour grew out of a number of sessions working with the core-edge team. We shouldn't revert this without talking with them again and revisiting the reasons for making it this way.

{
return clusterView.routingServers();
// TODO: the tests that use this should be testing for effect instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably just remove these methods, or do you want them in order to remember to update the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test in question is RoutingDriverStubTest. Given the changes I've made, that test seems more interesting than the current RoutingDriverTest, but:

  1. It is ignored.
  2. It makes assertions on internal state rather than behaviour.

Given that state, I didn't update that test. Since I didn't touch that test as all, I left these methods so that the code still compiles.

Copy link
Contributor

@pontusmelke pontusmelke Oct 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is ignored just because of current TC limitations. We still want all those tests to run locally, Zhen enabled them just the other day but had to revert because of boltkit issues. The plan is still that those tests should be unignored asap, but I guess we can fix that in a separate PR.

@pontusmelke
Copy link
Contributor

Nice improvements!

{
connections.purge( address );
clusterView.remove(address);
this.loadBalancer = new LoadBalancer( new RoutingSettings( 10, 5_000 ), clock, log, connections, seedAddress );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make these (10 and 5000) configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we should.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


import org.neo4j.driver.internal.net.BoltServerAddress;

class RoundRobinAddressSet extends AtomicInteger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't it just have an AtomicInteger instead of extending, composition-over-inheritance and all that. A RoundRobinAddressSet isn't really an integer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know. Java just sucks so badly though. It extends to avoid indirection.

Copy link
Contributor Author

@thobe thobe Oct 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I wanted to do was:

abstract class RoundRobinAddressSet {
    public static RoundRobinAddressSet newRoundRobinAddressSet() {
        return new Implementation();
    }
    public abstract void BoltServerAddress next();
    public abstract void update( Set<BoltServerAddress> addresses, Set<BoltServerAddress> removed );
    public abstract void remove( BoltServerAddress address );
}
abstract class PaddingA extends RoundRobinAddressSet {
    public long a0, a1, a2, a3, a4, a5, a6, a7;
}
abstract class TheArray extends PaddingA {
    volatile BoltServerAddress[] addresses = NONE;
}
abstract class PaddingB extends TheArray {
    public long b0, b1, b2, b3, b4, b5, b6, b7;
}
abstract class TheOffset extends PaddingB {
    volatile int offset;
}
abstract class PaddingC extends TheOffset {
    public long c0, c1, c2, c3, c4, c5, c6, c7;
}
final class Implementation extends PaddingC {
    public void BoltServerAddress next() {
        ...
    }
    public synchronized void update( Set<BoltServerAddress> addresses, Set<BoltServerAddress> removed ) {
        ...
    }
    public synchronized void remove( BoltServerAddress address ) {
        ...
    }
}

I opted for extending AtomicInteger instead, in the hope that future JDKs will have that class properly padded already, allowing me to avoid all that boilerplate.

Perhaps I should just punch out that code instead. Or maybe this doesn't matter at all, since this isn't the hot part of our code. I just reached for the standard stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it.

return addresses.length;
}

public BoltServerAddress next()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of liked that it used to have a name that wasn't next - although hop maybe wasn't ideal - the behaviour of the method is quite different from Iterator.next so having it named differently makes that more obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hop was just confusing. But I'm open to suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could be something like iterate, step or something but come to think of it I think next is probably better.

@thobe thobe force-pushed the 1.1-routing-refactor branch from 935b03d to a980fb5 Compare October 25, 2016 11:14
Separate out different parts of the logic into different classes:
 * When to update the view of the cluster, and the maintenance of the
   set of servers in the cluster:
   - LoadBalancer
 * The ability to retrieve a list of servers with roles
   (ClusterComposition) from a server:
   - ClusterComposition.Provider
 * Round Robin cycling through a set of addresses (maintaining the
   order/position on addition/removal of members in the set):
  - RoundRobinAddressSet

The new logic distinguishes being unable to connect to the routing
servers and maintains the server in the list in this case, under the
assumption that the server is only temporarily unavailable. The
attempt to contact routing servers only fail after a number of failed
attempts with an exponential back off in between.

These changes introduce event based testing for some of the components,
this makes it easier to test for the behaviour of units, rather than
asserting on their state, while the explicitness of it makes it easier
to follow than using mocks.
@thobe thobe force-pushed the 1.1-routing-refactor branch from a980fb5 to 28a3c3f Compare October 25, 2016 11:36
@thobe
Copy link
Contributor Author

thobe commented Oct 25, 2016

@nigelsmall I changed the forget() behaviour slightly, and added a test case: https://github.com/neo4j/neo4j-java-driver/pull/259/files#diff-aa4312a82350e77851bd1d009952fea9R546

@pontusmelke pontusmelke merged commit 9dcbbf9 into neo4j:1.1 Oct 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants