
TransportService.connectToNode should validate remote node ID #22828

Merged (37 commits, Feb 7, 2017)

Conversation

@bleskes (Contributor) commented Jan 27, 2017

#22194 gave us the ability to open low-level temporary connections to remote nodes based on their address. With this use case out of the way, actual full-blown connections should validate the node on the other side, making sure we speak to who we think we speak to. This helps in cases where multiple nodes are started on the same host and a quick node restart causes them to swap addresses, which in turn can cause confusion down the road.
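As an illustration of the idea only, the check boils down to comparing the node ID the remote side reports during the handshake with the ID we expected. The class, method, and exception names below are hypothetical, not Elasticsearch's actual API:

```java
import java.util.Objects;

public class NodeIdValidation {

    // Compare the node ID we intended to connect to with the identity the
    // remote node reported in the handshake; reject the connection on mismatch.
    static void validateRemoteNode(String expectedNodeId, String handshakedNodeId) {
        if (!Objects.equals(expectedNodeId, handshakedNodeId)) {
            // Addresses can swap when multiple nodes on one host restart
            // quickly, so fail fast instead of talking to the wrong node.
            throw new IllegalStateException("handshake failed: expected node ["
                + expectedNodeId + "] but found [" + handshakedNodeId + "]");
        }
    }

    public static void main(String[] args) {
        validateRemoteNode("node-1", "node-1"); // IDs match: connection accepted
        try {
            validateRemoteNode("node-1", "node-2"); // swapped address: rejected
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```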

@bleskes (Author) commented Feb 6, 2017

@s1monw I rebased this to include #22984 and addressed your feedback. Can you take another look?

@bleskes (Author) commented Feb 6, 2017

test this please

@s1monw (Contributor) left a comment:

Left some suggestions; no blockers, but I'd love them to be addressed. No need for another review.


/**
* we try to reuse existing connections but if needed we will open a temporary connection
* that nodes to be closed at the end of execution.
@s1monw (Contributor):

s/that nodes to be closed at the end of execution./to a node that will be closed at the end of the execution?

@bleskes (Author):

sure.

logger.info(
(Supplier<?>) () -> new ParameterizedMessage(
"failed to get local cluster state for {}, disconnecting...", nodeToPing), e);
latch.countDown();
@s1monw (Contributor):

I'm not sure, but should we count down the latch once we've notified the listener? I think that would be the better semantics... maybe in a finally block?

@bleskes (Author):

yeah, I tried to avoid the extra bloat of a finally block, but I agree.
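The agreed-upon shape can be sketched with plain java.util.concurrent types; the method and listener names are illustrative, not the actual fault-detection code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Consumer;

public class LatchInFinally {

    // Notify the listener first, then count down the latch in a finally block,
    // so waiters only proceed after the listener ran, even if it throws.
    static void notifyThenCountDown(CountDownLatch latch,
                                    Consumer<Exception> listener,
                                    Exception failure) {
        try {
            listener.accept(failure);
        } finally {
            latch.countDown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        notifyThenCountDown(latch,
            e -> System.out.println("failed to get local cluster state: " + e.getMessage()),
            new Exception("boom"));
        latch.await(); // returns immediately: the listener ran before the countdown
    }
}
```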

@@ -113,6 +114,8 @@
private final NodesFaultDetection nodesFD;
private final PublishClusterStateAction publishClusterState;
private final MembershipAction membership;
private final ThreadPool threadPool;

@s1monw (Contributor):

extra newline?

} catch (Exception e) {
logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to send rejoin request to [{}]", otherMaster), e);
}
// spawn to a background thread to not do blocking operations on
@s1monw (Contributor):

// spawn to a background thread to not do blocking operations on .. on? on what?

@bleskes (Author):

spawn to a background thread to not do blocking operations on the cluster state thread

thx.

@@ -79,6 +79,19 @@ private ConnectionProfile(List<ConnectionTypeHandle> handles, int numConnections
private TimeValue connectTimeout;
private TimeValue handshakeTimeout;

/** create an empty builder */
public Builder() {

@s1monw (Contributor):

extra newline?

} catch (ConnectTransportException e) {
throw e;
} catch (Exception e) {
throw new ConnectTransportException(node, "general node connection failure", e);
} finally {
if (success == false) {
connectedNodes.remove(node);
@s1monw (Contributor):

I think you can move this to the inner catch then you don't need to remove it from the connectedNodes....

@bleskes (Author):

yeah, that's a good one. I made it so that exceptions from transportServiceAdapter.onNodeConnected(node) will mean an unsuccessful connection, as will any exception that bubbles out of this method.

@s1monw (Contributor):

I think that's pretty much illegal; it should not throw at all, and it should be handled internally.
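A minimal sketch of the success-flag pattern under discussion, with the connected-nodes set and listener mocked (hypothetical names, not the real transport code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectWithCleanup {
    static final Set<String> connectedNodes = ConcurrentHashMap.newKeySet();

    // Any exception leaving this method, including one thrown by the
    // onNodeConnected listener, rolls back the registration in the finally
    // block, so callers never observe a "connected" node whose connect failed.
    static void connectToNode(String node, Runnable onNodeConnected) {
        boolean success = false;
        connectedNodes.add(node);
        try {
            onNodeConnected.run();
            success = true;
        } finally {
            if (success == false) {
                connectedNodes.remove(node);
            }
        }
    }

    public static void main(String[] args) {
        connectToNode("node-1", () -> {}); // stays registered
        try {
            connectToNode("node-2", () -> { throw new RuntimeException("listener failed"); });
        } catch (RuntimeException e) {
            // node-2 was removed again in the finally block
        }
        System.out.println(connectedNodes); // only node-1 remains
    }
}
```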


import org.elasticsearch.test.ESTestCase;

public class TransportServiceTests extends ESTestCase {
@s1monw (Contributor):

hmm what is this? :D you just wanna add some lines of code I guess 💃

@bleskes (Author):

hehe. I started to add a test and then noticed that AbstractSimpleTransportTestCase has all the infra. Will clean this up.

*/
void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile) throws ConnectTransportException;
void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile,
CheckedBiConsumer<Connection, ConnectionProfile, IOException> connectionValidator) throws ConnectTransportException;
@s1monw (Contributor):

this must be non-null correct?

@bleskes (Author):

yeah, I think this deep in the infra it's better to have clarity. It's easy enough to pass a no-op from tests, and in production code we always pass something.
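The non-null contract can be sketched as follows; the `CheckedBiConsumer` stand-in and the `String` placeholders for `Connection`/`ConnectionProfile` are illustrative, not the actual Elasticsearch types:

```java
import java.io.IOException;

public class ValidatorSketch {

    // Stand-in for a two-argument consumer that may throw a checked exception,
    // mirroring the shape of the validator parameter in the diff above.
    interface CheckedBiConsumer<T, U, E extends Exception> {
        void accept(T t, U u) throws E;
    }

    // The validator is required: callers that don't need validation (e.g.
    // tests) pass an explicit no-op rather than null.
    static void connectToNode(String node, String profile,
            CheckedBiConsumer<String, String, IOException> validator) throws IOException {
        validator.accept(node, profile); // runs before the connection is registered
    }

    public static void main(String[] args) throws IOException {
        connectToNode("node-1", "default", (n, p) -> {}); // no-op from a test
        try {
            connectToNode("node-2", "default", (n, p) -> {
                throw new IOException("handshake found unexpected node");
            });
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```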

@bleskes bleskes merged commit ba06c14 into elastic:master Feb 7, 2017
@bleskes bleskes deleted the transport_validate_on_connect branch February 7, 2017 20:11
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Feb 8, 2017
…en processed and the NodeConnectionService connect to the new master

This removes the need to explicitly connect to the master, which triggers an assertion due to the blocking operation on the cluster state thread.

Relates to elastic#22828
bleskes added a commit that referenced this pull request Feb 8, 2017
…en processed and the NodeConnectionService connect to the new master (#23037)

After the first cluster state from a new master is processed, NodeConnectionService guarantees we connect to the new master. This removes the need to explicitly connect to the master in the MasterFaultDetection code, making it simpler and bypassing the assertion triggered by the blocking operation on the cluster state thread.

Relates to #22828
bleskes added a commit that referenced this pull request Feb 9, 2017
When a node receives a new cluster state from the master, it opens connections to any new node in the cluster state. That has always been done serially on the cluster state thread, but it has been a long-standing TODO to do this concurrently, which is done by this PR.

This is a spin-off of #22828, where an extra handshake is done whenever connecting to a node, which may slow down connecting. Also, the handshake is done in a blocking fashion, which triggers assertions w.r.t. blocking requests on the cluster state thread. Instead of adding an exception, I opted to implement concurrent connections, which both sidesteps the assertion and compensates for the extra handshake.
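The concurrent-connect idea can be sketched with an executor and a latch; the connect-and-handshake call is mocked and all names are illustrative, not the actual NodeConnectionsService code:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentConnect {

    // Fan the (blocking) connect-and-handshake calls out to a thread pool and
    // wait for all of them, instead of connecting serially on the caller thread.
    static Set<String> connectAll(List<String> newNodes, ExecutorService executor)
            throws InterruptedException {
        Set<String> connected = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(newNodes.size());
        for (String node : newNodes) {
            executor.execute(() -> {
                try {
                    connected.add(node); // the blocking connect + handshake would happen here
                } finally {
                    done.countDown();
                }
            });
        }
        done.await(); // proceed only once every connection attempt has finished
        return connected;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        System.out.println(connectAll(List.of("node-a", "node-b", "node-c"), pool));
        pool.shutdown();
    }
}
```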
bleskes added a commit that referenced this pull request Feb 9, 2017
#22194 gave us the ability to open low-level temporary connections to remote nodes based on their address. With this use case out of the way, actual full-blown connections should validate the node on the other side, making sure we speak to who we think we speak to. This helps in cases where multiple nodes are started on the same host and a quick node restart causes them to swap addresses, which in turn can cause confusion down the road.
bleskes added a commit that referenced this pull request Feb 9, 2017
…TestCase & Netty3ScheduledPingTests broken by backport of #22828
bleskes added a commit that referenced this pull request Feb 9, 2017
…en processed and the NodeConnectionService connect to the new master (#23037)

After the first cluster state from a new master is processed, NodeConnectionService guarantees we connect to the new master. This removes the need to explicitly connect to the master in the MasterFaultDetection code, making it simpler and bypassing the assertion triggered by the blocking operation on the cluster state thread.

Relates to #22828
3 participants