
TransportService.connectToNode should validate remote node ID #22828

Merged (37 commits, Feb 7, 2017)

Conversation

@bleskes (Contributor) commented Jan 27, 2017

#22194 gave us the ability to open low-level temporary connections to remote nodes based on their address. With this use case out of the way, actual full-blown connections should validate the node on the other side, making sure we speak to who we think we speak to. This helps in cases where multiple nodes are started on the same host and a quick node restart causes them to swap addresses, which in turn can cause confusion down the road.
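As an illustration of the idea only, the check boils down to comparing the node ID the remote side reports during the handshake with the ID we expected. The class, method, and exception names below are hypothetical, not Elasticsearch's actual API:

```java
import java.util.Objects;

public class NodeIdValidation {

    // Compare the node ID we intended to connect to with the identity the
    // remote node reported in the handshake; reject the connection on mismatch.
    static void validateRemoteNode(String expectedNodeId, String handshakedNodeId) {
        if (!Objects.equals(expectedNodeId, handshakedNodeId)) {
            // Addresses can swap when multiple nodes on one host restart
            // quickly, so fail fast instead of talking to the wrong node.
            throw new IllegalStateException("handshake failed: expected node ["
                + expectedNodeId + "] but found [" + handshakedNodeId + "]");
        }
    }

    public static void main(String[] args) {
        validateRemoteNode("node-1", "node-1"); // IDs match: connection accepted
        try {
            validateRemoteNode("node-1", "node-2"); // swapped address: rejected
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```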

@bleskes (Author) commented Feb 6, 2017

@s1monw I rebased this to include #22984 and addressed your feedback. Can you take another look?

@bleskes (Author) commented Feb 6, 2017

test this please

@s1monw (Contributor) left a comment:

Left some suggestions; no blockers, but I'd love them to be addressed. No need for another review.


/**
* we try to reuse existing connections but if needed we will open a temporary connection
* that nodes to be closed at the end of execution.
@s1monw (Contributor):

s/that nodes to be closed at the end of execution./to a node that will be closed at the end of the execution?

@bleskes (Author):

sure.

logger.info(
(Supplier<?>) () -> new ParameterizedMessage(
"failed to get local cluster state for {}, disconnecting...", nodeToPing), e);
latch.countDown();
@s1monw (Contributor):

I'm not sure, but should we count down the latch once we've notified the listener? I think that would be the better semantics... maybe in a finally block?

@bleskes (Author):

yeah, I tried to avoid the extra bloat of a finally block, but I agree.
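The agreed-upon shape can be sketched with plain java.util.concurrent types; the method and listener names are illustrative, not the actual fault-detection code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Consumer;

public class LatchInFinally {

    // Notify the listener first, then count down the latch in a finally block,
    // so waiters only proceed after the listener ran, even if it throws.
    static void notifyThenCountDown(CountDownLatch latch,
                                    Consumer<Exception> listener,
                                    Exception failure) {
        try {
            listener.accept(failure);
        } finally {
            latch.countDown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        notifyThenCountDown(latch,
            e -> System.out.println("failed to get local cluster state: " + e.getMessage()),
            new Exception("boom"));
        latch.await(); // returns immediately: the listener ran before the countdown
    }
}
```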

@@ -113,6 +114,8 @@
private final NodesFaultDetection nodesFD;
private final PublishClusterStateAction publishClusterState;
private final MembershipAction membership;
private final ThreadPool threadPool;

@s1monw (Contributor):

extra newline?

} catch (Exception e) {
logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to send rejoin request to [{}]", otherMaster), e);
}
// spawn to a background thread to not do blocking operations on
@s1monw (Contributor):

// spawn to a background thread to not do blocking operations on .. on? on what?

@bleskes (Author):

spawn to a background thread to not do blocking operations on the cluster state thread

thx.

@@ -79,6 +79,19 @@ private ConnectionProfile(List<ConnectionTypeHandle> handles, int numConnections
private TimeValue connectTimeout;
private TimeValue handshakeTimeout;

/** create an empty builder */
public Builder() {

@s1monw (Contributor):

extra newline?

} catch (ConnectTransportException e) {
throw e;
} catch (Exception e) {
throw new ConnectTransportException(node, "general node connection failure", e);
} finally {
if (success == false) {
connectedNodes.remove(node);
@s1monw (Contributor):

I think you can move this to the inner catch then you don't need to remove it from the connectedNodes....

@bleskes (Author):

yeah, that's a good one. I made it so that exceptions from transportServiceAdapter.onNodeConnected(node) will mean an unsuccessful connection, as will any exception that bubbles out of this method.

@s1monw (Contributor):

I think that's pretty much illegal; it should not throw at all, and it should be handled internally.
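A minimal sketch of the success-flag pattern under discussion, with the connected-nodes set and listener mocked (hypothetical names, not the real transport code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectWithCleanup {
    static final Set<String> connectedNodes = ConcurrentHashMap.newKeySet();

    // Any exception leaving this method, including one thrown by the
    // onNodeConnected listener, rolls back the registration in the finally
    // block, so callers never observe a "connected" node whose connect failed.
    static void connectToNode(String node, Runnable onNodeConnected) {
        boolean success = false;
        connectedNodes.add(node);
        try {
            onNodeConnected.run();
            success = true;
        } finally {
            if (success == false) {
                connectedNodes.remove(node);
            }
        }
    }

    public static void main(String[] args) {
        connectToNode("node-1", () -> {}); // stays registered
        try {
            connectToNode("node-2", () -> { throw new RuntimeException("listener failed"); });
        } catch (RuntimeException e) {
            // node-2 was removed again in the finally block
        }
        System.out.println(connectedNodes); // only node-1 remains
    }
}
```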


import org.elasticsearch.test.ESTestCase;

public class TransportServiceTests extends ESTestCase {
@s1monw (Contributor):

hmm what is this? :D you just wanna add some lines of code I guess 💃

@bleskes (Author):

hehe. I started to add a test and then noticed that AbstractSimpleTransportTestCase has all the infra. Will clean this up.

*/
void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile) throws ConnectTransportException;
void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile,
CheckedBiConsumer<Connection, ConnectionProfile, IOException> connectionValidator) throws ConnectTransportException;
@s1monw (Contributor):

this must be non-null correct?

@bleskes (Author):

yeah, I think this deep in the infra it's better to have clarity. It's easy enough to pass a no-op from tests, and in production code we always pass something.
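The non-null contract can be sketched as follows; the `CheckedBiConsumer` stand-in and the `String` placeholders for `Connection`/`ConnectionProfile` are illustrative, not the actual Elasticsearch types:

```java
import java.io.IOException;

public class ValidatorSketch {

    // Stand-in for a two-argument consumer that may throw a checked exception,
    // mirroring the shape of the validator parameter in the diff above.
    interface CheckedBiConsumer<T, U, E extends Exception> {
        void accept(T t, U u) throws E;
    }

    // The validator is required: callers that don't need validation (e.g.
    // tests) pass an explicit no-op rather than null.
    static void connectToNode(String node, String profile,
            CheckedBiConsumer<String, String, IOException> validator) throws IOException {
        validator.accept(node, profile); // runs before the connection is registered
    }

    public static void main(String[] args) throws IOException {
        connectToNode("node-1", "default", (n, p) -> {}); // no-op from a test
        try {
            connectToNode("node-2", "default", (n, p) -> {
                throw new IOException("handshake found unexpected node");
            });
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```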

@bleskes bleskes merged commit ba06c14 into elastic:master Feb 7, 2017
@bleskes bleskes deleted the transport_validate_on_connect branch February 7, 2017 20:11
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Feb 8, 2017
…en processed and the NodeConnectionService connect to the new master

This removes the need to explicitly connect to the master, which triggers an assertion due to the blocking operation on the cluster state thread.

Relates to elastic#22828
bleskes added a commit that referenced this pull request Feb 8, 2017
…en processed and the NodeConnectionService connect to the new master (#23037)

After the first cluster state from a new master is processed, NodeConnectionService guarantees we connect to the new master. This removes the need to explicitly connect to the master in the MasterFaultDetection code, making it simpler and bypassing the assertion triggered by the blocking operation on the cluster state thread.

Relates to #22828
bleskes added a commit that referenced this pull request Feb 9, 2017
When a node receives a new cluster state from the master, it opens connections to any new node in the cluster state. That has always been done serially on the cluster state thread, but it has been a long-standing TODO to do this concurrently, which is done by this PR.

This is a spin-off of #22828, where an extra handshake is done whenever connecting to a node, which may slow down connecting. Also, the handshake is done in a blocking fashion, which triggers assertions w.r.t. blocking requests on the cluster state thread. Instead of adding an exception, I opted to implement concurrent connections, which both sidesteps the assertion and compensates for the extra handshake.
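The concurrent-connect idea can be sketched with an executor and a latch; the connect-and-handshake call is mocked and all names are illustrative, not the actual NodeConnectionsService code:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentConnect {

    // Fan the (blocking) connect-and-handshake calls out to a thread pool and
    // wait for all of them, instead of connecting serially on the caller thread.
    static Set<String> connectAll(List<String> newNodes, ExecutorService executor)
            throws InterruptedException {
        Set<String> connected = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(newNodes.size());
        for (String node : newNodes) {
            executor.execute(() -> {
                try {
                    connected.add(node); // the blocking connect + handshake would happen here
                } finally {
                    done.countDown();
                }
            });
        }
        done.await(); // proceed only once every connection attempt has finished
        return connected;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        System.out.println(connectAll(List.of("node-a", "node-b", "node-c"), pool));
        pool.shutdown();
    }
}
```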
bleskes added a commit that referenced this pull request Feb 9, 2017
#22194 gave us the ability to open low-level temporary connections to remote nodes based on their address. With this use case out of the way, actual full-blown connections should validate the node on the other side, making sure we speak to who we think we speak to. This helps in cases where multiple nodes are started on the same host and a quick node restart causes them to swap addresses, which in turn can cause confusion down the road.
bleskes added a commit that referenced this pull request Feb 9, 2017
…TestCase & Netty3ScheduledPingTests broken by backport of #22828
bleskes added a commit that referenced this pull request Feb 9, 2017
…en processed and the NodeConnectionService connect to the new master (#23037)

After the first cluster state from a new master is processed, NodeConnectionService guarantees we connect to the new master. This removes the need to explicitly connect to the master in the MasterFaultDetection code, making it simpler and bypassing the assertion triggered by the blocking operation on the cluster state thread.

Relates to #22828
3 participants