TransportService.connectToNode should validate remote node ID #22828
Conversation
test this please
left some suggestions, no blockers, but I'd love them to be addressed... no need for another review
/**
 * we try to reuse existing connections but if needed we will open a temporary connection
 * that nodes to be closed at the end of execution.
s/that nodes to be closed at the end of execution./to a node that will be closed at the end of the execution
?
sure.
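The Javadoc above describes a reuse-or-open-temporary connection lifecycle. A minimal sketch of that pattern, using simplified stand-in types (not the actual Transport classes) and try-with-resources to guarantee the temporary connection is closed at the end of execution:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TemporaryConnectionSketch {
    // hypothetical stand-in for a transport connection
    static class Connection implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() {
            closed = true;
            System.out.println("temporary connection closed");
        }
    }

    // long-lived connections keyed by node; reused when present
    static final Map<String, Connection> connectedNodes = new ConcurrentHashMap<>();

    static void execute(String node) {
        Connection existing = connectedNodes.get(node);
        if (existing != null) {
            // reuse the long-lived connection; do not close it here
            return;
        }
        // open a temporary connection that is closed at the end of execution
        try (Connection temporary = new Connection()) {
            // ... send the request over `temporary` ...
        }
    }

    public static void main(String[] args) {
        execute("unknown-node");
    }
}
```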
logger.info(
    (Supplier<?>) () -> new ParameterizedMessage(
        "failed to get local cluster state for {}, disconnecting...", nodeToPing), e);
latch.countDown();
I am not sure, but should we count down the latch only once we have notified the listener? I think that would be the better semantics... maybe in a finally block?
yeah, I tried to avoid the extra bloat of a finally block, but I agree.
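The finally-block semantics agreed on above can be sketched as follows. The listener interface and method names here are illustrative, not the actual fault-detection code: the point is that the latch is released even if notifying the listener throws.

```java
import java.util.concurrent.CountDownLatch;

public class LatchFinallySketch {
    // hypothetical stand-in for the ping listener
    interface Listener {
        void onFailure(Exception e);
    }

    static void notifyAndCountDown(Listener listener, Exception e, CountDownLatch latch) {
        try {
            listener.onFailure(e); // notify the listener first ...
        } finally {
            latch.countDown();     // ... then release waiters, even if the listener threw
        }
    }

    public static void main(String[] args) {
        CountDownLatch latch = new CountDownLatch(1);
        try {
            notifyAndCountDown(
                e -> { throw new RuntimeException("listener blew up"); },
                new Exception("ping failed"),
                latch);
        } catch (RuntimeException expected) {
            // the listener failure propagates, but the latch was still counted down
        }
        System.out.println("latch count: " + latch.getCount());
    }
}
```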
@@ -113,6 +114,8 @@
    private final NodesFaultDetection nodesFD;
    private final PublishClusterStateAction publishClusterState;
    private final MembershipAction membership;
    private final ThreadPool threadPool;

extra newline?
} catch (Exception e) {
    logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to send rejoin request to [{}]", otherMaster), e);
}
// spawn to a background thread to not do blocking operations on
// spawn to a background thread to not do blocking operations on
.. on? on what?
spawn to a background thread to not do blocking operations on the cluster state thread
thx.
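The intent of the completed comment above is to hand blocking work (like sending the rejoin request) to a background executor instead of running it on the cluster state applier thread. A minimal sketch of that idea, using a plain `ExecutorService` as an illustrative stand-in for the node's generic thread pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BackgroundSpawnSketch {
    public static void main(String[] args) throws InterruptedException {
        // stand-in for threadPool.generic()
        ExecutorService generic = Executors.newSingleThreadExecutor();

        // spawn to a background thread to not do blocking operations
        // on the cluster state thread
        generic.execute(() ->
            System.out.println("sending rejoin request off the cluster state thread"));

        generic.shutdown();
        generic.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```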
@@ -79,6 +79,19 @@ private ConnectionProfile(List<ConnectionTypeHandle> handles, int numConnections
    private TimeValue connectTimeout;
    private TimeValue handshakeTimeout;

    /** create an empty builder */
    public Builder() {

extra newline?
} catch (ConnectTransportException e) {
    throw e;
} catch (Exception e) {
    throw new ConnectTransportException(node, "general node connection failure", e);
} finally {
    if (success == false) {
        connectedNodes.remove(node);
I think you can move this to the inner catch then you don't need to remove it from the connectedNodes....
yeah, that's a good one. I made it so that exceptions from transportServiceAdapter.onNodeConnected(node) mean an unsuccessful connection, as will any exception that bubbles out of this method.
I think that's pretty much illegal; it should not throw anything, and failures should be handled internally
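The success-flag pattern under discussion can be sketched as follows, with simplified stand-in types (not the actual TcpTransport code): any exception escaping the method, including one thrown by the connected-node callback, rolls the connection back out of the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectSketch {
    static final Map<String, String> connectedNodes = new ConcurrentHashMap<>();

    static void connectToNode(String node, Runnable onNodeConnected) {
        boolean success = false;
        try {
            connectedNodes.put(node, "connection");
            onNodeConnected.run(); // if this throws, the connection is rolled back below
            success = true;
        } finally {
            if (success == false) {
                connectedNodes.remove(node);
            }
        }
    }

    public static void main(String[] args) {
        try {
            connectToNode("node-1", () -> { throw new RuntimeException("listener failed"); });
        } catch (RuntimeException expected) {
            // the listener failure propagated, and the node was removed again
        }
        System.out.println("connected: " + connectedNodes.containsKey("node-1"));
    }
}
```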
import org.elasticsearch.test.ESTestCase;

public class TransportServiceTests extends ESTestCase {
hmm what is this? :D you just wanna add some lines of code I guess 💃
hehe. I started to add a test and then noticed that AbstractSimpleTransportTestCase
has all the infra. Will clean this up.
*/
void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile) throws ConnectTransportException;

void connectToNode(DiscoveryNode node, ConnectionProfile connectionProfile,
                   CheckedBiConsumer<Connection, ConnectionProfile, IOException> connectionValidator) throws ConnectTransportException;
this must be non-null correct?
yeah, I think this deep in the infra it's better to have clarity. It's easy enough to pass a no-op from tests, and in production code we always pass something.
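The "always pass a validator, use a no-op in tests" convention can be sketched as follows. The `CheckedBiConsumer` interface below is a hypothetical stand-in for Elasticsearch's own, and the method body is simplified; the shape of the no-op lambda is the point.

```java
import java.io.IOException;

public class ValidatorSketch {
    // hypothetical stand-in for org.elasticsearch.common.CheckedBiConsumer
    @FunctionalInterface
    interface CheckedBiConsumer<T, U, E extends Exception> {
        void accept(T t, U u) throws E;
    }

    // validator is required to be non-null; callers that don't care pass a no-op
    static void connectToNode(String node,
                              CheckedBiConsumer<String, String, IOException> validator)
            throws IOException {
        validator.accept("connection", "connectionProfile");
        System.out.println("connected to " + node);
    }

    public static void main(String[] args) throws IOException {
        // a test-style no-op validator
        connectToNode("node-1", (connection, profile) -> {});
    }
}
```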
…en processed and the NodeConnectionService connect to the new master. This removes the need to explicitly connect to the master, which triggers an assertion due to the blocking operation on the cluster state thread. Relates to elastic#22828
…en processed and the NodeConnectionService connect to the new master (#23037) After the first cluster state from a new master is processed, NodeConnectionService guarantees we connect to the new master. This removes the need to explicitly connect to the master in the MasterFaultDetection code making it simpler and bypasses the assertion triggered due to the blocking operation on the cluster state thread. Relates to #22828
When a node receives a new cluster state from the master, it opens up connections to any new node in the cluster state. That has always been done serially on the cluster state thread but it has been a long standing TODO to do this concurrently, which is done by this PR. This is a spin-off of #22828, where an extra handshake is done whenever connecting to a node, which may slow down connecting. Also, the handshake is done in a blocking fashion which triggers assertions w.r.t. blocking requests on the cluster state thread. Instead of adding an exception, I opted to implement concurrent connections which both side steps the assertion and compensates for the extra handshake.
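The concurrent-connection approach the commit describes can be sketched as one task per new node on an executor, with a latch to wait until every connection attempt has finished. Node names and pool sizing here are illustrative, not the actual NodeConnectionsService code.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentConnectSketch {
    public static void main(String[] args) throws InterruptedException {
        List<String> newNodes = List.of("node-1", "node-2", "node-3");
        ExecutorService executor = Executors.newFixedThreadPool(newNodes.size());
        CountDownLatch latch = new CountDownLatch(newNodes.size());

        for (String node : newNodes) {
            executor.execute(() -> {
                try {
                    // connect + handshake would happen here, possibly blocking
                } finally {
                    latch.countDown(); // count down even if the connect failed
                }
            });
        }

        latch.await(); // all connection attempts have finished
        executor.shutdown();
        System.out.println("connection attempts finished: " + newNodes.size());
    }
}
```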
#22194 gave us the ability to open low level temporary connections to remote nodes based on their addresses. With this use case out of the way, actual full blown connections should validate the node on the other side, making sure we speak to who we think we speak to. This helps in cases where multiple nodes are started on the same host and a quick node restart causes them to swap addresses, which in turn can cause confusion down the road.
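The validation the PR description calls for can be sketched as a simple ID comparison after the transport handshake returns the remote node's identity; if a restarted node has taken over the address, the IDs won't match and the connection fails. The method and message below are illustrative stand-ins for the actual handshake code.

```java
public class HandshakeValidationSketch {
    static void validateRemoteNode(String expectedNodeId, String handshakeNodeId) {
        if (expectedNodeId.equals(handshakeNodeId) == false) {
            throw new IllegalStateException("handshake failed: expected node ["
                + expectedNodeId + "] but got [" + handshakeNodeId + "]");
        }
    }

    public static void main(String[] args) {
        validateRemoteNode("abc123", "abc123"); // same node: connection proceeds

        try {
            // addresses swapped after a restart: a different node answers
            validateRemoteNode("abc123", "def456");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```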
…TestCase & Netty3ScheduledPingTests broken by backport of #22828