5.4.0 transport client failed to get local cluster state while using 5.3.0 to connect to 5.4.0 servers works #24575

Closed
fangqiao opened this issue May 10, 2017 · 11 comments · Fixed by #24632
Labels
>bug :Distributed Coordination/Network Http and internode communication implementations feedback_needed v5.4.0

Comments

@fangqiao

fangqiao commented May 10, 2017

Hi, I have a REST service built on Netty that connects to an Elasticsearch backend via the Java transport client API.
It worked very well with Netty 4.1.8 and ES 5.3.0.
Now I tried to upgrade the ES backend and transport client to 5.4.0, and also Netty to 4.1.9. Then the following problem happened:

10 May 2017;17:01:59.645 Developer linux-68qh [elasticsearch[client][generic][T#3]] INFO o.e.c.t.TransportClientNodesService - failed to get local cluster state for {#transport#-1}{WlTQjgcGQ1uqyNNsw4ZnAw}{127.0.0.1}{127.0.0.1:9300}, disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][127.0.0.1:9300][cluster:monitor/state] request_id [7] timed out after [5001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I rolled back the transport client to 5.3.0 but kept the backend at 5.4.0.

Then it is able to connect to the ES backend.
I use SBT, and the build dependencies when the error occurs are:

"io.netty" % "netty-all" % "4.1.9.Final"
"org.elasticsearch" % "elasticsearch" % "5.4.0"
"org.elasticsearch.client" % "transport" % "5.4.0",
and "io.netty" % "netty-transport-native-epoll" % "4.1.9.Final" classifier "linux-x86_64"

Environment:

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (IcedTea 3.3.0) (suse-3.3-x86_64)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

Linux linux-68qh 4.10.13-1-default #1 SMP PREEMPT Thu Apr 27 12:23:31 UTC 2017 (e5d11ce) x86_64 x86_64 x86_64 GNU/Linux

Thanks

@tlrx tlrx added :Distributed Coordination/Network Http and internode communication implementations >bug v5.4.0 labels May 11, 2017
@tlrx
Member

tlrx commented May 11, 2017

It looks like a bug to me. Is sniffing enabled on your transport client?

@fangqiao
Author

Yes it is enabled.

@jloisel

jloisel commented May 11, 2017

Same issue here. We have:

  • Spring Boot v1.5.3.RELEASE,
  • Switched from Elasticsearch 5.3.2 to 5.4.0,
  • using Transport Client with sniff enabled.

Client and Elasticsearch both on the same machine, connecting through localhost:

  • When using TransportClient 5.3.2 to connect to Elastic 5.4.0 => OK,
  • 5.4.0 to 5.4.0 => KO.

The exception we have on startup:

org.elasticsearch.transport.ReceiveTimeoutTransportException: [][127.0.0.1:9300][cluster:monitor/state] request_id [7] timed out after [5000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]

@nwadams

nwadams commented May 11, 2017

I am seeing this issue as well on some nodes connecting to ES. We run a service with multiple machines that each connect to ES; some of them connect successfully and others do not.
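
One way to see which nodes the client actually ended up connected to on a given machine is to log TransportClient#connectedNodes() after startup; on the hosts that fail, this shows whether the sampler kept any nodes at all. A minimal sketch (the class and method names below are made up for illustration):

import java.util.List;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.cluster.node.DiscoveryNode;

public final class ClientDiagnostics {
    // Prints the nodes the transport client is currently connected to.
    public static void logConnectedNodes(TransportClient client) {
        List<DiscoveryNode> nodes = client.connectedNodes();
        if (nodes.isEmpty()) {
            System.out.println("transport client is not connected to any node");
        }
        for (DiscoveryNode node : nodes) {
            System.out.println(node.getId() + " -> " + node.getAddress());
        }
    }
}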

@tlrx
Member

tlrx commented May 11, 2017

Thanks for reporting, I think I know where the issue is.

tlrx added a commit to tlrx/elasticsearch that referenced this issue May 11, 2017
With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is activated.

closes elastic#24575
closes elastic#24557
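
To illustrate the failure mode the commit message describes, here is a simplified, hypothetical sketch (none of these classes exist in the Elasticsearch code base): a request is sent on a short-lived connection, the connection is closed immediately afterwards, and because the connection is already closed the response handler is never notified, so the caller only ever observes a timeout.

import java.util.concurrent.*;

public class SniffRaceSketch {

    // A short-lived "connection" that delivers its response on a background thread,
    // but only if it has not been closed in the meantime.
    static class ShortLivedConnection implements AutoCloseable {
        private final ExecutorService responseThread = Executors.newSingleThreadExecutor();
        private volatile boolean closed = false;

        CompletableFuture<String> sendClusterStateRequest() {
            CompletableFuture<String> response = new CompletableFuture<>();
            responseThread.submit(() -> {
                try {
                    Thread.sleep(50); // simulated network round trip
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (!closed) {
                    response.complete("cluster state");
                }
                // If the connection was closed first, the handler is never notified.
            });
            return response;
        }

        @Override
        public void close() {
            closed = true;
            responseThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        ShortLivedConnection connection = new ShortLivedConnection();
        CompletableFuture<String> response = connection.sendClusterStateRequest();
        connection.close(); // closed right after the request is sent
        try {
            System.out.println(response.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("request timed out, like the ReceiveTimeoutTransportException above");
        }
    }
}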
@nwadams

nwadams commented May 11, 2017

Thanks @tlrx. I'm not sure if you are also aware, but I also saw errors that looked like the following when I disabled sniffing.

20:38:44.935 [elasticsearch[_client_][generic][T#2]] DEBUG - failed to connect to discovered node [{i-0562d98cb14e42358}{Gzbd-MEzRo-OHMUoEajvXA}{x6V2--f3SS-NzVk5wAQQYg}{10.178.212.242}{127.0.0.1:4374}{aws_availability_zone=us-east-1a}]
ConnectTransportException[[i-0562d98cb14e42358][127.0.0.1:4374] handshake failed. unexpected remote node {i-01bae8d9b0f31ac54}{MUjAv_3JR5KmzEdn-eJeSA}{qJdTT_oaSRCJ1TLO1W2A6w}{10.158.100.27}{10.158.100.27:9300}{aws_availability_zone=us-east-1b}]
	at org.elasticsearch.transport.TransportService.lambda$connectToNode$3(TransportService.java:319)
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:466)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:315)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:302)
	at org.elasticsearch.client.transport.TransportClientNodesService$NodeSampler.validateNewNodes(TransportClientNodesService.java:374)
	at org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler.doSample(TransportClientNodesService.java:442)
	at org.elasticsearch.client.transport.TransportClientNodesService$NodeSampler.sample(TransportClientNodesService.java:358)
	at org.elasticsearch.client.transport.TransportClientNodesService$ScheduledNodeSampler.run(TransportClientNodesService.java:391)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

If it helps, we use a service discovery framework (https://medium.com/airbnb-engineering/smartstack-service-discovery-in-the-cloud-4b8a080de619). We "randomly" pick an ES box to connect to and then use sniffing (if enabled) to discover the rest. Even though ES is running on 9200/9300, we use a different port on our client machines because the service discovery framework does the correct routing. Both the service discovery port and the "direct access" port are reachable over the network.
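
For context, the client-side setup looks roughly like the earlier sketch, except that it points at the local service-discovery port seen in the handshake failure above (127.0.0.1:4374 comes from that log line, not from any configuration shared here):

import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class ProxyClientFactory {
    // Connects through the local service-discovery proxy rather than a node's own 9300 port.
    public static TransportClient create() throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "elasticsearch")   // assumption: default cluster name
                .put("client.transport.sniff", false)   // sniffing disabled, as in this comment
                .build();
        return new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("127.0.0.1"), 4374)); // proxy port from the log above
    }
}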

I am rolling back our transport client version to 5.3.2 and will report back on the results.
Update: 5.3.2 works great.

s1monw added a commit to s1monw/elasticsearch that referenced this issue May 12, 2017
…port handlers

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering it in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections, not just registered ones.

Relates to elastic#24632
Relates to elastic#24575
Relates to elastic#24557
@githubdoramon

Same here... 5.4.0 to 5.4.0 fails.... but 5.3.0 to 5.4.0 works

s1monw added a commit that referenced this issue May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering it in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections, not just registered ones.

Relates to #24632
Relates to #24575
Relates to #24557
s1monw added a commit that referenced this issue May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short living connection ie. without sharing the connection to a node
via registering in the transport itself. This change now moves to pruning based
on the connections cache key to ensure we notify handlers as soon as the connection
is closed for all connections not just for registered connections.

Relates to #24632
Relates to #24575
Relates to #24557
s1monw added a commit that referenced this issue May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short living connection ie. without sharing the connection to a node
via registering in the transport itself. This change now moves to pruning based
on the connections cache key to ensure we notify handlers as soon as the connection
is closed for all connections not just for registered connections.

Relates to #24632
Relates to #24575
Relates to #24557
tlrx added a commit that referenced this issue May 12, 2017
…24632)

With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is activated.

closes #24575
closes #24557
@jinnah79

Same here... 5.4.0 to 5.4.0 fails.... but 5.3.0 to 5.4.0 works

@debraj-manna

I am seeing a similar exception in 2.3.1. Below is the exception:

INFO [2017-08-08 20:14:18,019] [U:3,129,F:822,T:3,950,M:3,950] elasticsearch.client.transport:[TransportClientNodesService$SniffNodesSampler$1$1:handleException:455] - [elasticsearch[Edward "Ned" Buckman][generic][T#61]] - [Edward "Ned" Buckman] failed to get local cluster state for {#transport#-1}{127.0.0.1}{localhost/127.0.0.1:9300}, disconnecting...
ReceiveTimeoutTransportException[[][localhost/127.0.0.1:9300][cluster:monitor/state] request_id [341654] timed out after [5001ms]]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Is the issue not fixed in 2.3.1?

@tlrx
Member

tlrx commented Oct 9, 2017

Is the issue not fixed in 2.3.1?

I didn't test 2.3.1, since the fix addressed a bug introduced in #22828 for 5.4.0. It's possible that this bug exists in 2.3.1, but that version is EOL and no longer supported.

@debraj-manna

debraj-manna commented Oct 9, 2017 via email
