5.4.0 transport client failed to get local cluster state while using 5.3.0 to connect to 5.4.0 servers works #24575

Closed
fangqiao opened this issue May 10, 2017 · 11 comments · Fixed by #24632
Labels
>bug :Distributed Coordination/Network Http and internode communication implementations feedback_needed v5.4.0

Comments

@fangqiao

fangqiao commented May 10, 2017

Hi, I have a REST service built on Netty that connects to an Elasticsearch backend via the Java transport client API.
It worked very well with Netty 4.1.8 and ES 5.3.0.
Now I tried to upgrade the ES backend and transport client to 5.4.0, and also Netty to 4.1.9. Then the following problem happened:

10 May 2017;17:01:59.645 Developer linux-68qh [elasticsearch[client][generic][T#3]] INFO o.e.c.t.TransportClientNodesService - failed to get local cluster state for {#transport#-1}{WlTQjgcGQ1uqyNNsw4ZnAw}{127.0.0.1}{127.0.0.1:9300}, disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][127.0.0.1:9300][cluster:monitor/state] request_id [7] timed out after [5001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I rolled back the transport client to 5.3.0 but kept the backend at 5.4.0.

Then it is able to connect to the ES backend.
I use SBT, and the build dependencies when the error occurs are:

"io.netty" % "netty-all" % "4.1.9.Final"
"org.elasticsearch" % "elasticsearch" % "5.4.0"
"org.elasticsearch.client" % "transport" % "5.4.0",
and "io.netty" % "netty-transport-native-epoll" % "4.1.9.Final" classifier "linux-x86_64"

Environment:

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (IcedTea 3.3.0) (suse-3.3-x86_64)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

Linux linux-68qh 4.10.13-1-default #1 SMP PREEMPT Thu Apr 27 12:23:31 UTC 2017 (e5d11ce) x86_64 x86_64 x86_64 GNU/Linux

Thanks

@tlrx tlrx added :Distributed Coordination/Network Http and internode communication implementations >bug v5.4.0 labels May 11, 2017
@tlrx
Member

tlrx commented May 11, 2017

It looks like a bug to me. Is sniffing enabled on your transport client?

@fangqiao
Author

Yes it is enabled.

@jloisel

jloisel commented May 11, 2017

Same issue here. We have:

  • Spring Boot v1.5.3.RELEASE,
  • Switched from Elasticsearch 5.3.2 to 5.4.0,
  • using Transport Client with sniff enabled.

Client and Elasticsearch both on the same machine, connecting through localhost:

  • When using TransportClient 5.3.2 to connect to Elastic 5.4.0 => OK,
  • 5.4.0 to 5.4.0 => KO.

The exception we have on startup:

org.elasticsearch.transport.ReceiveTimeoutTransportException: [][127.0.0.1:9300][cluster:monitor/state] request_id [7] timed out after [5000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]

@nwadams

nwadams commented May 11, 2017

I am seeing this issue as well on some nodes connecting to ES. We run a service with multiple machines that each connect to ES; some of them connect successfully and others do not.
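
One way to see which nodes the client actually ended up connected to on a given machine is to log TransportClient#connectedNodes() after startup; on the hosts that fail, this shows whether the sampler kept any nodes at all. A minimal sketch (the class and method names below are made up for illustration):

import java.util.List;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.cluster.node.DiscoveryNode;

public final class ClientDiagnostics {
    // Prints the nodes the transport client is currently connected to.
    public static void logConnectedNodes(TransportClient client) {
        List<DiscoveryNode> nodes = client.connectedNodes();
        if (nodes.isEmpty()) {
            System.out.println("transport client is not connected to any node");
        }
        for (DiscoveryNode node : nodes) {
            System.out.println(node.getId() + " -> " + node.getAddress());
        }
    }
}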

@tlrx
Member

tlrx commented May 11, 2017

Thanks for reporting, I think I know where the issue is.

tlrx added a commit to tlrx/elasticsearch that referenced this issue May 11, 2017
With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is activated.

closes elastic#24575
closes elastic#24557
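
To illustrate the failure mode the commit message describes, here is a simplified, hypothetical sketch (none of these classes exist in the Elasticsearch code base): a request is sent on a short-lived connection, the connection is closed immediately afterwards, and because the connection is already closed the response handler is never notified, so the caller only ever observes a timeout.

import java.util.concurrent.*;

public class SniffRaceSketch {

    // A short-lived "connection" that delivers its response on a background thread,
    // but only if it has not been closed in the meantime.
    static class ShortLivedConnection implements AutoCloseable {
        private final ExecutorService responseThread = Executors.newSingleThreadExecutor();
        private volatile boolean closed = false;

        CompletableFuture<String> sendClusterStateRequest() {
            CompletableFuture<String> response = new CompletableFuture<>();
            responseThread.submit(() -> {
                try {
                    Thread.sleep(50); // simulated network round trip
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (!closed) {
                    response.complete("cluster state");
                }
                // If the connection was closed first, the handler is never notified.
            });
            return response;
        }

        @Override
        public void close() {
            closed = true;
            responseThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        ShortLivedConnection connection = new ShortLivedConnection();
        CompletableFuture<String> response = connection.sendClusterStateRequest();
        connection.close(); // closed right after the request is sent
        try {
            System.out.println(response.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("request timed out, like the ReceiveTimeoutTransportException above");
        }
    }
}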
@nwadams

nwadams commented May 11, 2017

Thanks @tlrx. I'm not sure if you are also aware, but I also saw errors that looked like the following when I disabled sniffing.

20:38:44.935 [elasticsearch[_client_][generic][T#2]] DEBUG - failed to connect to discovered node [{i-0562d98cb14e42358}{Gzbd-MEzRo-OHMUoEajvXA}{x6V2--f3SS-NzVk5wAQQYg}{10.178.212.242}{127.0.0.1:4374}{aws_availability_zone=us-east-1a}]
ConnectTransportException[[i-0562d98cb14e42358][127.0.0.1:4374] handshake failed. unexpected remote node {i-01bae8d9b0f31ac54}{MUjAv_3JR5KmzEdn-eJeSA}{qJdTT_oaSRCJ1TLO1W2A6w}{10.158.100.27}{10.158.100.27:9300}{aws_availability_zone=us-east-1b}]
	at org.elasticsearch.transport.TransportService.lambda$connectToNode$3(TransportService.java:319)
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:466)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:315)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:302)
	at org.elasticsearch.client.transport.TransportClientNodesService$NodeSampler.validateNewNodes(TransportClientNodesService.java:374)
	at org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler.doSample(TransportClientNodesService.java:442)
	at org.elasticsearch.client.transport.TransportClientNodesService$NodeSampler.sample(TransportClientNodesService.java:358)
	at org.elasticsearch.client.transport.TransportClientNodesService$ScheduledNodeSampler.run(TransportClientNodesService.java:391)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

If it helps, we use a service discovery framework (https://medium.com/airbnb-engineering/smartstack-service-discovery-in-the-cloud-4b8a080de619). We "randomly" pick an ES box to connect to and then use sniffing (if enabled) to discover the rest. Even though ES is running on 9200/9300, we use a different port on our client machines because the service discovery framework does the correct routing. Both the service discovery port and the "direct access" port are reachable over the network.
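
For context, the client-side setup looks roughly like the earlier sketch, except that it points at the local service-discovery port seen in the handshake failure above (127.0.0.1:4374 comes from that log line, not from any configuration shared here):

import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class ProxyClientFactory {
    // Connects through the local service-discovery proxy rather than a node's own 9300 port.
    public static TransportClient create() throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "elasticsearch")   // assumption: default cluster name
                .put("client.transport.sniff", false)   // sniffing disabled, as in this comment
                .build();
        return new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("127.0.0.1"), 4374)); // proxy port from the log above
    }
}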

I am rolling back our transport client version to 5.3.2 and will report back on the results.
Update: 5.3.2 works great.

s1monw added a commit to s1monw/elasticsearch that referenced this issue May 12, 2017
…port handlers

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering it in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections, not just registered ones.

Relates to elastic#24632
Relates to elastic#24575
Relates to elastic#24557
@githubdoramon

Same here... 5.4.0 to 5.4.0 fails.... but 5.3.0 to 5.4.0 works

s1monw added a commit that referenced this issue May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering it in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections, not just registered ones.

Relates to #24632
Relates to #24575
Relates to #24557
s1monw added a commit that referenced this issue May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short living connection ie. without sharing the connection to a node
via registering in the transport itself. This change now moves to pruning based
on the connections cache key to ensure we notify handlers as soon as the connection
is closed for all connections not just for registered connections.

Relates to #24632
Relates to #24575
Relates to #24557
s1monw added a commit that referenced this issue May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short living connection ie. without sharing the connection to a node
via registering in the transport itself. This change now moves to pruning based
on the connections cache key to ensure we notify handlers as soon as the connection
is closed for all connections not just for registered connections.

Relates to #24632
Relates to #24575
Relates to #24557
tlrx added a commit that referenced this issue May 12, 2017
…24632)

With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is activated.

closes #24575
closes #24557
@jinnah79

Same here... 5.4.0 to 5.4.0 fails.... but 5.3.0 to 5.4.0 works

@debraj-manna

I am seeing a similar exception in 2.3.1. Below is the exception:

INFO [2017-08-08 20:14:18,019] [U:3,129,F:822,T:3,950,M:3,950] elasticsearch.client.transport:[TransportClientNodesService$SniffNodesSampler$1$1:handleException:455] - [elasticsearch[Edward "Ned" Buckman][generic][T#61]] - [Edward "Ned" Buckman] failed to get local cluster state for {#transport#-1}{127.0.0.1}{localhost/127.0.0.1:9300}, disconnecting...
ReceiveTimeoutTransportException[[][localhost/127.0.0.1:9300][cluster:monitor/state] request_id [341654] timed out after [5001ms]]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Is the issue not fixed in 2.3.1?

@tlrx
Member

tlrx commented Oct 9, 2017

Is the issue not fixed in 2.3.1?

I didn't test 2.3.1, since the fix addressed a bug introduced in #22828 for 5.4.0. It's possible that this bug exists in 2.3.1, but that version is EOL and no longer supported.

@debraj-manna

debraj-manna commented Oct 9, 2017 via email
