Elasticsearch Transport Client fails to recover connection after cluster restart #24557

Closed
merlob opened this issue May 9, 2017 · 1 comment · Fixed by #24632
Labels
>bug :Distributed Coordination/Network Http and internode communication implementations v5.4.0

Comments

merlob commented May 9, 2017

Elasticsearch version:
5.4.0

Plugins installed:
Node

JVM version:
1.8.0_102

OS version:
Linux globevm 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I'm using Elasticsearch 5.4.0 with the Transport Client and see the following problems:

  • The application must be launched several times before it can connect to the cluster.
  • Once a connection is established, it is not recovered after a cluster restart, and requests fail with this stack trace:
NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{s-HKw1m4S9aMgCkx5iBuYg}{192.168.203.128}{192.168.203.128:9500}]]
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:348)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:246)
at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:59)
at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:366)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408)
at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:730)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:80)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:54)
at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:69)

The Transport Client is configured as follows:

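        import java.net.InetAddress;

        import org.elasticsearch.common.settings.Settings;
        import org.elasticsearch.common.transport.InetSocketTransportAddress;
        import org.elasticsearch.transport.client.PreBuiltTransportClient;

        // Client settings; sniffing is enabled here.
        Settings.Builder settingsBuilder = Settings.builder();
        settingsBuilder.put("cluster.name", "globevmes5");
        settingsBuilder.put("client.transport.sniff", true);

        // "client" is declared elsewhere in the application (not shown).
        client = new PreBuiltTransportClient(settingsBuilder.build());
        try {
            // Connect to the node's transport port (not the HTTP port).
            client.addTransportAddress(
                    new InetSocketTransportAddress(InetAddress.getByName("192.168.203.128"), 9500));
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }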

Elastic node configuration:

  • network.host: 192.168.203.128
  • http.port: 9400
  • transport.profiles.default.port: 9500-9600

With the previous version, Elasticsearch 5.3.2, it worked fine.
Setting "client.transport.sniff" to false also works fine.

Provide logs:
Elastic node log:

[2017-05-09T11:14:24,336][WARN ][o.e.b.Natives            ] unable to load JNA native support library, native methods will be disabled.
java.lang.UnsatisfiedLinkError: /tmp/jna--1077556979/jna7634687564598757394.tmp: /lib64/libc.so.6: version `GLIBC_2.7' not found (required by /tmp/jna--1077556979/jna7634687564598757394.tmp)
        at java.lang.ClassLoader$NativeLibrary.load(Native Method) ~[?:1.8.0_102]
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941) ~[?:1.8.0_102]
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824) ~[?:1.8.0_102]
        at java.lang.Runtime.load0(Runtime.java:809) ~[?:1.8.0_102]
        at java.lang.System.load(System.java:1086) ~[?:1.8.0_102]
        at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:947) ~[jna-4.4.0.jar:4.4.0 (b0)]
        at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:922) ~[jna-4.4.0.jar:4.4.0 (b0)]
        at com.sun.jna.Native.<clinit>(Native.java:190) ~[jna-4.4.0.jar:4.4.0 (b0)]
        at java.lang.Class.forName0(Native Method) ~[?:1.8.0_102]
        at java.lang.Class.forName(Class.java:264) ~[?:1.8.0_102]
        at org.elasticsearch.bootstrap.Natives.<clinit>(Natives.java:45) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:105) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:204) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:360) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) [elasticsearch-5.4.0.jar:5.4.0]
[2017-05-09T11:14:24,342][WARN ][o.e.b.Natives            ] cannot check if running as root because JNA is not available
[2017-05-09T11:14:24,342][WARN ][o.e.b.Natives            ] cannot register console handler because JNA is not available
[2017-05-09T11:14:24,344][WARN ][o.e.b.Natives            ] cannot getrlimit RLIMIT_NPROC because JNA is not available
[2017-05-09T11:14:24,344][WARN ][o.e.b.Natives            ] cannot getrlimit RLIMIT_AS beacuse JNA is not available
[2017-05-09T11:14:24,493][INFO ][o.e.n.Node               ] [globevmes5-node] initializing ...
[2017-05-09T11:14:24,615][INFO ][o.e.e.NodeEnvironment    ] [globevmes5-node] using [1] data paths, mounts [[/methode (/dev/mapper/VolGroup01-LogVol02)]], net usable_space [8.1gb], net total_space [72.8gb], spins? [possibly], types [ext3]
[2017-05-09T11:14:24,615][INFO ][o.e.e.NodeEnvironment    ] [globevmes5-node] heap size [1007.3mb], compressed ordinary object pointers [true]
[2017-05-09T11:14:24,659][INFO ][o.e.n.Node               ] [globevmes5-node] node name [globevmes5-node], node ID [M1_iHcSKRX6wHkD_Va0uDg]
[2017-05-09T11:14:24,659][INFO ][o.e.n.Node               ] [globevmes5-node] version[5.4.0], pid[31204], build[780f8c4/2017-04-28T17:43:27.229Z], OS[Linux/2.6.18-194.el5/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_102/25.102-b14]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [aggs-matrix-stats]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [ingest-common]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-expression]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-groovy]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-mustache]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-painless]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [percolator]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [reindex]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [transport-netty3]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [transport-netty4]
[2017-05-09T11:14:26,982][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded plugin [eom-elasticsearch-plugin]
[2017-05-09T11:14:28,406][INFO ][o.e.d.DiscoveryModule    ] [globevmes5-node] using discovery type [zen]
[2017-05-09T11:14:28,972][INFO ][o.e.n.Node               ] [globevmes5-node] initialized
[2017-05-09T11:14:28,972][INFO ][o.e.n.Node               ] [globevmes5-node] starting ...
[2017-05-09T11:14:29,100][INFO ][o.e.t.TransportService   ] [globevmes5-node] publish_address {192.168.203.128:9500}, bound_addresses {192.168.203.128:9500}
[2017-05-09T11:14:29,106][INFO ][o.e.b.BootstrapChecks    ] [globevmes5-node] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-05-09T11:14:32,227][INFO ][o.e.c.s.ClusterService   ] [globevmes5-node] new_master {globevmes5-node}{M1_iHcSKRX6wHkD_Va0uDg}{j8GR_IGFSfW2y502jd-SKA}{192.168.203.128}{192.168.203.128:9500}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-05-09T11:14:32,379][INFO ][o.e.h.n.Netty4HttpServerTransport] [globevmes5-node] publish_address {192.168.203.128:9400}, bound_addresses {192.168.203.128:9400}
[2017-05-09T11:14:32,386][INFO ][o.e.n.Node               ] [globevmes5-node] started
[2017-05-09T11:14:32,590][INFO ][o.e.g.GatewayService     ] [globevmes5-node] recovered [6] indices into cluster_state
[2017-05-09T11:14:33,173][INFO ][o.e.c.r.a.AllocationService] [globevmes5-node] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.kibana][0]] ...]).
[2017-05-09T11:15:02,357][INFO ][o.e.c.r.a.DiskThresholdMonitor] [globevmes5-node] low disk watermark [85%] exceeded on [M1_iHcSKRX6wHkD_Va0uDg][globevmes5-node][/methode/meth01/mnt/elasticsearch-5.4.0/data/nodes/0] free: 8.1gb[11.1%], replicas will not be assigned to this node 
tlrx added :Distributed Coordination/Network Http and internode communication implementations >bug v5.4.0 labels May 11, 2017
Member

tlrx commented May 11, 2017

This looks like a bug to me: TransportClient's cluster state requests timed out in my local tests. It seems that some requests hang, maybe because of a concurrent disconnection or a Netty issue. @jasontedor or @bleskes, can you have a look?

tlrx added a commit to tlrx/elasticsearch that referenced this issue May 11, 2017
With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is enabled.

closes elastic#24575
closes elastic#24557
s1monw added a commit to s1monw/elasticsearch that referenced this issue May 12, 2017
…port handlers

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections and not just for registered connections.

Relates to elastic#24632
Relates to elastic#24575
Relates to elastic#24557
s1monw added a commit that referenced this issue May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections and not just for registered connections.

Relates to #24632
Relates to #24575
Relates to #24557
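
To illustrate the idea in the commit message above, here is a minimal, self-contained sketch of pruning pending response handlers by connection key when a connection closes. This is not the actual TransportService code: the class HandlerPruningSketch, its methods, and the string keys are hypothetical names chosen for illustration, and only the JDK is used.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not Elasticsearch code: pending handlers are tracked
// together with the key of the connection their request was sent on.
class HandlerPruningSketch {

    private static final class PendingHandler {
        final Object connectionKey;
        final CompletableFuture<String> response = new CompletableFuture<>();

        PendingHandler(Object connectionKey) {
            this.connectionKey = connectionKey;
        }
    }

    private final Map<Long, PendingHandler> pending = new ConcurrentHashMap<>();
    private final AtomicLong requestIds = new AtomicLong();

    // Registers a handler for a request sent over the given connection.
    CompletableFuture<String> registerHandler(Object connectionKey) {
        PendingHandler handler = new PendingHandler(connectionKey);
        pending.put(requestIds.incrementAndGet(), handler);
        return handler.response;
    }

    // Called when any connection closes, whether or not it was registered
    // as a shared node connection. Fails every handler waiting on it.
    int onConnectionClosed(Object connectionKey) {
        int pruned = 0;
        Iterator<PendingHandler> it = pending.values().iterator();
        while (it.hasNext()) {
            PendingHandler handler = it.next();
            if (handler.connectionKey.equals(connectionKey)) {
                it.remove();
                handler.response.completeExceptionally(
                        new IOException("connection closed: " + connectionKey));
                pruned++;
            }
        }
        return pruned;
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        HandlerPruningSketch service = new HandlerPruningSketch();
        CompletableFuture<String> shortLived = service.registerHandler("short-lived-connection");
        CompletableFuture<String> registered = service.registerHandler("registered-connection");

        int pruned = service.onConnectionClosed("short-lived-connection");
        System.out.println("pruned=" + pruned
                + " failed=" + shortLived.isCompletedExceptionally()
                + " otherStillPending=" + !registered.isDone());
    }
}
```

Keying the lookup on the connection rather than on the node means a handler waiting on a short-lived connection is failed as soon as that connection goes away, instead of waiting for a node-level disconnect event that may never fire for it.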
tlrx added a commit that referenced this issue May 12, 2017
…24632)

With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is enabled.

closes #24575
closes #24557
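
The ordering problem described in this commit message can be reproduced in miniature with plain JDK classes. The sketch below is only an illustration under stated assumptions: FakeConnection, closeRightAfterSend, and closeAfterResponse are hypothetical names, the "network" is a scheduled executor, and none of this is the real SniffNodesSampler code. It shows that closing right after sending loses the response and the caller times out, while closing from the response callback does not.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not Elasticsearch code.
class SniffCloseOrderingSketch {

    // Stand-in for a short-lived connection: a reply arrives after a delay
    // and is dropped if the connection has been closed in the meantime.
    static final class FakeConnection {
        private final ScheduledExecutorService network =
                Executors.newSingleThreadScheduledExecutor();
        private volatile boolean closed;

        CompletableFuture<String> send(String request, long replyDelayMillis) {
            CompletableFuture<String> response = new CompletableFuture<>();
            network.schedule(() -> {
                if (!closed) {
                    response.complete("response to " + request);
                }
            }, replyDelayMillis, TimeUnit.MILLISECONDS);
            return response;
        }

        void close() {
            closed = true;
            // Already scheduled replies still run, but they observe closed == true.
            network.shutdown();
        }
    }

    // Problematic ordering: close immediately after the request is sent.
    static boolean closeRightAfterSend(long timeoutMillis) {
        FakeConnection connection = new FakeConnection();
        CompletableFuture<String> response = connection.send("cluster-state", 50);
        connection.close();
        return responseArrives(response, timeoutMillis);
    }

    // Safer ordering: close only once the response (or a failure) was handled.
    static boolean closeAfterResponse(long timeoutMillis) {
        FakeConnection connection = new FakeConnection();
        CompletableFuture<String> response = connection.send("cluster-state", 50)
                .whenComplete((result, error) -> connection.close());
        return responseArrives(response, timeoutMillis);
    }

    // Returns true if the response completes normally within the timeout.
    static boolean responseArrives(CompletableFuture<String> response, long timeoutMillis) {
        try {
            response.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("close right after send, response arrives: "
                + closeRightAfterSend(200));
        System.out.println("close after response, response arrives: "
                + closeAfterResponse(2000));
    }
}
```

In this toy model the first call times out and the second succeeds, which mirrors the symptom reported in this issue when sniffing is enabled.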