Elasticsearch Transport Client fails to recover connection after cluster restart #24557

merlob · 2017-05-09T10:13:37Z

Elasticsearch version:
5.4.0

Plugins installed:
Node

JVM version:
1.8.0_102

OS version:
Linux globevm 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I'm using Elasticsearch 5.4.0 with the Transport Client and see the following problems:

The application must be launched several times before it can connect to the cluster.
Once a connection is established, it is not recovered after a cluster restart, and requests fail with this stack trace:

NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{s-HKw1m4S9aMgCkx5iBuYg}{192.168.203.128}{192.168.203.128:9500}]]
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:348)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:246)
at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:59)
at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:366)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408)
at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:730)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:80)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:54)
at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:69)

The Transport Client is configured as follows:

        Settings.Builder settingsBuilder = Settings.builder();
        settingsBuilder.put("cluster.name", "globevmes5");
        settingsBuilder.put("client.transport.sniff", true);

        client = new PreBuiltTransportClient(settingsBuilder.build());
        try {
            client.addTransportAddress(
                new InetSocketTransportAddress(InetAddress.getByName("192.168.203.128"), 9500));
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }

Elastic node configuration:

network.host: 192.168.203.128
http.port: 9400
transport.profiles.default.port: 9500-9600

With the previous version, Elasticsearch 5.3.2, the same setup worked fine.
With Elasticsearch 5.4.0, setting "client.transport.sniff" to false also works fine.

Provide logs:
Elastic node log:

[2017-05-09T11:14:24,336][WARN ][o.e.b.Natives            ] unable to load JNA native support library, native methods will be disabled.
java.lang.UnsatisfiedLinkError: /tmp/jna--1077556979/jna7634687564598757394.tmp: /lib64/libc.so.6: version `GLIBC_2.7' not found (required by /tmp/jna--1077556979/jna7634687564598757394.tmp)
        at java.lang.ClassLoader$NativeLibrary.load(Native Method) ~[?:1.8.0_102]
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941) ~[?:1.8.0_102]
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824) ~[?:1.8.0_102]
        at java.lang.Runtime.load0(Runtime.java:809) ~[?:1.8.0_102]
        at java.lang.System.load(System.java:1086) ~[?:1.8.0_102]
        at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:947) ~[jna-4.4.0.jar:4.4.0 (b0)]
        at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:922) ~[jna-4.4.0.jar:4.4.0 (b0)]
        at com.sun.jna.Native.<clinit>(Native.java:190) ~[jna-4.4.0.jar:4.4.0 (b0)]
        at java.lang.Class.forName0(Native Method) ~[?:1.8.0_102]
        at java.lang.Class.forName(Class.java:264) ~[?:1.8.0_102]
        at org.elasticsearch.bootstrap.Natives.<clinit>(Natives.java:45) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:105) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:204) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:360) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) [elasticsearch-5.4.0.jar:5.4.0]
[2017-05-09T11:14:24,342][WARN ][o.e.b.Natives            ] cannot check if running as root because JNA is not available
[2017-05-09T11:14:24,342][WARN ][o.e.b.Natives            ] cannot register console handler because JNA is not available
[2017-05-09T11:14:24,344][WARN ][o.e.b.Natives            ] cannot getrlimit RLIMIT_NPROC because JNA is not available
[2017-05-09T11:14:24,344][WARN ][o.e.b.Natives            ] cannot getrlimit RLIMIT_AS beacuse JNA is not available
[2017-05-09T11:14:24,493][INFO ][o.e.n.Node               ] [globevmes5-node] initializing ...
[2017-05-09T11:14:24,615][INFO ][o.e.e.NodeEnvironment    ] [globevmes5-node] using [1] data paths, mounts [[/methode (/dev/mapper/VolGroup01-LogVol02)]], net usable_space [8.1gb], net total_space [72.8gb], spins? [possibly], types [ext3]
[2017-05-09T11:14:24,615][INFO ][o.e.e.NodeEnvironment    ] [globevmes5-node] heap size [1007.3mb], compressed ordinary object pointers [true]
[2017-05-09T11:14:24,659][INFO ][o.e.n.Node               ] [globevmes5-node] node name [globevmes5-node], node ID [M1_iHcSKRX6wHkD_Va0uDg]
[2017-05-09T11:14:24,659][INFO ][o.e.n.Node               ] [globevmes5-node] version[5.4.0], pid[31204], build[780f8c4/2017-04-28T17:43:27.229Z], OS[Linux/2.6.18-194.el5/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_102/25.102-b14]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [aggs-matrix-stats]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [ingest-common]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-expression]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-groovy]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-mustache]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [lang-painless]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [percolator]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [reindex]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [transport-netty3]
[2017-05-09T11:14:26,981][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded module [transport-netty4]
[2017-05-09T11:14:26,982][INFO ][o.e.p.PluginsService     ] [globevmes5-node] loaded plugin [eom-elasticsearch-plugin]
[2017-05-09T11:14:28,406][INFO ][o.e.d.DiscoveryModule    ] [globevmes5-node] using discovery type [zen]
[2017-05-09T11:14:28,972][INFO ][o.e.n.Node               ] [globevmes5-node] initialized
[2017-05-09T11:14:28,972][INFO ][o.e.n.Node               ] [globevmes5-node] starting ...
[2017-05-09T11:14:29,100][INFO ][o.e.t.TransportService   ] [globevmes5-node] publish_address {192.168.203.128:9500}, bound_addresses {192.168.203.128:9500}
[2017-05-09T11:14:29,106][INFO ][o.e.b.BootstrapChecks    ] [globevmes5-node] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-05-09T11:14:32,227][INFO ][o.e.c.s.ClusterService   ] [globevmes5-node] new_master {globevmes5-node}{M1_iHcSKRX6wHkD_Va0uDg}{j8GR_IGFSfW2y502jd-SKA}{192.168.203.128}{192.168.203.128:9500}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-05-09T11:14:32,379][INFO ][o.e.h.n.Netty4HttpServerTransport] [globevmes5-node] publish_address {192.168.203.128:9400}, bound_addresses {192.168.203.128:9400}
[2017-05-09T11:14:32,386][INFO ][o.e.n.Node               ] [globevmes5-node] started
[2017-05-09T11:14:32,590][INFO ][o.e.g.GatewayService     ] [globevmes5-node] recovered [6] indices into cluster_state
[2017-05-09T11:14:33,173][INFO ][o.e.c.r.a.AllocationService] [globevmes5-node] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.kibana][0]] ...]).
[2017-05-09T11:15:02,357][INFO ][o.e.c.r.a.DiskThresholdMonitor] [globevmes5-node] low disk watermark [85%] exceeded on [M1_iHcSKRX6wHkD_Va0uDg][globevmes5-node][/methode/meth01/mnt/elasticsearch-5.4.0/data/nodes/0] free: 8.1gb[11.1%], replicas will not be assigned to this node

tlrx · 2017-05-11T10:34:40Z

This looks like a bug to me: the TransportClient's cluster state requests timed out in my local tests. It seems that some requests hang, maybe because of a concurrent disconnection or a Netty issue. @jasontedor or @bleskes, can you have a look?

With the current implementation, SniffNodesSampler might close the current connection right after a request is sent but before the response is correctly handled. This causes timeouts in the transport client when sniffing is enabled. closes elastic#24575 closes elastic#24557
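The ordering problem described in this commit message can be illustrated with a small, self-contained sketch. It uses only the JDK and does not touch Elasticsearch code: `FakeConnection`, `closeImmediately`, and `closeAfterResponse` are hypothetical names invented for this illustration, and the real SniffNodesSampler logic is more involved.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SniffCloseOrdering {

    /** Hypothetical stand-in for a transport connection; not an Elasticsearch class. */
    static final class FakeConnection {
        private final CompletableFuture<String> pending = new CompletableFuture<>();
        private volatile boolean closed;

        /** "Sends" a request and returns a future completed when the response arrives. */
        CompletableFuture<String> send() {
            return pending;
        }

        /** Simulates the response arriving; it is dropped if the connection is already closed. */
        void deliver(String response) {
            if (!closed) {
                pending.complete(response);
            }
        }

        void close() {
            closed = true;
        }
    }

    /** Problematic ordering: close right after sending, before the response is handled. */
    static boolean closeImmediately() {
        FakeConnection connection = new FakeConnection();
        CompletableFuture<String> response = connection.send();
        connection.close();
        connection.deliver("cluster state");
        return arrivedInTime(response);
    }

    /** Safer ordering: close only once the response (or a failure) has been handled. */
    static boolean closeAfterResponse() {
        FakeConnection connection = new FakeConnection();
        CompletableFuture<String> response =
            connection.send().whenComplete((result, error) -> connection.close());
        connection.deliver("cluster state");
        return arrivedInTime(response);
    }

    /** Waits briefly for the response; a timeout models the hang seen by the client. */
    private static boolean arrivedInTime(CompletableFuture<String> response) {
        try {
            response.get(100, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("close immediately: response handled = " + closeImmediately());
        System.out.println("close after response: response handled = " + closeAfterResponse());
    }
}
```

In this model, `closeImmediately()` returns false because the response is dropped and the wait times out, while `closeAfterResponse()` returns true. The sketch only shows why the close has to be tied to response handling; it makes no claim about how the actual fix is implemented.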

…port handlers Today we prune transport handlers in TransportService when a node is disconnected. This can cause connections to starve in the TransportService if the connection is opened as a short-lived connection, i.e. without sharing the connection to a node via registering in the transport itself. This change now moves to pruning based on the connection's cache key to ensure we notify handlers as soon as the connection is closed, for all connections and not just for registered connections. Relates to elastic#24632 Relates to elastic#24575 Relates to elastic#24557
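As a rough illustration of the difference between pruning on node disconnect and pruning by connection key, here is a minimal JDK-only sketch. All names (`HandlerPruningSketch`, `Pending`, `closeByNode`, `closeByConnection`, `notifiedAfter`) are assumptions made for this example and do not correspond to the real TransportService API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class HandlerPruningSketch {

    /** A pending response handler; illustrative only, not an Elasticsearch class. */
    static final class Pending {
        final String nodeId;
        final String connectionKey;
        boolean notified;

        Pending(String nodeId, String connectionKey) {
            this.nodeId = nodeId;
            this.connectionKey = connectionKey;
        }
    }

    private final List<Pending> pending = new ArrayList<>();
    private final Set<String> registeredNodes = new HashSet<>();

    /** Marks a node as registered, i.e. its connection is shared via the transport. */
    void registerNode(String nodeId) {
        registeredNodes.add(nodeId);
    }

    /** Records a handler waiting for a response on the given connection. */
    Pending send(String nodeId, String connectionKey) {
        Pending handler = new Pending(nodeId, connectionKey);
        pending.add(handler);
        return handler;
    }

    /** Node-based pruning: only fires for registered nodes, so short-lived connections are missed. */
    void closeByNode(String nodeId) {
        if (!registeredNodes.remove(nodeId)) {
            return; // no disconnect event for a node that was never registered
        }
        prune(handler -> handler.nodeId.equals(nodeId));
    }

    /** Connection-based pruning: fires for every closed connection, registered or not. */
    void closeByConnection(String connectionKey) {
        prune(handler -> handler.connectionKey.equals(connectionKey));
    }

    private void prune(Predicate<Pending> matches) {
        Iterator<Pending> it = pending.iterator();
        while (it.hasNext()) {
            Pending handler = it.next();
            if (matches.test(handler)) {
                handler.notified = true; // tell the waiting caller the connection is gone
                it.remove();
            }
        }
    }

    /** Opens a short-lived (unregistered) connection, closes it, and reports whether the handler was notified. */
    static boolean notifiedAfter(boolean pruneByConnectionKey) {
        HandlerPruningSketch service = new HandlerPruningSketch();
        Pending handler = service.send("node-1", "conn-A");
        if (pruneByConnectionKey) {
            service.closeByConnection("conn-A");
        } else {
            service.closeByNode("node-1");
        }
        return handler.notified;
    }

    public static void main(String[] args) {
        System.out.println("prune by node: notified = " + notifiedAfter(false));
        System.out.println("prune by connection key: notified = " + notifiedAfter(true));
    }
}
```

With node-based pruning the handler on the unregistered connection is never notified, which models the starvation described above; with connection-key pruning it is notified as soon as the connection closes.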

tlrx added the :Distributed Coordination/Network (Http and internode communication implementations), >bug, and v5.4.0 labels May 11, 2017

tlrx mentioned this issue May 11, 2017

SniffNodesSampler should close connection after handling responses #24632

Merged

s1monw mentioned this issue May 12, 2017

Notify onConnectionClosed rather than onNodeDisconnect to prune transport handlers #24639

Merged

tlrx closed this as completed in #24632 May 12, 2017
