
OOM due to large number of requests in TransportService.clientHandlers #50241

Closed
njustyq opened this issue Dec 16, 2019 · 8 comments

@njustyq

njustyq commented Dec 16, 2019

Describe the feature:

Elasticsearch version (bin/elasticsearch --version): 6.3.2

Plugins installed: [repository-hdfs]

JVM version (java -version): 10.0.2

OS version (uname -a if on a Unix-like system): CentOS 7.2

Description of the problem including expected versus actual behavior:
Our cluster has 24 data nodes (31 GB heap, 4 × 1.7 TB SSD disks each) and 3 master nodes (8 GB heap), handling about 8000 TPS of writes. We rolling-upgraded the cluster (6.3.2 to 6.3.2) with the following steps (a sketch of the corresponding settings calls follows the list):
① set allocation to none
② restart the data node
③ set allocation to all
then wait for the cluster health to go from yellow back to green.
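
For reference, a minimal sketch of steps ① and ③ as cluster-settings calls; the host/port and curl invocations here are illustrative placeholders rather than the exact commands we ran:

```sh
# ① disable shard allocation before restarting a data node
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ② restart the data node, then
# ③ re-enable shard allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'

# wait for the cluster health to return to green
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s"
```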
After I finished upgrading part of the data nodes and had waited a while, I found the master node doing old-generation GCs, and it later ran out of memory (OOM).
So I loaded the heap dump from the master node into Eclipse MemoryAnalyzer and found that 87.57% of the memory is used by the TransportService.clientHandlers hash map;
most of the RequestHolder entries hold actions like indices:monitor/stats[n], indices:monitor/recovery[n], or cluster:monitor/stats[n]. Below are the heap dump screenshots:
[screenshots: heap dump from Eclipse MemoryAnalyzer]

I also used the OQL query `SELECT toString(action) FROM org.elasticsearch.transport.TransportService$RequestHolder` to count the actions; the result is as follows:

[screenshot: per-action counts from the OQL query]

So, is there a bug here, or is the master just overloaded?

@njustyq (Author)

njustyq commented Dec 17, 2019

@elasticmachine Can someone check it out for me?

@njustyq (Author)

njustyq commented Dec 17, 2019

@jimczi Could you please take a look at it for me?

@elasticmachine (Collaborator)

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@jasontedor (Member)

@njustyq Please stop pinging people directly. Someone will look at your issue, but, so that expectations are clear, there is no implicit or explicit SLA here.

@njustyq (Author)

njustyq commented Dec 17, 2019

@jasontedor Sorry, I'm just a little anxious. Thanks for your attention to my issue.

@jakelandis (Contributor)

@njustyq The master node is responsible for collecting all of the data needed to push to monitoring. I suspect that your environment has a very large number of shards (the usual culprit for high memory usage for monitoring with indices:monitor/stats) and/or a large number of other things that the collectors need to collect: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/es-monitoring-collectors.html.

I think this may be a case where more memory is needed for the master, or where there need to be fewer things in your cluster (i.e. shards) to monitor.

Newer versions of the Elastic Stack (6.5+) allow you to turn off the collectors and let Metricbeat do the work, which I believe would help here: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/configuring-metricbeat.html
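
For reference, a minimal sketch of the relevant settings (assuming the `xpack.monitoring.collection.enabled` dynamic cluster setting from the 6.x monitoring docs and the `xpack.monitoring.elasticsearch.collection.enabled` setting described in the Metricbeat guide linked above; please verify both against your exact version):

```sh
# Turn off the internal monitoring collectors entirely
# (dynamic cluster setting, available from roughly 6.3 onward).
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": { "xpack.monitoring.collection.enabled": false }
}'

# When Metricbeat takes over collection (6.5+), the guide above sets this
# to stop the default collection of Elasticsearch monitoring metrics:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": { "xpack.monitoring.elasticsearch.collection.enabled": false }
}'
```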

I am going to close this issue, as I can not find anything actionable for a bug or enhancement change.

@njustyq (Author)

njustyq commented Dec 19, 2019

> @njustyq The master node is responsible for collecting all of the data needed to push to monitoring. I suspect that your environment has a very large number of shards (the usual culprit for high memory usage for monitoring with indices:monitor/stats) and/or a large number of other things that the collectors need to collect: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/es-monitoring-collectors.html.
>
> I think this may be a case where more memory is needed for the master, or where there need to be fewer things in your cluster (i.e. shards) to monitor.
>
> Newer versions of the Elastic Stack (6.5+) allow you to turn off the collectors and let Metricbeat do the work, which I believe would help here: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/configuring-metricbeat.html
>
> I am going to close this issue, as I can not find anything actionable for a bug or enhancement change.

I checked the network between the master node and the data nodes, and it was OK.
Why have so many RequestHolder entries accumulated in clientHandlers for actions like indices:monitor/stats[n], indices:monitor/recovery[n], or cluster:monitor/stats[n]?

When I checked the log on the master node, there were many collector-timeout entries like "[ERROR][o.e.x.m.c.i.IndexStatsCollector] [Master_9.38.149.191_0] collector [index-stats] timed out when collecting data" and "ReceiveTimeoutTransportException: [Data238_0][9.38.1.238:1110][cluster:monitor/nodes/stats[n]] request_id [528073222] timed out after [15000ms]",
while on Data238_0 (the data node) there is no log output until the master left. The logs are as follows:

@njustyq (Author)

njustyq commented Dec 19, 2019

@jakelandis Here are the logs of the master and the data node; thanks for checking them out.
master log:
[screenshot: master log]

data log:
[screenshot: data log]
