
OOM due to large number of requests in TransportService.clientHandlers #50241

Closed
njustyq opened this issue Dec 16, 2019 · 8 comments

@njustyq

njustyq commented Dec 16, 2019

Describe the feature:

Elasticsearch version (bin/elasticsearch --version): 6.3.2

Plugins installed: [repository-hdfs]

JVM version (java -version): 10.0.2

OS version (uname -a if on a Unix-like system): CentOS 7.2

Description of the problem including expected versus actual behavior:
Our cluster has 24 data nodes (31 GB heap, 4 × 1.7 TB SSD disks each) and 3 master nodes (8 GB heap), handling about 8000 TPS of writes. We rolling-upgraded the cluster (6.3.2 to 6.3.2) with the following steps (a sketch of the corresponding settings calls follows the list):
① set allocation to none
② restart the data node
③ set allocation to all
then wait for the cluster health to go from yellow back to green.
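
For reference, a minimal sketch of steps ① and ③ as cluster-settings calls; the host/port and curl invocations here are illustrative placeholders rather than the exact commands we ran:

```sh
# ① disable shard allocation before restarting a data node
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ② restart the data node, then
# ③ re-enable shard allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'

# wait for the cluster health to return to green
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s"
```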
After I finished upgrading part of the data nodes and had waited a while, I found the master node doing old-generation GCs, and it later ran out of memory (OOM).
So I loaded the heap dump from the master node into Eclipse MemoryAnalyzer and found that 87.57% of the memory is used by the TransportService.clientHandlers hash map;
most of the RequestHolder entries hold actions like indices:monitor/stats[n], indices:monitor/recovery[n], or cluster:monitor/stats[n]. Below are the heap dump screenshots:
[screenshots: heap dump from Eclipse MemoryAnalyzer]

I also used the OQL query `SELECT toString(action) FROM org.elasticsearch.transport.TransportService$RequestHolder` to count the actions; the result is as follows:

[screenshot: per-action counts from the OQL query]

So, is there a bug here, or is the master just overloaded?

@njustyq (Author)

njustyq commented Dec 17, 2019

@elasticmachine Can someone check it out for me?

@njustyq (Author)

njustyq commented Dec 17, 2019

@jimczi Could you please take a look at it for me?

@elasticmachine (Collaborator)

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@jasontedor (Member)

@njustyq Please stop pinging people directly. Someone will look at your issue, but, so that expectations are clear, there is no implicit or explicit SLA here.

@njustyq (Author)

njustyq commented Dec 17, 2019

@jasontedor Sorry, I'm just a little anxious. Thanks for your attention to my issue.

@jakelandis (Contributor)

@njustyq The master node is responsible for collecting all of the data needed to push to monitoring. I suspect that your environment has a very large number of shards (the usual culprit for high memory usage for monitoring with indices:monitor/stats) and/or a large number of other things that the collectors need to collect: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/es-monitoring-collectors.html.

I think this may be a case where more memory is needed for the master, or where there need to be fewer things in your cluster (i.e. shards) to monitor.

Newer versions of the Elastic Stack (6.5+) allow you to turn off the collectors and let Metricbeat do the work, which I believe would help here: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/configuring-metricbeat.html
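
For reference, a minimal sketch of the relevant settings (assuming the `xpack.monitoring.collection.enabled` dynamic cluster setting from the 6.x monitoring docs and the `xpack.monitoring.elasticsearch.collection.enabled` setting described in the Metricbeat guide linked above; please verify both against your exact version):

```sh
# Turn off the internal monitoring collectors entirely
# (dynamic cluster setting, available from roughly 6.3 onward).
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": { "xpack.monitoring.collection.enabled": false }
}'

# When Metricbeat takes over collection (6.5+), the guide above sets this
# to stop the default collection of Elasticsearch monitoring metrics:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": { "xpack.monitoring.elasticsearch.collection.enabled": false }
}'
```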

I am going to close this issue, as I can not find anything actionable for a bug or enhancement change.

@njustyq (Author)

njustyq commented Dec 19, 2019

> @njustyq The master node is responsible for collecting all of the data needed to push to monitoring. I suspect that your environment has a very large number of shards (the usual culprit for high memory usage for monitoring with indices:monitor/stats) and/or a large number of other things that the collectors need to collect: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/es-monitoring-collectors.html.
>
> I think this may be a case where more memory is needed for the master, or where there need to be fewer things in your cluster (i.e. shards) to monitor.
>
> Newer versions of the Elastic Stack (6.5+) allow you to turn off the collectors and let Metricbeat do the work, which I believe would help here: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/configuring-metricbeat.html
>
> I am going to close this issue, as I can not find anything actionable for a bug or enhancement change.

I checked the network between the master node and the data nodes, and it was OK.
Why have so many RequestHolder entries accumulated in clientHandlers for actions like indices:monitor/stats[n], indices:monitor/recovery[n], or cluster:monitor/stats[n]?

When I checked the log on the master node, there were many collector-timeout entries like "[ERROR][o.e.x.m.c.i.IndexStatsCollector] [Master_9.38.149.191_0] collector [index-stats] timed out when collecting data" and "ReceiveTimeoutTransportException: [Data238_0][9.38.1.238:1110][cluster:monitor/nodes/stats[n]] request_id [528073222] timed out after [15000ms]",
while on Data238_0 (the data node) there is no log output until the master left. The logs are as follows:

@njustyq (Author)

njustyq commented Dec 19, 2019

@jakelandis Here are the logs of the master and the data node; thanks for checking them out.
master log:
[screenshot: master log]

data log:
[screenshot: data log]
