OOM due to large number of requests in TransportService.clientHandlers #50241
Comments
@elasticmachine Can someone check it out for me?
@jimczi Could you please take a look at it for me?
Pinging @elastic/es-core-features (:Core/Features/Monitoring)
@njustyq Please stop pinging people directly. Someone will look at your issue, but so that expectations are clear, there is no implicit nor explicit SLA here.
@jasontedor sorry, I'm just a little anxious. Thanks for your attention to my issue.
@njustyq The master node is responsible for collecting all of the data needed to push to monitoring. I suspect that your environment has a very large number of shards (the usual culprit for high memory usage for monitoring with indices:monitor/stats) and/or a large number of other things that the collectors need to collect: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/es-monitoring-collectors.html. I think this may be a case where more memory is needed for the master, or where you need fewer things in your cluster (i.e. shards) that have to be monitored. Newer versions of the Elastic stack (6.5+) allow you to turn off the collectors and let Metricbeat do the work, which I believe would help here: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/configuring-metricbeat.html. I am going to close this issue, as I cannot find anything actionable for a bug or enhancement change.
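(For reference, a minimal sketch of how one might check the shard count and turn off internal collection, assuming a default localhost:9200 endpoint; the setting name is taken from the monitoring docs linked above, so verify it against your exact version.)

```sh
# Rough check of how many shards the master has to collect stats for
# (a very large count is the usual cause of heavy monitoring collection).
curl -s 'localhost:9200/_cluster/health?filter_path=active_shards,active_primary_shards&pretty'

# If switching to Metricbeat (6.5+), internal collection can be disabled
# dynamically via the cluster settings API.
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"xpack.monitoring.collection.enabled": false}}'
```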
I had checked the network between the master node and the data nodes, and it was OK. When I checked the logs from the master node, there were many collection timeout messages such as "[ERROR][o.e.x.m.c.i.IndexStatsCollector] [Master_9.38.149.191_0] collector [index-stats] timed out when collecting data" and "ReceiveTimeoutTransportException: [Data238_0][9.38.1.238:1110][cluster:monitor/nodes/stats[n]] request_id [528073222] timed out after [15000ms]".
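(Those timeouts correspond to the per-collector timeout settings described in the collectors doc linked earlier. A hedged sketch of raising them follows, assuming localhost:9200 and that these settings are dynamic on 6.3.2; if they are not, they belong in elasticsearch.yml on the master nodes. Note that raising them only buys time and does not reduce how much the master has to collect.)

```sh
# Raise the monitoring collector timeouts for index stats and node stats.
# Setting names are from the es-monitoring-collectors documentation;
# verify they exist and are dynamic in your exact version.
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "xpack.monitoring.collection.index.stats.timeout": "30s",
      "xpack.monitoring.collection.node.stats.timeout": "30s"
    }
  }'
```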
@jakelandis Here are the logs from the master and data nodes, thanks for checking them out.
Describe the feature:
Elasticsearch version (bin/elasticsearch --version): 6.3.2
Plugins installed: [repository-hdfs]
JVM version (java -version): 10.0.2
OS version (uname -a if on a Unix-like system): centos7.2
Description of the problem including expected versus actual behavior:
Our cluster has 24 data nodes with 31G heap and 4 × 1.7T SSD disks each, plus 3 master nodes with 8G heap, handling about 8000 TPS of writes. We rolling-upgraded (6.3.2 to 6.3.2) the cluster with the following steps (the allocation commands are sketched below):
① set allocation to none
② restart the data node
③ set allocation to all
then wait for the cluster health to go from yellow back to green.
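(For reference, the allocation toggling in steps ① and ③ is normally done with the cluster.routing.allocation.enable setting; a minimal sketch of what those commands likely look like, assuming a default localhost:9200 endpoint:)

```sh
# Step ①: disable shard allocation before restarting a data node
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

# Step ② is the data node restart itself.

# Step ③: re-enable allocation once the node has rejoined the cluster
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'

# Final step: wait for cluster health to go from yellow back to green
curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=5m&pretty'
```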
After I had finished upgrading part of the data nodes and waited a while, I found the master node doing old GC and later running OOM.
So I loaded a heap dump from the master node into Eclipse MemoryAnalyzer and found that 87.57% of the memory was used by the TransportService.clientHandlers hash map;
most of the RequestHolder entries were for actions like indices:monitor/stats[n], indices:monitor/recovery[n], or cluster:monitor/stats[n]. Below is a picture of the heap dump:
I then used the OQL query
SELECT toString(action) FROM org.elasticsearch.transport.TransportService$RequestHolder
to tally the actions; the result is as follows:
So, is there a bug here, or is the master simply overloaded?