Nested RemoteTransportExceptions flood the logs and fill the disk #19187
Comments
I think that this is a manifestation of #12573. It can happen when the target node of a primary relocation takes a long time to apply the cluster state that marks it as holding the new primary. It is fixed in the upcoming v5.0.0 (#16274). The question is why the node took so long to apply a cluster state update. Is there anything else in the logs that might indicate this? What is the timestamp of the last log entry that has one of those huge exception traces?
@ywelsch this is still unclear; the first trace was at 08:34:20,553 and the last one was a StackOverflow at 09:48:19,128:
Note that it's happening again right now with two other nodes, elastic1021 and elastic1036 (still the master). Unfortunately, keeping the logs is difficult (the disk is full).
It is tricky to verify that this is indeed #12573 (if so, we could think about backporting #16274). Once the exceptions start bubbling up, the nodes have up-to-date cluster states (i.e. the node holding the primary relocation target now has the cluster state in which the primary relocation is completed). The exceptions only pile up on top of each other while the deep call stack of the recursive calls between the nodes is being unwound. Is the rolling restart of the cluster completed by now? If so, are there any shard relocations in progress? @bleskes thoughts?
Note that I'm pretty sure that the second time it happened the cluster was green, but certainly with shards relocating. I've restarted the master to schedule a new election; we'll monitor the cluster state and comment on this ticket with any new relevant info. I agree with you: I'm not sure the exception timestamps in the logs are relevant, because it seems to be a recursion problem and most of the logs were generated by the circuit breaker in ExceptionsHelper#unwrapCause (I wonder if the same kind of circuit breaker should be added to the logger itself to avoid writing bazillions of "Caused by" lines).
…lper#unwrapCause This code seems to be a circuit breaker to prevent digging too deeply into the exception causes. If we trip this circuit breaker, it's likely that something very bad is happening. Unfortunately, we then ask the logger to log the full stack of all the causes. This can generate gigabytes of logs in a few minutes and fill up all disk space. It happened on a cluster affected by what seems to be a recursion bug (c.f. elastic#19187), where it generated 27 GB of logs in less than 5 minutes. While this code is useful for debugging problematic exceptions, it may generate so many lines that it causes "no space left on device" errors, making debugging the root cause even harder.
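For reference, a minimal sketch of the two guards being discussed, assuming a hypothetical helper class; this is not the actual ExceptionsHelper code, and the names MAX_UNWRAP_DEPTH, MAX_LOGGED_CAUSES, and formatCauses are made up. The unwrap loop stops after a fixed depth, and the logger-side guard caps how many "Caused by" lines are ever written:

```java
/**
 * Minimal sketch (NOT the actual Elasticsearch implementation) of the two
 * guards discussed above: unwrapping stops after a fixed depth, and the
 * logger prints only a bounded number of "Caused by" lines.
 */
public final class CauseChainGuards {

    private static final int MAX_UNWRAP_DEPTH = 10;   // unwrap circuit breaker
    private static final int MAX_LOGGED_CAUSES = 100; // hypothetical logging cap

    /** Follow getCause() at most MAX_UNWRAP_DEPTH times instead of walking a huge chain. */
    public static Throwable unwrapCause(Throwable t) {
        Throwable result = t;
        for (int depth = 0; depth < MAX_UNWRAP_DEPTH && result.getCause() != null; depth++) {
            result = result.getCause();
        }
        return result;
    }

    /** Render at most MAX_LOGGED_CAUSES causes so a pathological chain cannot fill the disk. */
    public static String formatCauses(Throwable t) {
        StringBuilder sb = new StringBuilder(t.toString());
        Throwable cause = t.getCause();
        int logged = 0;
        while (cause != null && logged++ < MAX_LOGGED_CAUSES) {
            sb.append("\nCaused by: ").append(cause);
            cause = cause.getCause();
        }
        if (cause != null) {
            sb.append("\n... remaining causes truncated ...");
        }
        return sb.toString();
    }
}
```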
Can you share a pair of subsequent log messages, one with
It happened again today.
It's unclear to me how node_concurrent_recoveries and cluster_concurrent_rebalance interact. What happens if the cluster decides to rebalance more than 3 shards to the same node? Will node_concurrent_recoveries prevent this from happening? I think that what saves us from an OOM is a StackOverflow when the huge exception is serialized. @jasontedor here are the first 4 log entries: (https://gist.github.com/nomoa/2ee1f8bb44a4c6c01c400787d66bc383). The pattern seems to be 2 with 13 causes, 2 with 15 causes, and so on...
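For illustration, here is a simplified sketch of how the two limits could interact; this is not the actual ThrottlingAllocationDecider or balancer code, and all class, method, and parameter names below are hypothetical. The idea is that cluster_concurrent_rebalance caps rebalancing relocations cluster-wide, while node_concurrent_recoveries caps how many recoveries a single node participates in, so extra relocations targeting the same node should get throttled even if the balancer would like to move them:

```java
import java.util.Map;

/**
 * Simplified sketch of how the two throttles could combine; not Elasticsearch code.
 */
public final class AllocationThrottleSketch {

    // Hypothetical stand-ins for the two cluster settings discussed above.
    private final int nodeConcurrentRecoveries;   // cluster.routing.allocation.node_concurrent_recoveries
    private final int clusterConcurrentRebalance; // cluster.routing.allocation.cluster_concurrent_rebalance

    public AllocationThrottleSketch(int nodeConcurrentRecoveries, int clusterConcurrentRebalance) {
        this.nodeConcurrentRecoveries = nodeConcurrentRecoveries;
        this.clusterConcurrentRebalance = clusterConcurrentRebalance;
    }

    /**
     * A rebalancing relocation to targetNode is allowed only if both the
     * cluster-wide rebalance limit and the per-node recovery limit still
     * have headroom, so extra relocations to the same node get throttled.
     */
    public boolean canRebalanceTo(String targetNode,
                                  int relocationsInProgressClusterWide,
                                  Map<String, Integer> recoveriesPerNode) {
        if (relocationsInProgressClusterWide >= clusterConcurrentRebalance) {
            return false; // cluster_concurrent_rebalance exhausted
        }
        int recoveriesOnTarget = recoveriesPerNode.getOrDefault(targetNode, 0);
        return recoveriesOnTarget < nodeConcurrentRecoveries; // node_concurrent_recoveries check
    }
}
```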
@nomoa I've back-ported the fix for #12573 to 2.4 (#19296). All information so far indicates that this is the issue you're experiencing. Unfortunately, my back-port was too late to make it into 2.3.4, so you will have to wait for 2.4.0 to test it out. In the meantime, I wonder if dedicated master nodes would help here. If I understand correctly, this issue appeared only when a primary shard on the master node was involved. As cluster states are applied on all the other nodes before they are applied on the master node, and if cluster state application is slow (e.g. due to a large number of indices/shards), having dedicated master nodes might decrease the window in which cluster states are out of sync on the nodes holding the primary relocation source and target. It might also be interesting to raise the logging level of "org.elasticsearch.cluster.service" to DEBUG to see how long nodes take to apply the cluster state (messages of the form "processing [{}]: took {} done applying updated cluster_state").
@ywelsch awesome, thanks for the backport. Yes, it always happened on shards where the master was involved, though if I understood correctly this specific issue could also happen between two data nodes. Note that it's not the first time we've suspected the master of being too busy to act properly. Moving to dedicated master nodes is on our todo list; thanks for the suggestions.
Happening for me as well. I have disabled logging for the time being. Waiting for ES 2.4.0 :) |
Closed by #19296. Once 2.4.0 is out, please ping on this ticket if you're still seeing the same issue.
It happened during a rolling restart needed for a security upgrade. The cluster is running Elasticsearch 2.3.3.
All nodes are running the same JVM version (OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)).
A RemoteTransportException seemed to "loop?" between the 2 nodes, causing Elasticsearch to log bigger and bigger exception traces, as each new RemoteTransportException seemed to be created with the previous one, and all of its causes, as its cause chain.
The first trace (on elastic1045) was:
The second one (same root cause) appeared a few milliseconds later, also with 12 causes.
The third and fourth ones had 14 causes, the fifth and sixth 16 causes, and so on...
The last one I've seen had 1982 chained causes.
The logs were nearly the same on elastic1036 (the master), generating 27 GB of logs in a few minutes on both nodes.
Surprisingly, the cluster was still performing relatively well, though with higher GC activity on these nodes.
Then (maybe 1 hour after the first trace) elastic1045 was dropped from the cluster:
It was immediately re-added and the log flood stopped.
I'll comment on this ticket if it happens again.
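To make the growth pattern described above concrete, here is a toy simulation; this is not Elasticsearch code, and the node names and messages are made up. If each node wraps the exception it received from the other in a new exception and sends it back, the chain of causes grows by two per round trip, matching the 12, 14, 16, ... progression seen in the logs:

```java
/**
 * Toy simulation of the ping-pong wrapping between two nodes: the cause
 * chain grows by two on every round trip.
 */
public final class CauseGrowthSimulation {

    public static void main(String[] args) {
        Exception failure = new Exception("initial shard failure");
        for (int roundTrip = 1; roundTrip <= 5; roundTrip++) {
            // Node A wraps what it received from node B and sends it back ...
            failure = new Exception("remote failure on elastic1045", failure);
            // ... and node B wraps node A's response again.
            failure = new Exception("remote failure on elastic1036", failure);
            System.out.println("round trip " + roundTrip + ": chain length = " + chainLength(failure));
        }
    }

    /** Count the exception plus all of its chained causes. */
    private static int chainLength(Throwable t) {
        int length = 0;
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            length++;
        }
        return length;
    }
}
```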