It seems that datadog-agent uses `df -k` to monitor disk space usage. This is a time-consuming operation that, under certain conditions, can take longer than the interval between two checks.
When that happens, the server is effectively brought down by an ever-growing queue of `df` jobs.
This should never happen in the first place: if a probing call has not finished, the agent should not start a new one. That alone would prevent this kind of hammering.
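To illustrate the guard I have in mind, here is a minimal sketch, not the agent's actual code. The class name `SingleFlightProbe` and its methods are hypothetical; the only assumption is that the check shells out to a command such as `df -k`. A new run is refused while the previous one is still alive, so a hung `df` leaves exactly one stuck process behind instead of a growing queue.

```python
import subprocess


class SingleFlightProbe:
    """Hypothetical helper: runs a command, but never two copies at once."""

    def __init__(self, argv):
        self.argv = list(argv)
        self._proc = None

    def busy(self):
        # poll() returns None while the child has not exited yet.
        return self._proc is not None and self._proc.poll() is None

    def try_start(self):
        """Start the command unless the previous run is still alive.

        Returns True if a new process was started, False if it was skipped.
        """
        if self.busy():
            return False
        self._proc = subprocess.Popen(
            self.argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
        return True

    def wait_output(self, timeout):
        """Return stdout as bytes, or None if the run did not finish in time.

        On timeout the child is deliberately left alone: a process blocked in
        uninterruptible sleep cannot be killed anyway, and keeping the handle
        is what makes the next try_start() skip.
        """
        if self._proc is None:
            return None
        try:
            out, _ = self._proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            return None
        return out
```

A check loop would call `try_start()` once per interval on something like `SingleFlightProbe(["df", "-k"])` and simply report "no data" for that interval whenever it returns `False`.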
One side effect is that Datadog reporting stops a few hours before the machine goes down. The load average on the machine is huge, around 500, but the CPU is nearly idle: all the stuck jobs are in state "D" (disk sleep).
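For anyone who wants to confirm the same symptom, this is a rough sketch of how the processes in state "D" can be listed on Linux by reading `/proc/<pid>/stat`. The function names are my own, and the `proc_root` parameter exists only to make the code easy to point at a different directory.

```python
import os


def state_from_stat(stat_line):
    """Extract the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so split after the last ')' before reading the state.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def pids_in_state(wanted="D", proc_root="/proc"):
    """Return the PIDs whose current state letter equals `wanted`."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        try:
            path = os.path.join(proc_root, name, "stat")
            with open(path, errors="replace") as fh:
                if state_from_stat(fh.read()) == wanted:
                    pids.append(int(name))
        except (OSError, IndexError):
            continue  # process exited while we were scanning
    return pids
```

Calling `len(pids_in_state("D"))` during the incident should return a number close to the load average, since Linux counts uninterruptible-sleep processes in the load.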
Output of `dd-agent info`:

```
===================
Collector (v 5.1.1)
===================
Status date: 2015-09-18 15:19:25 (12s ago)
Pid: 2219
Platform: Linux-3.2.0-4-amd64-x86_64-with-debian-7.8
Python Version: 2.7.8
```
This has happened 2-3 times over the last few months, always on the same machine, simply because it is the one doing a lot of NFS work.