
datadog agent is taking server down by building queues of df -k on high load/slow NFS file systems #1931

Closed
ssbarnea opened this issue Sep 18, 2015 · 2 comments

Comments

@ssbarnea

It seems that datadog-agent is using df -k to monitor disk space usage, and this is a time-consuming operation that, under certain conditions, can take longer than the interval between two checks.

As you can imagine, when this happens the server is effectively brought down by a never-ending queue of jobs all running df.

This should never happen in the first place: if a probing call has not finished, the system should never start a new one, which would prevent this kind of hammering.
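
To illustrate the guard described above, here is a minimal Python sketch (not the agent's actual code): the check keeps a handle to the in-flight df -k and skips a cycle while the previous run is still blocked, instead of queuing another process.

# Minimal sketch, not dd-agent code: never start a new `df -k` while the
# previous one is still running (e.g. blocked on a hung NFS mount).
import subprocess

class DiskCheck(object):
    def __init__(self):
        self._proc = None  # in-flight `df -k`, if any

    def run(self):
        if self._proc is not None:
            if self._proc.poll() is None:
                return None              # still blocked: skip this cycle
            output = self._proc.stdout.read()
            self._proc = None
            return output                # pick up the result of the last run
        self._proc = subprocess.Popen(
            ["df", "-k"], stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        return None                      # results collected on a later cycle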

I should add that one side effect of this is that Datadog reporting stops a few hours before the machine goes down. The load on the machine is huge, around ~500, but the CPU is barely used because all of these jobs are blocked on I/O: the stuck jobs are all in state "D" (uninterruptible disk sleep).
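
For anyone trying to confirm the same symptom, a rough, hypothetical diagnostic (not part of dd-agent) is to count df processes stuck in "D" state:

# Hypothetical diagnostic helper: count `df` processes in uninterruptible
# sleep ("D" state), which is what piles up when an NFS mount hangs.
import subprocess

def count_stuck(name="df"):
    out = subprocess.check_output(["ps", "-eo", "stat,comm"])
    stuck = 0
    for line in out.decode("utf-8", "replace").splitlines()[1:]:
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0].startswith("D") and fields[1].strip() == name:
            stuck += 1
    return stuck

if __name__ == "__main__":
    print("df processes in D state: %d" % count_stuck())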

dd-agent info
===================
Collector (v 5.1.1)
===================

  Status date: 2015-09-18 15:19:25 (12s ago)
  Pid: 2219
  Platform: Linux-3.2.0-4-amd64-x86_64-with-debian-7.8
  Python Version: 2.7.8

Also, I should mention that this has happened 2-3 times over the last few months, on the same machine (simply because it is the one doing a lot of NFS work).

@yannmh
Member

yannmh commented Sep 18, 2015

Hi @ssbarnea,

While we are still using df to monitor disk space usage, the issue was addressed in the 5.4.6 agent release.

More specifically: DataDog/gohai#16. Would you mind upgrading and telling us if you are still experiencing the issue? Thanks!
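
For context, the general idea behind such a fix is to bound the df call with a timeout so that a hung NFS mount cannot block the disk check indefinitely. The sketch below only illustrates that idea in Python and is not the actual gohai change:

# Illustrative only, not the gohai fix: run `df -k` but give up after a
# timeout instead of waiting forever on a hung NFS mount.
import subprocess
import time

def df_with_timeout(timeout=10.0, interval=0.1):
    proc = subprocess.Popen(
        ["df", "-k"], stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    deadline = time.time() + timeout
    while proc.poll() is None:
        if time.time() > deadline:
            # Best effort: a process in "D" state may only die once the NFS
            # operation returns, but at least we stop waiting on it here.
            proc.kill()
            raise RuntimeError("df -k timed out after %.1fs" % timeout)
        time.sleep(interval)
    out, _ = proc.communicate()
    return out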

@olivielpeau
Member

@ssbarnea I'm closing this issue as the fix mentioned above should have resolved it, but feel free to re-open it if you still have issues.
