
datadog agent is taking server down by building queues of df -k on high load/slow NFS file systems #1931

Closed
ssbarnea opened this issue Sep 18, 2015 · 2 comments

Comments

@ssbarnea

It seems that datadog-agent is using df -k to monitor disk space usage, and this is a time-consuming operation that, under certain conditions, can take longer than the interval between two checks.

As you can imagine, when this happens the server is effectively brought down by a never-ending queue of jobs all running df.

This should never happen in the first place: if a probing call has not finished, the system should never start a new one, which would prevent this kind of hammering.
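
To illustrate the guard described above, here is a minimal Python sketch (not the agent's actual code): the check keeps a handle to the in-flight df -k and skips a cycle while the previous run is still blocked, instead of queuing another process.

# Minimal sketch, not dd-agent code: never start a new `df -k` while the
# previous one is still running (e.g. blocked on a hung NFS mount).
import subprocess

class DiskCheck(object):
    def __init__(self):
        self._proc = None  # in-flight `df -k`, if any

    def run(self):
        if self._proc is not None:
            if self._proc.poll() is None:
                return None              # still blocked: skip this cycle
            output = self._proc.stdout.read()
            self._proc = None
            return output                # pick up the result of the last run
        self._proc = subprocess.Popen(
            ["df", "-k"], stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        return None                      # results collected on a later cycle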

I should add that one side effect of this is that Datadog reporting stops a few hours before the machine goes down. The load on the machine is huge, around ~500, but the CPU is barely used because all of these jobs are blocked on I/O: the stuck jobs are all in state "D" (uninterruptible disk sleep).
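
For anyone trying to confirm the same symptom, a rough, hypothetical diagnostic (not part of dd-agent) is to count df processes stuck in "D" state:

# Hypothetical diagnostic helper: count `df` processes in uninterruptible
# sleep ("D" state), which is what piles up when an NFS mount hangs.
import subprocess

def count_stuck(name="df"):
    out = subprocess.check_output(["ps", "-eo", "stat,comm"])
    stuck = 0
    for line in out.decode("utf-8", "replace").splitlines()[1:]:
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0].startswith("D") and fields[1].strip() == name:
            stuck += 1
    return stuck

if __name__ == "__main__":
    print("df processes in D state: %d" % count_stuck())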

dd-agent info
===================
Collector (v 5.1.1)
===================

  Status date: 2015-09-18 15:19:25 (12s ago)
  Pid: 2219
  Platform: Linux-3.2.0-4-amd64-x86_64-with-debian-7.8
  Python Version: 2.7.8

Also, I should mention that this has happened 2-3 times over the last few months, on the same machine (simply because it is the one doing a lot of NFS work).

@yannmh
Member

yannmh commented Sep 18, 2015

Hi @ssbarnea,

While we are still using df to monitor disk space usage, the issue was addressed in the 5.4.6 agent release.

More specifically: DataDog/gohai#16. Would you mind upgrading and telling us if you are still experiencing the issue? Thanks!
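
For context, the general idea behind such a fix is to bound the df call with a timeout so that a hung NFS mount cannot block the disk check indefinitely. The sketch below only illustrates that idea in Python and is not the actual gohai change:

# Illustrative only, not the gohai fix: run `df -k` but give up after a
# timeout instead of waiting forever on a hung NFS mount.
import subprocess
import time

def df_with_timeout(timeout=10.0, interval=0.1):
    proc = subprocess.Popen(
        ["df", "-k"], stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    deadline = time.time() + timeout
    while proc.poll() is None:
        if time.time() > deadline:
            # Best effort: a process in "D" state may only die once the NFS
            # operation returns, but at least we stop waiting on it here.
            proc.kill()
            raise RuntimeError("df -k timed out after %.1fs" % timeout)
        time.sleep(interval)
    out, _ = proc.communicate()
    return out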

@olivielpeau
Member

@ssbarnea I'm closing this issue as the fix mentioned above should have resolved it, but feel free to re-open it if you still have issues.
