-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docker_container_cpu metrics does not manage well per_cpu stats (online_cpus) in presence of offline CPUs #2964
Comments
I have started to work on a patch for that issue. I have some question about the CPU percent usage calculation though. The latter sounds more logical to me, as it refelects what telegraf sends for the (not docker) cpu measurement, but that would break current implementations of graph/dashboarding/alerting tools that probably already handles this metric that way. Any thoughts on this ? Another question : my patch needs to update the docker client Go library (which should be after moby/moby@115f91d which fixes the problem on the docker side). Thanks for your guidance. |
I think we should probably use the algorithm here: https://github.com/docker/cli/blob/master/cli/command/container/stats_helpers.go#L168. There are separate functions for Windows and Unix, see #2457 In general, we can update to the latest stable client library on any 1.X release so long as it passes the current tests. I'm a little confused with the new docker version numbers, is v17.05.0-ce not considered stable? |
Yeah, new versioning scheme (+CE/EE distinction) is not really easy to follow. For CPU calculation, that's almost what I have done : I might change the I have a last question. I wanted to include assertions in tests to check that in case of OnlineCPUs < n_cpus, the metrics for OfflineCPUs were not included. But the testing class does not include an AssertNotContainsTaggedFields method. Any thoughts on that ? |
Never mind, I added an AssertDoesNotContainsTaggedFields method which will be in my pull request, you will be able to refuse the commits concerning the tests if you think it is unnecessary. |
Bug report
When used with recent kernels in presence of offline CPUs (with some virtualization systems by example), docker_container_cpu metrics could be wrongs :
This is due to not taking the "online_cpus" info returns by recent docker engine stats API into account.
Relevant telegraf.conf:
System info:
Telegraf : 1.3.2-1
OS : Debian stretch / Linux kernel 4.9.0-3-amd64 / Hyper-V VM
Docker Engine : 17.05.0
ce-0debian-stretchSteps to reproduce:
Expected behavior:
the 4th command should show N differents CPU tags + cpu-total
Actual behavior:
the 4th command shows X CPUS tags (and usage percentage is wrong) + cpu-total
Additional info:
moby/moby#28941
moby/moby#26591
moby/moby#31579
moby/moby#31802
Example of docker stats obtained with a patched Docker engine :
https://pastebin.com/3JZt86MC
As you can see, there are 240 cpu metrics (cpu_stats.cpu_usage.percpu_usage)
The number of real used CPUs is given by cpu_stats.online_cpus (4 in our case).
In this example, this results in storing 236 useless values, which generate a lot of storage overhead in influxDB.
And also the CPU percentage is wrong : I should multiply it by 60 (240/4) in grafana to have something accurate. Not really user-friendly, and not generic (should calculate it for each host depending on the number of vcpus).
The text was updated successfully, but these errors were encountered: