Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker_container_cpu metrics does not manage well per_cpu stats (online_cpus) in presence of offline CPUs #2964

Closed
eesprit opened this issue Jun 26, 2017 · 4 comments · Fixed by #3035
Milestone

Comments

@eesprit
Copy link
Contributor

eesprit commented Jun 26, 2017

Bug report

When used with recent kernels in presence of offline CPUs (with some virtualization systems by example), docker_container_cpu metrics could be wrongs :

  • cpu percentage might be wrong
  • too much metrics (per cpu stats) might be stored

This is due to not taking the "online_cpus" info returns by recent docker engine stats API into account.

Relevant telegraf.conf:

[[inputs.docker]]
  ## Docker Endpoint
  ##   To use TCP, set endpoint = "tcp://[ip]:[port]"
  ##   To use environment variables (ie, docker-machine), set endpoint = "ENV"
  endpoint = "unix:///var/run/docker.sock"
  ## Only collect metrics for these containers, collect all if empty
  container_names = []
  ## Timeout for docker list, info, and stats commands
  timeout = "5s"

  ## Whether to report for each container per-device blkio (8:0, 8:1...) and
  ## network (eth0, eth1, ...) stats or not
  perdevice = true
  ## Whether to report for each container total blkio and network stats or not
  total = false

System info:

Telegraf : 1.3.2-1
OS : Debian stretch / Linux kernel 4.9.0-3-amd64 / Hyper-V VM
Docker Engine : 17.05.0ce-0debian-stretch

Steps to reproduce:

  1. Start a Linux VM (with recent Linux Kernel > 4.7) with N vcpus on an Hyper-V Hypervisor (I don't know if other hypervisors are doing the same thing concerning unplugged CPUs, can't reproduce on Xen by example) with X physical CPUs (with X > N)
  2. Install Docker and Telegraf
  3. Collect stats from docker
  4. SHOW TAG VALUES FROM docker_container_cpu WITH KEY = "cpu";

Expected behavior:

the 4th command should show N differents CPU tags + cpu-total

Actual behavior:

the 4th command shows X CPUS tags (and usage percentage is wrong) + cpu-total

Additional info:

moby/moby#28941
moby/moby#26591
moby/moby#31579
moby/moby#31802

Example of docker stats obtained with a patched Docker engine :
https://pastebin.com/3JZt86MC

As you can see, there are 240 cpu metrics (cpu_stats.cpu_usage.percpu_usage)
The number of real used CPUs is given by cpu_stats.online_cpus (4 in our case).

In this example, this results in storing 236 useless values, which generate a lot of storage overhead in influxDB.

And also the CPU percentage is wrong : I should multiply it by 60 (240/4) in grafana to have something accurate. Not really user-friendly, and not generic (should calculate it for each host depending on the number of vcpus).

@eesprit
Copy link
Contributor Author

eesprit commented Jul 19, 2017

I have started to work on a patch for that issue.

I have some question about the CPU percent usage calculation though.
Actually, it takes the cpu usage, and multiply it by (100 * number of cores).
So on a system with 4 cores, this means we could have up to 400% CPU usage.
Is that something wanted ? Or should it be a percent of the total of CPU (100% max) ?

The latter sounds more logical to me, as it refelects what telegraf sends for the (not docker) cpu measurement, but that would break current implementations of graph/dashboarding/alerting tools that probably already handles this metric that way.

Any thoughts on this ?

Another question : my patch needs to update the docker client Go library (which should be after moby/moby@115f91d which fixes the problem on the docker side).
Is there a recommandation on which docker client to select ?
Using the more recent stable one (v17.03.2-ce), the more recent one (v17.05.0-ce) or anything else ?

Thanks for your guidance.

@danielnelson
Copy link
Contributor

I think we should probably use the algorithm here: https://github.com/docker/cli/blob/master/cli/command/container/stats_helpers.go#L168. There are separate functions for Windows and Unix, see #2457

In general, we can update to the latest stable client library on any 1.X release so long as it passes the current tests. I'm a little confused with the new docker version numbers, is v17.05.0-ce not considered stable?

@eesprit
Copy link
Contributor Author

eesprit commented Jul 20, 2017

Yeah, new versioning scheme (+CE/EE distinction) is not really easy to follow.
https://i0.wp.com/blog.docker.com/wp-content/uploads/lifecycle.png?resize=1024%2C376&ssl=1
=> v17.03.X are kind of LTS (at least for 3 or 4 months), 17.05 is not. Might still be considered as stable though.

For CPU calculation, that's almost what I have done :
https://github.com/viareport/telegraf/commit/82d30aa7ccfc4d89fe3f9949b607cb4524674a1c#diff-691fdaed73e36497110e1ea47c1498f5R539

I might change the > 0 test for the '== 0.0' test variant, if you think that's better.

I have a last question. I wanted to include assertions in tests to check that in case of OnlineCPUs < n_cpus, the metrics for OfflineCPUs were not included. But the testing class does not include an AssertNotContainsTaggedFields method. Any thoughts on that ?

@eesprit
Copy link
Contributor Author

eesprit commented Jul 20, 2017

Never mind, I added an AssertDoesNotContainsTaggedFields method which will be in my pull request, you will be able to refuse the commits concerning the tests if you think it is unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants