docker_container_cpu metrics does not manage well per_cpu stats (online_cpus) in presence of offline CPUs #2964

eesprit · 2017-06-26T12:36:22Z

Bug report

When used with recent kernels in presence of offline CPUs (with some virtualization systems by example), docker_container_cpu metrics could be wrongs :

cpu percentage might be wrong
too much metrics (per cpu stats) might be stored

This is due to not taking the "online_cpus" info returns by recent docker engine stats API into account.

Relevant telegraf.conf:

[[inputs.docker]]
  ## Docker Endpoint
  ##   To use TCP, set endpoint = "tcp://[ip]:[port]"
  ##   To use environment variables (ie, docker-machine), set endpoint = "ENV"
  endpoint = "unix:///var/run/docker.sock"
  ## Only collect metrics for these containers, collect all if empty
  container_names = []
  ## Timeout for docker list, info, and stats commands
  timeout = "5s"

  ## Whether to report for each container per-device blkio (8:0, 8:1...) and
  ## network (eth0, eth1, ...) stats or not
  perdevice = true
  ## Whether to report for each container total blkio and network stats or not
  total = false

System info:

Telegraf : 1.3.2-1
OS : Debian stretch / Linux kernel 4.9.0-3-amd64 / Hyper-V VM
Docker Engine : 17.05.0~~ce-0~~debian-stretch

Steps to reproduce:

Start a Linux VM (with recent Linux Kernel > 4.7) with N vcpus on an Hyper-V Hypervisor (I don't know if other hypervisors are doing the same thing concerning unplugged CPUs, can't reproduce on Xen by example) with X physical CPUs (with X > N)
Install Docker and Telegraf
Collect stats from docker
SHOW TAG VALUES FROM docker_container_cpu WITH KEY = "cpu";

Expected behavior:

the 4th command should show N differents CPU tags + cpu-total

Actual behavior:

the 4th command shows X CPUS tags (and usage percentage is wrong) + cpu-total

Additional info:

moby/moby#28941
moby/moby#26591
moby/moby#31579
moby/moby#31802

Example of docker stats obtained with a patched Docker engine :
https://pastebin.com/3JZt86MC

As you can see, there are 240 cpu metrics (cpu_stats.cpu_usage.percpu_usage)
The number of real used CPUs is given by cpu_stats.online_cpus (4 in our case).

In this example, this results in storing 236 useless values, which generate a lot of storage overhead in influxDB.

And also the CPU percentage is wrong : I should multiply it by 60 (240/4) in grafana to have something accurate. Not really user-friendly, and not generic (should calculate it for each host depending on the number of vcpus).

The text was updated successfully, but these errors were encountered:

eesprit · 2017-07-19T12:13:17Z

I have started to work on a patch for that issue.

I have some question about the CPU percent usage calculation though.
Actually, it takes the cpu usage, and multiply it by (100 * number of cores).
So on a system with 4 cores, this means we could have up to 400% CPU usage.
Is that something wanted ? Or should it be a percent of the total of CPU (100% max) ?

The latter sounds more logical to me, as it refelects what telegraf sends for the (not docker) cpu measurement, but that would break current implementations of graph/dashboarding/alerting tools that probably already handles this metric that way.

Any thoughts on this ?

Another question : my patch needs to update the docker client Go library (which should be after moby/moby@115f91d which fixes the problem on the docker side).
Is there a recommandation on which docker client to select ?
Using the more recent stable one (v17.03.2-ce), the more recent one (v17.05.0-ce) or anything else ?

Thanks for your guidance.

danielnelson · 2017-07-19T22:11:08Z

I think we should probably use the algorithm here: https://github.com/docker/cli/blob/master/cli/command/container/stats_helpers.go#L168. There are separate functions for Windows and Unix, see #2457

In general, we can update to the latest stable client library on any 1.X release so long as it passes the current tests. I'm a little confused with the new docker version numbers, is v17.05.0-ce not considered stable?

eesprit · 2017-07-20T07:43:45Z

Yeah, new versioning scheme (+CE/EE distinction) is not really easy to follow.
https://i0.wp.com/blog.docker.com/wp-content/uploads/lifecycle.png?resize=1024%2C376&ssl=1
=> v17.03.X are kind of LTS (at least for 3 or 4 months), 17.05 is not. Might still be considered as stable though.

For CPU calculation, that's almost what I have done :
https://github.com/viareport/telegraf/commit/82d30aa7ccfc4d89fe3f9949b607cb4524674a1c#diff-691fdaed73e36497110e1ea47c1498f5R539

I might change the > 0 test for the '== 0.0' test variant, if you think that's better.

I have a last question. I wanted to include assertions in tests to check that in case of OnlineCPUs < n_cpus, the metrics for OfflineCPUs were not included. But the testing class does not include an AssertNotContainsTaggedFields method. Any thoughts on that ?

eesprit · 2017-07-20T10:56:56Z

Never mind, I added an AssertDoesNotContainsTaggedFields method which will be in my pull request, you will be able to refuse the commits concerning the tests if you think it is unnecessary.

danielnelson added the area/docker label Jun 26, 2017

eesprit mentioned this issue Jul 20, 2017

Online cpus #3035

Merged

danielnelson closed this as completed in #3035 Jul 21, 2017

danielnelson added this to the 1.4.0 milestone Jul 21, 2017

aboguszewski-sumo mentioned this issue Mar 6, 2024

feat(dockerstats): add online cpu and memory failcnt metrics open-telemetry/opentelemetry-collector-contrib#31481

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docker_container_cpu metrics does not manage well per_cpu stats (online_cpus) in presence of offline CPUs #2964

docker_container_cpu metrics does not manage well per_cpu stats (online_cpus) in presence of offline CPUs #2964

eesprit commented Jun 26, 2017

eesprit commented Jul 19, 2017

danielnelson commented Jul 19, 2017

eesprit commented Jul 20, 2017

eesprit commented Jul 20, 2017

docker_container_cpu metrics does not manage well per_cpu stats (online_cpus) in presence of offline CPUs #2964

docker_container_cpu metrics does not manage well per_cpu stats (online_cpus) in presence of offline CPUs #2964

Comments

eesprit commented Jun 26, 2017

Bug report

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

eesprit commented Jul 19, 2017

danielnelson commented Jul 19, 2017

eesprit commented Jul 20, 2017

eesprit commented Jul 20, 2017