Inconsistent CPU Percentage Calculation (Process vs System) #4468

Closed
PhaedrusTheGreek opened this issue Jun 6, 2017 · 12 comments

@PhaedrusTheGreek
Contributor

Per @andrewkroh:

The process times are collected using GetProcessTimes. The descriptions of the out params say that the times reported are summed across cores (so you can get greater than 100% usage). The code used by Metricbeat is here.

The overall system CPU time is collected using GetSystemTimes. The behavior is similar to GetProcessTimes. The documentation states, "On a multiprocessor system, the values returned are the sum of the designated times across all processors."

[There is] a difference between overall CPU usage and process CPU usage. In the overall CPU usage calculation the total time value is calculated by summing the parts (i.e. idle + kernel + user). In the process CPU calculation the total time is measured using the difference in wall-clock times between samples. Assuming you want 100% to be the max, using wall-clock time causes the percentage to be wrong for multi-core systems and inconsistent with the overall CPU percentage value.

I think we need a change to make percentages be consistent so that they can be compared. We need to decide if we want 100% to be max or if we want 100% * number_of_cores to be the max.
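To make the mismatch concrete, here is a minimal Go sketch of the two calculations described above. The names, fields, and sample numbers are illustrative only, not Metricbeat's actual code; it only assumes that the cumulative times are already summed across cores, as the GetSystemTimes/GetProcessTimes docs state.

    package main

    import "fmt"

    // cpuTimes holds cumulative CPU times sampled twice. On Windows these come
    // from GetSystemTimes / GetProcessTimes and are summed across all cores.
    type cpuTimes struct {
        idle, kernel, user uint64
    }

    // systemCPUPct follows the system-wide calculation: the total is the sum of
    // the parts, so the result stays within [0, 100] regardless of core count.
    func systemCPUPct(prev, cur cpuTimes) float64 {
        active := (cur.kernel + cur.user) - (prev.kernel + prev.user)
        total := active + (cur.idle - prev.idle)
        return 100 * float64(active) / float64(total)
    }

    // processCPUPct follows the process calculation: the total is the wall-clock
    // delta between samples, so on an N-core machine a fully busy process can
    // report up to 100 * N.
    func processCPUPct(prevKernel, prevUser, curKernel, curUser, wallDelta uint64) float64 {
        active := (curKernel + curUser) - (prevKernel + prevUser)
        return 100 * float64(active) / float64(wallDelta)
    }

    func main() {
        // A 4-core box fully busy over a 100-unit interval: the system-wide
        // value is ~100%, while a process using all cores reports ~400%.
        fmt.Println(systemCPUPct(cpuTimes{0, 0, 0}, cpuTimes{0, 200, 200})) // 100
        fmt.Println(processCPUPct(0, 0, 200, 200, 100))                     // 400
    }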

@tsg
Contributor

tsg commented Jun 6, 2017

Thanks @andrewkroh for the great analysis and find. If we make it 100% * number_of_cores that would be more consistent with what we have on Linux, right? I'd vote for that, in that case, so that people can compare the values across systems.

@tsg
Contributor

tsg commented Jun 6, 2017

I changed this to a bug so we tackle it for 6.0 GA.

@andrewkroh
Member

I'd vote for 100% being the max. I think it would be easier to interpret the data because it doesn't require any knowledge of the number of cores (a value that's not reported, AFAIK, unless you correlate it with the cores metricset data). Using 100% as the max would normalize the data so you can compare values across all systems regardless of the core count. The downside is that you lose some sense of the magnitude (2 cores maxed out vs 22 cores maxed out).

Regardless of the final decision, I think it would be useful to include the number of cores as a metric in any CPU-related metricsets.

@tsg
Contributor

tsg commented Jun 8, 2017

After the discussion we had yesterday, my vote would be to support both use cases "natively". Perhaps in the process metricset we could have two metrics:

  • system.process.cpu.total.pct - this one can go over 100%, like in top
  • system.process.cpu.total.normalized.pct - defined as cpu.total.pct / number_of_cores; its max is 100%.

My understanding is that this would be backwards compatible since in the current version system.process.cpu.total.pct can go over 100% on all platforms.
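A rough sketch of the relationship between the two proposed fields, assuming (hypothetically) that the core count comes from runtime.NumCPU() and that pct values are expressed as fractions (2.50 = 250%), as is the usual Beats convention:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // Hypothetical raw value, summed across cores as in top: 250% on a 4-core box.
        totalPct := 2.50

        // Proposed normalized value: divide by the number of cores so the max is 100%.
        normalizedPct := totalPct / float64(runtime.NumCPU())

        fmt.Printf("system.process.cpu.total.pct:            %.4f\n", totalPct)
        fmt.Printf("system.process.cpu.total.normalized.pct: %.4f\n", normalizedPct)
    }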

With the above, the cpu metricset (system-wide) should use the same conventions, but there are two issues:

  • It would be a BWC change for the cpu.total.pct field, at least on Windows (is it on all platforms?)
  • There are multiple CPU times in that metricset data.json. Adding two values for each would cause a significant increase in disk space.

Perhaps we could add the normalized values behind an option, like we do for ticks?

Regardless, I think we should also export system.cpu.number_of_cores as a new metric.

@PhaedrusTheGreek
Contributor Author

Any chance this can be backported to 5.x?

@tsg
Contributor

tsg commented Jun 12, 2017

Hmm, perhaps we can backport the non-BWC bits. Let's first have a concrete PR and we can discuss on it.

@andrewkroh
Member

it would be a BWC change for the cpu.total.pct field, at least on Windows (is it on all platforms?)

Yeah, this would affect all platforms.

@andrewkroh
Member

I just noticed that we use norm in the load metricset for the normalized load values. It would be inconsistent to use normalized. Should we

  1. change load to use normalized,
  2. use norm in cpu, core, and process,
  3. or be inconsistent and not change load?

@andrewkroh
Member

andrewkroh commented Jun 21, 2017

Perhaps we could add the normalized values behind an option, like we do for ticks?

Instead of adding additional include_normalized or normalized.enable options, I propose we let the user specify a list so that they can pick and choose what to include. This would also deprecate the cpu_ticks option.

  load.metrics: [averages, normalized_averages]
  cpu.metrics:  [percentages, normalized_percentages, ticks]
  core.metrics: [percentages, ticks]
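A sketch of how a metricset might honor such a list. The Config struct, field names, and buildEvent helper are hypothetical, not the actual Metricbeat implementation; it only illustrates gating the normalized values behind the proposed cpu.metrics option.

    package main

    import "fmt"

    // Config mirrors the proposed option; the field name is hypothetical.
    type Config struct {
        CPUMetrics []string // e.g. ["percentages", "normalized_percentages", "ticks"]
    }

    func (c Config) has(metric string) bool {
        for _, m := range c.CPUMetrics {
            if m == metric {
                return true
            }
        }
        return false
    }

    // buildEvent adds only the metric groups the user asked for.
    func buildEvent(cfg Config, userPct, systemPct float64, cores int) map[string]interface{} {
        event := map[string]interface{}{}
        if cfg.has("percentages") {
            event["user.pct"] = userPct
            event["system.pct"] = systemPct
        }
        if cfg.has("normalized_percentages") {
            event["user.norm.pct"] = userPct / float64(cores)
            event["system.norm.pct"] = systemPct / float64(cores)
        }
        // "ticks" would add the raw cumulative counters here.
        return event
    }

    func main() {
        cfg := Config{CPUMetrics: []string{"percentages", "normalized_percentages"}}
        fmt.Println(buildEvent(cfg, 1.5, 0.5, 4))
    }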

@tsg
Contributor

tsg commented Jun 22, 2017

@ruflin pointed out that norm is in the guidelines: https://www.elastic.co/guide/en/beats/libbeat/current/event-conventions.html#abbreviations

I'm not a fan of the abbreviation, but I think we should use norm consistently in this case.

andrewkroh added a commit to andrewkroh/beats that referenced this issue Jun 22, 2017
Change all `system.cpu.*.pct` metrics to be scaled by the number of CPU cores so that the values range over `[0, 100% * number_of_cores]`. This makes the CPU usage percentages from the system cpu metricset consistent with the system process metricset. The documentation for these metrics already stated that on multi-core systems the percentages could be greater than 100%, so this makes the code match the docs, but it does cause a change in behavior for the user.

elastic#4468
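A condensed illustration of the scaling that commit describes (not the actual diff): the system-wide ratio, which previously topped out at 1.0, is multiplied by the core count so it ranges over [0, number_of_cores], matching the process metricset. The function name and sample numbers are made up for the example.

    package main

    import "fmt"

    // scaledPct illustrates the change: active/total yields a value in [0, 1];
    // multiplying by the number of cores puts it on [0, numCores], i.e.
    // [0, 100% * number_of_cores] when rendered as a percentage.
    func scaledPct(activeDelta, totalDelta uint64, numCores int) float64 {
        return float64(activeDelta) / float64(totalDelta) * float64(numCores)
    }

    func main() {
        // Two of four cores busy over the interval: previously 0.50, now 2.00.
        fmt.Println(scaledPct(50, 100, 4))
    }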
@tsg
Contributor

tsg commented Jun 30, 2017

I think this can be closed.

@tsg tsg closed this as completed Jun 30, 2017
@andrewkroh
Member

Related PRs:

5.5 - #4544
6.0 (master) - #4550 (backport) #4553 (add normalized values to 6.0)
