Inconsistent CPU Percentage Calculation (Process vs System) #4468

Closed
PhaedrusTheGreek opened this issue Jun 6, 2017 · 12 comments

@PhaedrusTheGreek
Contributor

Per @andrewkroh:

The process times are collected using GetProcessTimes. The descriptions of the out params say that the times reported are summed across cores (so you can get greater than 100% usage). The code used by Metricbeat is here.

The overall system CPU time is collected using GetSystemTimes. The behavior is similar to GetProcessTimes. The documentation states, "On a multiprocessor system, the values returned are the sum of the designated times across all processors."

[There is] a difference between overall CPU usage and process CPU usage. In the overall CPU usage calculation the total time value is calculated by summing the parts (i.e. idle + kernel + user). In the process CPU calculation the total time is measured using the difference in wall-clock times between samples. Assuming you want 100% to be the max, using wall-clock time causes the percentage to be wrong for multi-core systems and inconsistent with the overall CPU percentage value.

I think we need a change to make percentages be consistent so that they can be compared. We need to decide if we want 100% to be max or if we want 100% * number_of_cores to be the max.
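To make the mismatch concrete, here is a minimal Go sketch of the two calculations described above. The names, fields, and sample numbers are illustrative only, not Metricbeat's actual code; it only assumes that the cumulative times are already summed across cores, as the GetSystemTimes/GetProcessTimes docs state.

    package main

    import "fmt"

    // cpuTimes holds cumulative CPU times sampled twice. On Windows these come
    // from GetSystemTimes / GetProcessTimes and are summed across all cores.
    type cpuTimes struct {
        idle, kernel, user uint64
    }

    // systemCPUPct follows the system-wide calculation: the total is the sum of
    // the parts, so the result stays within [0, 100] regardless of core count.
    func systemCPUPct(prev, cur cpuTimes) float64 {
        active := (cur.kernel + cur.user) - (prev.kernel + prev.user)
        total := active + (cur.idle - prev.idle)
        return 100 * float64(active) / float64(total)
    }

    // processCPUPct follows the process calculation: the total is the wall-clock
    // delta between samples, so on an N-core machine a fully busy process can
    // report up to 100 * N.
    func processCPUPct(prevKernel, prevUser, curKernel, curUser, wallDelta uint64) float64 {
        active := (curKernel + curUser) - (prevKernel + prevUser)
        return 100 * float64(active) / float64(wallDelta)
    }

    func main() {
        // A 4-core box fully busy over a 100-unit interval: the system-wide
        // value is ~100%, while a process using all cores reports ~400%.
        fmt.Println(systemCPUPct(cpuTimes{0, 0, 0}, cpuTimes{0, 200, 200})) // 100
        fmt.Println(processCPUPct(0, 0, 200, 200, 100))                     // 400
    }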

@tsg
Contributor

tsg commented Jun 6, 2017

Thanks @andrewkroh for the great analysis and find. If we make it 100% * number_of_cores that would be more consistent with what we have on Linux, right? I'd vote for that, in that case, so that people can compare the values across systems.

@tsg
Contributor

tsg commented Jun 6, 2017

I changed this to a bug so we tackle it for 6.0 GA.

@andrewkroh
Member

I'd vote for 100% being the max. I think it would be easier to interpret the data because it doesn't require any knowledge of the number of cores (a value that's not reported, AFAIK, unless you correlate it with the cores metricset data). Using 100% as the max would normalize the data so you can compare values across all systems regardless of the core count. The downside is that you lose some sense of the magnitude (2 cores maxed out vs 22 cores maxed out).

Regardless of the final decision, I think it would be useful to include the number of cores as a metric in any CPU-related metricsets.

@tsg
Contributor

tsg commented Jun 8, 2017

After the discussion we had yesterday, my vote would be to support both use cases "natively". Perhaps in the process metricset we could have two metrics:

  • system.process.cpu.total.pct - this one can go over 100%, like in top
  • system.process.cpu.total.normalized.pct - defined as cpu.total.pct / number_of_cores; its max is 100%.

My understanding is that this would be backwards compatible since in the current version system.process.cpu.total.pct can go over 100% on all platforms.
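A rough sketch of the relationship between the two proposed fields, assuming (hypothetically) that the core count comes from runtime.NumCPU() and that pct values are expressed as fractions (2.50 = 250%), as is the usual Beats convention:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // Hypothetical raw value, summed across cores as in top: 250% on a 4-core box.
        totalPct := 2.50

        // Proposed normalized value: divide by the number of cores so the max is 100%.
        normalizedPct := totalPct / float64(runtime.NumCPU())

        fmt.Printf("system.process.cpu.total.pct:            %.4f\n", totalPct)
        fmt.Printf("system.process.cpu.total.normalized.pct: %.4f\n", normalizedPct)
    }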

With the above, the cpu metricset (system-wide) should use the same conventions, but there are two issues:

  • It would be a BWC change for the cpu.total.pct field, at least on Windows (is it on all platforms?)
  • There are multiple CPU times in that metricset data.json. Adding two values for each would cause a significant increase in disk space.

Perhaps we could add the normalized values behind an option, like we do for ticks?

Regardless, I think we should also export system.cpu.number_of_cores as a new metric.

@PhaedrusTheGreek
Contributor Author

Any chance this can be backported to 5.x?

@tsg
Contributor

tsg commented Jun 12, 2017

Hmm, perhaps we can backport the non-BWC bits. Let's first have a concrete PR and we can discuss on it.

@andrewkroh
Member

it would be a BWC change for the cpu.total.pct field, at least on Windows (is it on all platforms?)

Yeah, this would affect all platforms.

@andrewkroh
Member

I just noticed that we use norm in the load metricset for the normalized load values. It would be inconsistent to use normalized. Should we

  1. change load to use normalized,
  2. use norm in cpu, core, and process,
  3. or be inconsistent and not change load?

@andrewkroh
Member

andrewkroh commented Jun 21, 2017

Perhaps we could add the normalized values behind an option, like we do for ticks?

Instead of adding additional include_normalized or normalized.enable options, I propose we let the user specify a list so that they can pick and choose what to include. This would also deprecate the cpu_ticks option.

  load.metrics: [averages, normalized_averages]
  cpu.metrics:  [percentages, normalized_percentages, ticks]
  core.metrics: [percentages, ticks]
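A sketch of how a metricset might honor such a list. The Config struct, field names, and buildEvent helper are hypothetical, not the actual Metricbeat implementation; it only illustrates gating the normalized values behind the proposed cpu.metrics option.

    package main

    import "fmt"

    // Config mirrors the proposed option; the field name is hypothetical.
    type Config struct {
        CPUMetrics []string // e.g. ["percentages", "normalized_percentages", "ticks"]
    }

    func (c Config) has(metric string) bool {
        for _, m := range c.CPUMetrics {
            if m == metric {
                return true
            }
        }
        return false
    }

    // buildEvent adds only the metric groups the user asked for.
    func buildEvent(cfg Config, userPct, systemPct float64, cores int) map[string]interface{} {
        event := map[string]interface{}{}
        if cfg.has("percentages") {
            event["user.pct"] = userPct
            event["system.pct"] = systemPct
        }
        if cfg.has("normalized_percentages") {
            event["user.norm.pct"] = userPct / float64(cores)
            event["system.norm.pct"] = systemPct / float64(cores)
        }
        // "ticks" would add the raw cumulative counters here.
        return event
    }

    func main() {
        cfg := Config{CPUMetrics: []string{"percentages", "normalized_percentages"}}
        fmt.Println(buildEvent(cfg, 1.5, 0.5, 4))
    }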

@tsg
Contributor

tsg commented Jun 22, 2017

@ruflin pointed out that norm is in the guidelines: https://www.elastic.co/guide/en/beats/libbeat/current/event-conventions.html#abbreviations

I'm not a fan of the abbreviation, but I think we should use norm consistently in this case.

andrewkroh added a commit to andrewkroh/beats that referenced this issue Jun 22, 2017
Change all `system.cpu.*.pct` metrics to be scaled by the number of CPU cores so that the values range over `[0, 100% * number_of_cores]`. This makes the CPU usage percentages from the system cpu metricset consistent with the system process metricset. The documentation for these metrics already stated that on multi-core systems the percentages could be greater than 100%, so this makes the code match the docs, but it does cause a change in behavior for the user.

elastic#4468
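A condensed illustration of the scaling that commit describes (not the actual diff): the system-wide ratio, which previously topped out at 1.0, is multiplied by the core count so it ranges over [0, number_of_cores], matching the process metricset. The function name and sample numbers are made up for the example.

    package main

    import "fmt"

    // scaledPct illustrates the change: active/total yields a value in [0, 1];
    // multiplying by the number of cores puts it on [0, numCores], i.e.
    // [0, 100% * number_of_cores] when rendered as a percentage.
    func scaledPct(activeDelta, totalDelta uint64, numCores int) float64 {
        return float64(activeDelta) / float64(totalDelta) * float64(numCores)
    }

    func main() {
        // Two of four cores busy over the interval: previously 0.50, now 2.00.
        fmt.Println(scaledPct(50, 100, 4))
    }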
@tsg
Contributor

tsg commented Jun 30, 2017

I think this can be closed.

@tsg tsg closed this as completed Jun 30, 2017
@andrewkroh
Member

Related PRs:

5.5 - #4544
6.0 (master) - #4550 (backport) #4553 (add normalized values to 6.0)
