[receiver/hostmetrics] The value of process.cpu.utilization may exceed 1 #31368

BinaryFissionGames · 2024-02-21T15:45:14Z

Component(s)

receiver/hostmetrics

What happened?

Description

The process.cpu.utilization metric is expected to be a value between 0 and 1, but it can be greater than 1.

My guess is that process.cpu.utilization is not divided by the number of system cores, so the value can actually range from 0 to ${num_cores}.

Steps to Reproduce

1. Run a process with heavy load on a multi-core machine
2. Scrape with the hostmetrics receiver
3. Note that process.cpu.utilization values may exceed 1

Expected Result

No process.cpu.utilization value exceeds 1 (in fact, I'd expect the sum across all processes to be < 1)

Actual Result

I'm getting a value of 9.5 for the user state (this process was using ~950% CPU in Activity Monitor):

{
  "resource": {
    "attributes": [
      { "key": "process.pid", "value": { "intValue": "89633" } },
      { "key": "process.parent_pid", "value": { "intValue": "89617" } },
      {
        "key": "process.executable.name",
        "value": { "stringValue": "load" }
      },
      {
        "key": "process.executable.path",
        "value": {
          "stringValue": "/var/folders/qn/13392c0n4rl4j4cplc65_66h0000gn/T/go-build2476508403/b001/exe/load"
        }
      },
      {
        "key": "process.command",
        "value": {
          "stringValue": "/var/folders/qn/13392c0n4rl4j4cplc65_66h0000gn/T/go-build2476508403/b001/exe/load"
        }
      },
      {
        "key": "process.command_line",
        "value": {
          "stringValue": "/var/folders/qn/13392c0n4rl4j4cplc65_66h0000gn/T/go-build2476508403/b001/exe/load"
        }
      },
      {
        "key": "process.owner",
        "value": { "stringValue": "brandonjohnson" }
      },
      { "key": "host.name", "value": { "stringValue": "Brandons-MBP" } },
      { "key": "os.type", "value": { "stringValue": "darwin" } }
    ]
  },
  "scopeMetrics": [
    {
      "scope": {
        "name": "otelcol/hostmetricsreceiver/process",
        "version": "v1.44.0"
      },
      "metrics": [
        {
          "name": "process.cpu.utilization",
          "description": "Percentage of total CPU time used by the process since last scrape, expressed as a value between 0 and 1. On the first scrape, no data point is emitted for this metric.",
          "unit": "1",
          "gauge": {
            "dataPoints": [
              {
                "attributes": [
                  { "key": "state", "value": { "stringValue": "user" } }
                ],
                "startTimeUnixNano": "1708529361162000000",
                "timeUnixNano": "1708529423825350000",
                "asDouble": 9.532629437655167
              },
              {
                "attributes": [
                  { "key": "state", "value": { "stringValue": "system" } }
                ],
                "startTimeUnixNano": "1708529361162000000",
                "timeUnixNano": "1708529423825350000",
                "asDouble": 0.07040537266051189
              },
              {
                "attributes": [
                  { "key": "state", "value": { "stringValue": "wait" } }
                ],
                "startTimeUnixNano": "1708529361162000000",
                "timeUnixNano": "1708529423825350000",
                "asDouble": 0
              }
            ]
          }
        }
      ]
    }
  ],
  "schemaUrl": "https://opentelemetry.io/schemas/1.9.0"
}

Collector version

v0.94.0

Environment information

Environment

OS: macOS 14.2.1
Compiler(if manually compiled): go 1.22

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

I used this quick Go program to generate load:

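package main

import "time"

func main() {
	// Start 100 goroutines that each spin in a busy loop, so the Go
	// scheduler keeps every available core occupied.
	for i := 0; i < 100; i++ {
		go func() {
			v := 0
			for {
				v++
			}
		}()
	}

	// Keep the process alive long enough for several scrapes.
	time.Sleep(3 * time.Minute)
}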

github-actions · 2024-02-21T15:51:44Z

Pinging code owners:

receiver/hostmetrics: @dmitryax @braydonk

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 · 2024-02-26T20:41:10Z

Removing needs triage based on comments in PR.

…ss.cpu.utilization (#31378)

**Description:**
When calculating the process.cpu.utilization metric, values over 1 were possible since the number of cores was not taken into account (a single process may run on multiple logical cores, effectively multiplying the maximum amount of CPU time the process may take). This PR adds a division by the number of logical cores to the calculation for CPU utilization.

**Link to tracking Issue:** Closes #31368

**Testing:**
* Added some unit tests
* Tested locally on my system with the program I posted in the issue:

```json
{
  "name": "process.cpu.utilization",
  "description": "Percentage of total CPU time used by the process since last scrape, expressed as a value between 0 and 1. On the first scrape, no data point is emitted for this metric.",
  "unit": "1",
  "gauge": {
    "dataPoints": [
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "user" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.8811268516953904
      },
      {
        "attributes": [
          { "key": "state", "value": { "stringValue": "system" } }
        ],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.0029471002907659667
      },
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "wait" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0
      }
    ]
  }
}
```

In Activity Monitor, this process was clocking in at around 1000% to 1100% CPU on my machine, which has 12 logical cores, so the value of around 90% total utilization seems correct here.

**Documentation:** N/A

---------

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>

…alizeProcessCPUUtilization` (#32502)

**Description:**
Switches the `receiver.hostmetrics.normalizeProcessCPUUtilization` feature gate to Beta, making it enabled by default. This follows the schedule described in the [docs](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.98.0/receiver/hostmetricsreceiver/README.md#feature-gates).

**Link to tracking Issue:**
- #31368

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>

BinaryFissionGames added the bug and needs triage labels Feb 21, 2024

github-actions bot added the receiver/hostmetrics label Feb 21, 2024

BinaryFissionGames mentioned this issue Feb 22, 2024

[receiver/hostmetrics] Divide by logical cores when calculating process.cpu.utilization #31378

Merged

crobert-1 removed the needs triage label Feb 26, 2024

This was referenced Feb 27, 2024

Weekly Report: 2024-02-20 - 2024-02-27 #31422

Closed

Weekly Report: 2024-02-20 - 2024-02-27 asuresh4/opentelemetry-collector-contrib#11542

Open

djaglowski closed this as completed in #31378 Mar 12, 2024

andrzej-stencel mentioned this issue Apr 18, 2024

[receiver/hostmetrics] enable feature gate receiver.hostmetrics.normalizeProcessCPUUtilization #32502

Merged

github-actions bot mentioned this issue Jul 1, 2024

Link Checker Report signalfx/splunk-otel-collector#5039

Closed
