[receiver/hostmetrics] The value of process.cpu.utilization may exceed 1 #31368

Closed
BinaryFissionGames opened this issue Feb 21, 2024 · 2 comments · Fixed by #31378
Labels: bug, receiver/hostmetrics

Comments

@BinaryFissionGames
Contributor

Component(s)

receiver/hostmetrics

What happened?

Description

The process.cpu.utilization metric is expected to be a value between 0 and 1, but it can be greater than 1.

My guess is that the process.cpu.utilization calculation does not divide by the number of logical cores, so the value can actually range from 0 to ${num_cores}. A minimal sketch of the normalization I'd expect is below.
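To make that concrete, here is a minimal, hypothetical sketch, not the receiver's actual code; the function name `cpuUtilization` and its signature are made up for illustration. The idea is to divide the CPU time consumed during the scrape interval by the elapsed wall-clock time multiplied by the number of logical cores:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// cpuUtilization is a sketch (not the hostmetrics receiver's implementation)
// of normalizing per-process CPU utilization into [0, 1]: CPU time consumed
// during the scrape interval divided by elapsed wall-clock time multiplied
// by the number of logical cores.
func cpuUtilization(cpuTimeDelta, elapsed time.Duration, logicalCores int) float64 {
	if elapsed <= 0 || logicalCores <= 0 {
		return 0
	}
	return cpuTimeDelta.Seconds() / (elapsed.Seconds() * float64(logicalCores))
}

func main() {
	// Example: a process used 60s of CPU time over a 6s scrape interval.
	// Without dividing by the core count this would read as 10 (1000%);
	// with the division it stays within [0, 1] on any machine with >= 10 cores.
	fmt.Println(cpuUtilization(60*time.Second, 6*time.Second, runtime.NumCPU()))
}
```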

Steps to Reproduce

  1. Run a process with heavy load on a multi-core machine
  2. Scrape with the hostmetrics receiver
  3. Note that process.cpu.utilization may exceed 1

Expected Result

No metrics exceed 1 (in fact, I'd expect the sum of all processes to be < 1)

Actual Result

I'm getting a value of ~9.5 for this metric (the process was using ~950% CPU in Activity Monitor):

{
  "resource": {
    "attributes": [
      { "key": "process.pid", "value": { "intValue": "89633" } },
      { "key": "process.parent_pid", "value": { "intValue": "89617" } },
      {
        "key": "process.executable.name",
        "value": { "stringValue": "load" }
      },
      {
        "key": "process.executable.path",
        "value": {
          "stringValue": "/var/folders/qn/13392c0n4rl4j4cplc65_66h0000gn/T/go-build2476508403/b001/exe/load"
        }
      },
      {
        "key": "process.command",
        "value": {
          "stringValue": "/var/folders/qn/13392c0n4rl4j4cplc65_66h0000gn/T/go-build2476508403/b001/exe/load"
        }
      },
      {
        "key": "process.command_line",
        "value": {
          "stringValue": "/var/folders/qn/13392c0n4rl4j4cplc65_66h0000gn/T/go-build2476508403/b001/exe/load"
        }
      },
      {
        "key": "process.owner",
        "value": { "stringValue": "brandonjohnson" }
      },
      { "key": "host.name", "value": { "stringValue": "Brandons-MBP" } },
      { "key": "os.type", "value": { "stringValue": "darwin" } }
    ]
  },
  "scopeMetrics": [
    {
      "scope": {
        "name": "otelcol/hostmetricsreceiver/process",
        "version": "v1.44.0"
      },
      "metrics": [
        {
          "name": "process.cpu.utilization",
          "description": "Percentage of total CPU time used by the process since last scrape, expressed as a value between 0 and 1. On the first scrape, no data point is emitted for this metric.",
          "unit": "1",
          "gauge": {
            "dataPoints": [
              {
                "attributes": [
                  { "key": "state", "value": { "stringValue": "user" } }
                ],
                "startTimeUnixNano": "1708529361162000000",
                "timeUnixNano": "1708529423825350000",
                "asDouble": 9.532629437655167
              },
              {
                "attributes": [
                  { "key": "state", "value": { "stringValue": "system" } }
                ],
                "startTimeUnixNano": "1708529361162000000",
                "timeUnixNano": "1708529423825350000",
                "asDouble": 0.07040537266051189
              },
              {
                "attributes": [
                  { "key": "state", "value": { "stringValue": "wait" } }
                ],
                "startTimeUnixNano": "1708529361162000000",
                "timeUnixNano": "1708529423825350000",
                "asDouble": 0
              }
            ]
          }
        }
      ]
    }
  ],
  "schemaUrl": "https://opentelemetry.io/schemas/1.9.0"
}

Collector version

v0.94.0

Environment information

Environment

OS: macOS 14.2.1
Compiler(if manually compiled): go 1.22

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

I used this quick Go program to generate load:

package main

import "time"

func main() {
	// Spin up 100 busy-looping goroutines to saturate all available cores.
	for i := 0; i < 100; i++ {
		go func() {
			v := 0
			for {
				v++
			}
		}()
	}

	// Keep the process alive long enough to be scraped a few times.
	time.Sleep(3 * time.Minute)
}
BinaryFissionGames added the bug and needs triage labels on Feb 21, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

Removing needs triage based on comments in PR.

crobert-1 removed the needs triage label on Feb 26, 2024
djaglowski added a commit that referenced this issue Mar 12, 2024
…ss.cpu.utilization (#31378)

**Description:**
When calculating the process.cpu.utilization metric, values over 1 were
possible because the number of cores was not taken into account (a single
process may run on multiple logical cores, effectively multiplying the
maximum amount of CPU time the process can consume).

This PR adds a division by the number of logical cores to the
calculation for cpu utilization.

**Link to tracking Issue:** Closes #31368

**Testing:**
* Added some unit tests
* Tested locally on my system with the program I posted in the issue:

```json
{
  "name": "process.cpu.utilization",
  "description": "Percentage of total CPU time used by the process since last scrape, expressed as a value between 0 and 1. On the first scrape, no data point is emitted for this metric.",
  "unit": "1",
  "gauge": {
    "dataPoints": [
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "user" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.8811268516953904
      },
      {
        "attributes": [
          { "key": "state", "value": { "stringValue": "system" } }
        ],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0.0029471002907659667
      },
      {
        "attributes": [{ "key": "state", "value": { "stringValue": "wait" } }],
        "startTimeUnixNano": "1708562810521000000",
        "timeUnixNano": "1708562890771386000",
        "asDouble": 0
      }
    ]
  }
}
```

In Activity Monitor, this process was clocking in at roughly 1000%-1100%
CPU on my machine, which has 12 logical cores, so the reported value of
around 0.9 (90%) total utilization looks correct here.

**Documentation:**
N/A

---------

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
…ss.cpu.utilization (open-telemetry#31378)

XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
…ss.cpu.utilization (open-telemetry#31378)

andrzej-stencel added a commit that referenced this issue May 6, 2024
…alizeProcessCPUUtilization` (#32502)

**Description:**

Switches the `receiver.hostmetrics.normalizeProcessCPUUtilization`
feature gate to Beta, making it enabled by default.
This is according to the schedule described in the
[docs](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.98.0/receiver/hostmetricsreceiver/README.md#feature-gates).

**Link to tracking Issue:**

- #31368

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>