
[receiver/hostmetricsreceiver] Gopsutil error on windows with multiple processor groups. #33340

Open
alvarocabanas opened this issue Jun 3, 2024 · 5 comments


alvarocabanas commented Jun 3, 2024

Component(s)

receiver/hostmetricsreceiver

What happened?

Description

On a Windows instance with 48 * 2 logical CPUs, Windows groups the CPUs into processor groups of up to 64 logical CPUs each. Gopsutil's cpu.TimesWithContext, which is called by the cpuscraper and used by the resource detector to calculate CPU times and utilization, returns data from only one of the two processor groups (chosen indistinctly) rather than from all of them.

This produces erratic CPU utilization values: sometimes negative (presumably because consecutive samples can come from different processor groups, so the cumulative time counters appear to go backwards) and sometimes showing full usage even though only half of the cores are busy.

This bug has also been reported against the gopsutil library.
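For reference, a minimal diagnostic sketch of the gopsutil calls involved (assuming gopsutil v3; this is a standalone check, not the cpuscraper's actual code). On an affected multi-group machine, the per-CPU slice returned by TimesWithContext may only cover a single processor group:

package main

import (
	"context"
	"fmt"

	"github.com/shirou/gopsutil/v3/cpu"
)

func main() {
	ctx := context.Background()

	// Logical CPU count according to gopsutil; on a 48*2 machine we expect 96.
	logical, err := cpu.CountsWithContext(ctx, true)
	if err != nil {
		panic(err)
	}

	// Per-CPU times, the same call used by the cpuscraper. On an affected host
	// this slice may contain entries for only one processor group (up to 64
	// CPUs), and which group is reported can vary between calls.
	times, err := cpu.TimesWithContext(ctx, true)
	if err != nil {
		panic(err)
	}

	fmt.Printf("logical CPUs: %d, per-CPU time entries: %d\n", logical, len(times))
}

If the two numbers disagree (e.g. 96 logical CPUs but only 64 per-CPU entries), the host is likely affected.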

Steps to Reproduce

In our case we reproduced it on an 'm5n.metal' instance in AWS running Windows Server 2022, but there are reports of it happening on other Windows machines with more than one processor group.

Expected Result

Correct CPU usage and times.

Actual Result

CPU data points coming randomly from only one of the two processor groups.

Collector version

v0.101.0

Environment information

Environment

OS: Windows Server 2022

OpenTelemetry Collector configuration

receivers:
  hostmetrics:
    collection_interval: 20s
    scrapers:
      cpu:
        metrics:
          system.cpu.time:
            enabled: true
          system.cpu.utilization:
            enabled: true
processors:
  # group system.cpu metrics by cpu
  metricstransform:
    transforms:
      - include: system.cpu.utilization
        action: update
        operations:
          - action: aggregate_labels
            label_set: [ state ]
            aggregation_type: mean
[ ... ]

Log output

No response

Additional context

No response

@alvarocabanas alvarocabanas added bug Something isn't working needs triage New item requiring triage labels Jun 3, 2024

github-actions bot commented Jun 4, 2024

Pinging code owners for receiver/hostmetrics: @dmitryax @braydonk. See Adding Labels via Comments if you do not have permissions to add labels yourself.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


atoulme commented Oct 11, 2024

Unfortunately, the fix seems to reside in gopsutil.

@atoulme atoulme removed the needs triage New item requiring triage label Oct 11, 2024
@github-actions github-actions bot removed the Stale label Oct 12, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Dec 12, 2024
@rogercoll

I agree that the solution involves fixing this in the gopsutil package. Meanwhile, it might be worth checking what value the metric reports when collected with the Windows Perf Counters receiver instead: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/windowsperfcountersreceiver#configuration
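For example, a possible configuration sketch for that cross-check (the metric name is arbitrary, and the "Processor Information" object and "% Processor Time" counter are assumptions that should be verified against the counters actually exposed on the affected host):

receivers:
  windowsperfcounters:
    collection_interval: 20s
    metrics:
      cpu.time.percent:
        description: CPU busy time as reported by Windows performance counters
        unit: "%"
        gauge:
    perfcounters:
      - object: "Processor Information"
        instances: ["*"]
        counters:
          - name: "% Processor Time"
            metric: cpu.time.percent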

@github-actions github-actions bot removed the Stale label Dec 13, 2024