Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(pkg/bpf): Replace libbpfgo with cilium/ebpf #1438

Merged
merged 1 commit into from
Jul 9, 2024

Conversation

dave-tucker
Copy link
Collaborator

@dave-tucker dave-tucker commented May 15, 2024

cilium/ebpf is a pure Go eBPF package and is used in a number of popular
cloud-native projects. The benefits to Kepler are:

  1. Bytecode is built using bpf2go and the C and Go structs are kept in
    sync automatically
  2. There is no requirement for Cgo anymore and therefore no requirement
    to have libbpf or libelf installed to compile and/or to be
    dynamically linked at runtime
  3. Simplified packaging as the correct bytecode is contained within the
    kepler binary

Overall I'm happy with this change, but there is only one thing that
bugs me.

We have to check in the bytecode object files (e.g kepler.bpfel.o) or
the Go tooling (go lint/go vet) complains about the missing files.
I couldn't reliably get go generate ./... to work to compile these
files in CI. This is something which should be relatively easy to fix
in the Makefile/CI environment before we cut a release.

Depends-On: #1435
Fixes: #1424

@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 7 times, most recently from 5fa556a to 42303d8 Compare May 15, 2024 17:53
@vimalk78
Copy link
Collaborator

@rootfs @dave-tucker which release is this targeted for?

@dave-tucker
Copy link
Collaborator Author

I don't have a target in mind. But I doubt it will be ready for review until Friday at the earliest.

@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 2 times, most recently from 08d0a91 to d031ddd Compare May 16, 2024 12:56
@dave-tucker dave-tucker marked this pull request as ready for review May 16, 2024 12:56
@dave-tucker dave-tucker changed the title Migrate to cilium/ebpf chore(pkg/bpf): Replace libbpfgo with cilium/ebpf May 16, 2024
@dave-tucker dave-tucker marked this pull request as draft May 16, 2024 14:24
@dave-tucker
Copy link
Collaborator Author

I thought I was close, but I think I found an issue in #1443 that causes failures when perf events can't be opened for some reason. I've cherry-picked that here to see if it makes CI green - it applies to the libbpfgo code to so it's another PR.

Copy link

github-actions bot commented May 20, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Key Modifications:

  1. Dependency updates: The Kepler project has replaced libbpfgo with cilium/ebpf, eliminating Cgo, libbpf, and libelf dependencies.
  2. Bytecode object files: The project now uses bytecode object files, which need to be checked in.
  3. Internal changes: Some internal functions and structures have been removed or modified, which might affect dependent code.

Impact on Codebase:

  • The external interface and behavior of the code remain largely unchanged.
  • The removal of Cgo, libbpf, and libelf dependencies simplifies the project's dependencies.
  • The introduction of bytecode object files may require adjustments to the build process or testing.

Observations and Suggestions:

  • It's essential to review the internal changes and ensure that dependent code is updated accordingly to avoid potential breakages.
  • Consider adding documentation or comments to explain the reasoning behind the dependency updates and internal changes.
  • Verify that the updated .pre-commit-config.yaml file correctly excludes the bpf/include/ directory to prevent unnecessary checks.

Overall, this pull request refactors the Kepler project's dependencies and internal implementation, which may require careful review and testing to ensure a smooth transition.

@dave-tucker dave-tucker marked this pull request as ready for review May 20, 2024 16:12
pkg/bpf/kepler_bpfel.go Outdated Show resolved Hide resolved
packaging/rpm/kepler.spec Outdated Show resolved Hide resolved
.gitignore Show resolved Hide resolved
Copy link
Collaborator

@sthaha sthaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@sthaha
Copy link
Collaborator

sthaha commented May 21, 2024

@dave-tucker we may want to look at this crash before merging.
output of make build


❯ sudo ./_output/bin/kepler

I0522 09:33:18.758775 2437499 gpu.go:38] Trying to initialize GPU collector using dcgm
W0522 09:33:18.758911 2437499 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0522 09:33:18.758976 2437499 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0522 09:33:18.758997 2437499 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0522 09:33:18.759008 2437499 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0522 09:33:18.759111 2437499 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0522 09:33:18.759121 2437499 gpu.go:38] Trying to initialize GPU collector using dummy
I0522 09:33:18.759134 2437499 gpu.go:42] Using dummy to obtain gpu power
I0522 09:33:18.760886 2437499 exporter.go:85] Kepler running on version: v0.7.10-5-gf1097c0e2c498-dirty
I0522 09:33:18.760897 2437499 config.go:283] using gCgroup ID in the BPF program: true
I0522 09:33:18.760963 2437499 config.go:285] kernel version: 6.7
I0522 09:33:18.761045 2437499 config.go:310] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0522 09:33:18.761127 2437499 redfish.go:169] failed to get redfish credential file path
I0522 09:33:18.761632 2437499 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0522 09:33:18.763071 2437499 exporter.go:94] Number of CPUs: 8
I0522 09:33:19.257525 2437499 watcher.go:67] Using in cluster k8s config
I0522 09:33:19.257541 2437499 watcher.go:74] failed to get config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
I0522 09:33:19.257553 2437499 watcher.go:126] k8s APIserver watcher was not enabled
I0522 09:33:19.257642 2437499 prometheus_collector.go:96] Registered Container Prometheus metrics
I0522 09:33:19.257688 2437499 prometheus_collector.go:101] Registered VM Prometheus metrics
I0522 09:33:19.257724 2437499 prometheus_collector.go:105] Registered Node Prometheus metrics
I0522 09:33:19.257826 2437499 node_platform_energy.go:54] Failed to create Regressor/AbsPower Power Model to estimate Node Platform Power: open /var/lib/kepler/data/acpi_AbsPowerModel.json: no such fi
le or directory
I0522 09:33:19.257965 2437499 exporter.go:175] starting to listen on 0.0.0.0:8888
I0522 09:33:19.257980 2437499 exporter.go:181] Started Kepler in 497.120003ms
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x854f72]

goroutine 53 [running]:
github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).AddDeltaStat(0x0, {0x18b8388, 0x7}, 0x0)
        kepler/pkg/collector/stats/types/types.go:108 +0x32
github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.updateSWCounters(0x16a0640?, 0xc000190150, 0x1?)
        kepler/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:40 +0xc5
github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.UpdateProcessBPFMetrics({0x1b40fe8, 0xc000692f00}, 0xc0000736d0?)
        kepler/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:129 +0xa39
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessResourceUtilizationMetrics(0xc002bfe4c0, 0x0?)
        kepler/pkg/collector/metric_collector.go:191 +0x5d
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateResourceUtilizationMetrics(0xc002bfe4c0)
        kepler/pkg/collector/metric_collector.go:155 +0x54
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
        kepler/pkg/collector/metric_collector.go:106 +0x53
github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
        kepler/pkg/manager/manager.go:74 +0x7b
created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
        kepler/pkg/manager/manager.go:66 +0x65

@sthaha
Copy link
Collaborator

sthaha commented May 21, 2024

I get the same using the compose file

❯ cd manifests/compose/dev
❯ docker compose up --build kepler-dev

kepler-dev-1  | I0521 23:32:42.275436 2436610 process_bpf_collector.go:88] process worker (pid=2202720, cgroup=1466240) has 0 task clock time 2344773 CPU cycles, 883246 instructions, 232766 cache miss
es, 68 page cache hits
kepler-dev-1  | panic: runtime error: invalid memory address or nil pointer dereference
kepler-dev-1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8788f2]
kepler-dev-1  |
kepler-dev-1  | goroutine 30 [running]:
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).AddDeltaStat(0x0, {0x18ee6d1, 0x7}, 0x0)
kepler-dev-1  |         /workspace/pkg/collector/stats/types/types.go:108 +0x32
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.updateSWCounters(0x16d05e0?, 0xc0003e01c0, 0x219c60?)
kepler-dev-1  |         /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:40 +0xc5
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.UpdateProcessBPFMetrics({0x1b7aea8, 0xc0005977c0}, 0xc00048aed0?)
kepler-dev-1  |         /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:129 +0xa39
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessResourceUtilizationMetrics(0xc002bcc500, 0x0?)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:191 +0x5d
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateResourceUtilizationMetrics(0xc002bcc500)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:155 +0x54
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:106 +0x53
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
kepler-dev-1  |         /workspace/pkg/manager/manager.go:74 +0x7b
kepler-dev-1  | created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
kepler-dev-1  |         /workspace/pkg/manager/manager.go:66 +0x65
kepler-dev-1 exited with code 2

@dave-tucker
Copy link
Collaborator Author

@sthaha I just force pushed to fix a small bug in the cleanup logic (forgetting to clear the enabledHardwareCounters set).
I know that segfault happens when there is a disagreement between what pkg/bpf declares as supported and what pkg/collector/stats uses to populate the map on the Stats struct, but that should have been taken care of in #1443.

I've tried locally on Fedora 40 with both running the binary and compose and can't reproduce. I've also tried on RHEL 9.4 and RHEL 9.2 and still can't reproduce. I'm also not seeing anything in the logs stating that any of the eBPF probes couldn't be loaded 🤔

TL;DR I'm a bit stuck on fixing this as I can't reproduce the issue.

@sthaha
Copy link
Collaborator

sthaha commented May 23, 2024

I tried with the latest, commit 6d8539b and runs fine.

@dave-tucker
Copy link
Collaborator Author

@dave-tucker can you sync this PR with main?

Done!

@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 3 times, most recently from d64c96f to 1aac060 Compare June 14, 2024 13:24
@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 3 times, most recently from 5fa77d5 to 138a422 Compare June 17, 2024 14:19
@sthaha
Copy link
Collaborator

sthaha commented Jun 18, 2024

LGTM

@rootfs
Copy link
Contributor

rootfs commented Jul 2, 2024

Per 07/02 meeting, this PR has many usability improvements. @SamYuan1990 will check with s390 team for last check.

cilium/ebpf is a pure Go eBPF package and is used in a number of popular
cloud-native projects. The benefits to Kepler are:

1. Bytecode is built using bpf2go and the C and Go structs are kept in
   sync automatically
2. There is no requirement for Cgo anymore and therefore no requirement
   to have libbpf or libelf installed to compile and/or to be
   dynamically linked at runtime
3. Simplified packaging as the correct bytecode is contained within the
   kepler binary

Overall I'm happy with this change, but there is only one thing that
bugs me.

We have to check in the bytecode object files (e.g kepler.bpfel.o) or
the Go tooling (go lint/go vet) complains about the missing files.
I couldn't reliably get `go generate ./...` to work to compile these
files in CI. This is something which should be relatively easy to fix
in the Makefile/CI environment before we cut a release.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
@dave-tucker
Copy link
Collaborator Author

@SamYuan1990 will check with s390 team for last check.

Hi @SamYuan1990, any update? We'd like to get this merged before we cut the 0.7.11 release later this week.

@sthaha
Copy link
Collaborator

sthaha commented Jul 8, 2024

I tested this one last time on my machine using the dev compose and pretty happy with the result.

image

@dave-tucker dave-tucker added this to the v0.7.11 milestone Jul 9, 2024
@sthaha
Copy link
Collaborator

sthaha commented Jul 9, 2024

@SamYuan1990 I am going to merge this in the interest of making the release tomorrow and to avoid rebases.
I have tested this on on x86_64 worked well. Please let us know if you find any issue and we can revert the commit

@sthaha sthaha merged commit d3c2906 into sustainable-computing-io:main Jul 9, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace libbpfgo with cilium/ebpf
5 participants