add missing dcgm metrics #710

Merged
merged 1 commit into from
Jun 18, 2024


annapendleton
Collaborator

Small change to add a few missing DCGM metrics

@annapendleton
Collaborator Author

/gcbrun

@kfswain
Collaborator

kfswain commented Jun 18, 2024

Is this just to make sure any infra spun up by the terraform has these metrics, so it can get picked up by the prom scraper?
Do we want to extend the runner metric capture to include this also?

Collaborator

@kfswain kfswain left a comment

LGTM, just small question WRT metric usage

@annapendleton
Collaborator Author

> Is this just to make sure any infra spun up by the terraform has these metrics, so it can get picked up by the prom scraper? Do we want to extend the runner metric capture to include this also?

Yes, this PR is mainly scoped to that: these metrics aren't currently being scraped for any infra run, and we want them to be.

IIRC the runner currently captures only GPU utilization, whereas the DCGM exporter layer captures all of the related DCGM metrics.

As for what we should include in the runner: not all of these metrics are immediately useful for analysis. I think it's a great idea to add the useful ones, e.g. memory usage and power usage, which have come up as two important ones in our recent autoscaling discussions. I'm thinking it's a good idea to add those in a follow-up PR :)
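As a rough illustration of what extending the runner's metric capture could look like, here is a minimal sketch that builds Prometheus instant-query URLs for GPU utilization, memory usage, and power usage. Everything in it is an assumption rather than the runner's actual code: the metric names are the commonly used dcgm-exporter field names (this PR does not list which metrics it adds), and `RUNNER_METRICS`, `build_query_url`, `build_all_query_urls`, and the 5-minute window are hypothetical names and defaults chosen for the example.

```python
from urllib.parse import urlencode

# Assumed metric names: common dcgm-exporter field names, not confirmed by this PR.
RUNNER_METRICS = {
    "gpu_utilization": "DCGM_FI_DEV_GPU_UTIL",
    "memory_used": "DCGM_FI_DEV_FB_USED",
    "power_usage": "DCGM_FI_DEV_POWER_USAGE",
}


def build_query_url(prometheus_url: str, metric: str, window: str = "5m") -> str:
    """Build a Prometheus instant-query URL averaging `metric` over `window`."""
    # Average over time per series, then across all series (GPUs).
    promql = f"avg(avg_over_time({metric}[{window}]))"
    base = prometheus_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def build_all_query_urls(prometheus_url: str, window: str = "5m") -> dict:
    """Return one query URL per metric the runner would capture (hypothetical helper)."""
    return {
        name: build_query_url(prometheus_url, metric, window)
        for name, metric in RUNNER_METRICS.items()
    }
```

Keeping the metric list in a single mapping means a follow-up change would only need to add or drop entries there; the sketch only builds URLs, so fetching and parsing the responses is left out.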

@annapendleton annapendleton merged commit 01630c6 into GoogleCloudPlatform:main Jun 18, 2024
5 checks passed
PBundyra pushed a commit to PBundyra/ai-on-gke that referenced this pull request Jun 21, 2024
@annapendleton annapendleton deleted the pwrusg branch August 14, 2024 20:18
2 participants