gpu docs update #1156

Open
wants to merge 4 commits into
base: main

mmurph3
Contributor

@mmurph3 mmurph3 commented Dec 2, 2024

Related Issue

Proposed Changes

@mmurph3 mmurph3 requested a review from a team as a code owner December 2, 2024 20:55
Contributor

@chipzoller chipzoller left a comment

First sentence isn't exactly true. Second one belongs on the Efficiency page.

@thomasvn
Member

thomasvn commented Dec 2, 2024

Thanks @mmurph3 for starting this! This is an effort to address some confusion raised by users in SUP-6416.

Agree with Chip here that the first bullet point isn't necessarily true. It's probably safe to remove it.

The second bullet point is good. How about we modify it slightly to the following?

## Troubleshooting

### Kubecost dashboards not showing GPU Efficiency or GPU Savings

Before Kubecost displays GPU features, it must detect that **at least one** of your clusters has nonzero GPU usage. Verify that DCGM-Exporter is running in each cluster that has GPUs and that Kubecost is scraping nonzero GPU metrics from the exporter.

It may be good to add this "Troubleshooting" section to this doc, as well as to the Efficiency doc: https://docs.kubecost.com/using-kubecost/navigating-the-kubecost-ui/efficiency-dashboard
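As a rough illustration of the "scraping nonzero GPU metrics" check in the proposed Troubleshooting text, here is a minimal Python sketch. It assumes the Prometheus instance that Kubecost reads from is reachable over HTTP and that a DCGM-Exporter metric such as `DCGM_FI_DEV_GPU_UTIL` is the one to inspect. The base URL, the choice of metric, and both function names are assumptions for illustration only; this thread does not specify which metric Kubecost uses for detection.

```python
import json
import urllib.parse
import urllib.request


def has_nonzero_gpu_usage(response: dict) -> bool:
    """Return True if any sample in a Prometheus instant-query response is > 0."""
    if response.get("status") != "success":
        return False
    for series in response.get("data", {}).get("result", []):
        # Instant-vector samples have the form [timestamp, "value-as-string"].
        sample = series.get("value")
        if not sample or len(sample) != 2:
            continue
        try:
            if float(sample[1]) > 0:
                return True
        except (TypeError, ValueError):
            continue
    return False


def query_prometheus(base_url: str, promql: str) -> dict:
    """Run an instant query against the Prometheus HTTP API (/api/v1/query)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Hypothetical usage; the URL below is a placeholder, not a documented endpoint:
# resp = query_prometheus("http://localhost:9090", "DCGM_FI_DEV_GPU_UTIL")
# print(has_nonzero_gpu_usage(resp))
```

An empty result set would suggest the exporter is not being scraped at all, while a result set containing only zeros would suggest the exporter is scraped but no GPU usage has been recorded yet.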

@chipzoller
Contributor

If we're going to create a public Troubleshooting section specific to GPU, we may want to take this opportunity to build it out more completely à la what I have put together here (internal resource).

@mmurph3
Contributor Author

mmurph3 commented Dec 4, 2024

Made some changes before seeing the most recent comments. If we don't agree with what I wrote, I'm OK with changing or moving it. I agree, a built-out troubleshooting doc would be good.

@chipzoller, for some reason I'm getting 403'd on that internal link you gave: https://app.gitbook.com/o/MQuX6uFwV0j7vIHtR15E/s/xLM07kCOoiNtRubOhU77/customer-nvidia-gpu-troubleshooting#no-gpu-column-in-efficiency-page

I can see about getting access through Cliff.

@thomasvn
Member

thomasvn commented Dec 4, 2024

@mmurph3 These are good changes, but are you sure they're enough? I'm concerned these small add-ons may be missed by some users. It's hard to catch a single sentence in a long document.

What do you think about adding a "Troubleshooting" section to both of these docs and filling in a bit more detail about what to do if users are not seeing the GPU Efficiency features?

@chipzoller
Contributor

If you guys don't mind, I'd like to take this over and propose some changes here. It just will have to be next week.

@thomasvn
Member

thomasvn commented Dec 5, 2024

@chipzoller Good with me!

3 participants