This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Close as Dup] display resource utility per vc/queue metrics #2208

Closed
xudifsd opened this issue Feb 26, 2019 · 13 comments

Comments

@xudifsd
Member

xudifsd commented Feb 26, 2019

What would you like to be added:

A graph/list showing resource utilization per VC/queue.

Why is this needed:

Users sometimes find that their job stays in the waiting state. This may be because the VC does not have enough resources, or because resources are fragmented across nodes. A graph would help them debug the issue.

Without this feature, how does the current module work

Users check YARN's page manually.

Components that may involve changes:

  • yarn-exporter
  • grafana
@xudifsd xudifsd added this to the 0.10.0 milestone Feb 26, 2019
@xudifsd xudifsd self-assigned this Feb 26, 2019
@xudifsd xudifsd modified the milestones: 0.10.0, 0.11.0 Feb 26, 2019
@scarlett2018 scarlett2018 changed the title from "display resource util per vc/queue metrics in grafana" to "display resource utility per vc/queue metrics in grafana" Feb 26, 2019
@Anbang-Hu

Is it possible to have metrics showing the current resource fragmentation status (per VC)? We should manage users' expectations properly.

@xudifsd
Member Author

xudifsd commented Feb 28, 2019

I'm not sure how to show resource fragmentation status. Should we show the number of nodes with a given number of free cards? How about memory? Something like how Hadoop shows disk usage?

image

But this is the status of the whole cluster; I don't think there is a per-VC view of this. Am I right? @mzmssg

@fanyangCS
Contributor

Related: #2013

@fanyangCS
Contributor

@xudifsd, a high priority issue is to show how many jobs are in waiting state in a VC.

@scarlett2018
Member

This fits the "Virtual Cluster" page refactoring. Currently, clicking a VC instance name opens the single VC page, which only has generic VC info. We can add this info above the current table.

VC Table
image

Single VC Page
image

@xudifsd
Member Author

xudifsd commented Feb 28, 2019

@fanyangCS yes, yarn-exporter should collect that metric. We will need to display it on some page, along with graphs of the node count by free GPU cards, memory, and CPU, so users can get a sense of why their job is waiting.

@mzmssg
Member

mzmssg commented Mar 1, 2019

@fanyangCS

a high priority issue is to show how many jobs are in waiting state in a VC.

What's the definition of waiting status? Is a job whose AM has launched but which has no job container regarded as RUNNING or WAITING?

@yqwang-ms
Member

Job Waiting: the job is not complete and none of its containers is running.
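
A minimal sketch of this definition, to make the predicate concrete. The `Job` record and its field names are hypothetical, not the actual YARN or yarn-exporter schema. The thread does not settle whether the AM container counts as a running container; this sketch assumes it does not.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions for illustration."""
    name: str
    vc: str
    completed: bool
    # Assumption: counts task containers only, excluding the AM container.
    running_containers: int


def is_waiting(job: Job) -> bool:
    """A job is waiting if it is not complete and has no running container."""
    return not job.completed and job.running_containers == 0


def waiting_jobs_per_vc(jobs):
    """Count waiting jobs for each VC."""
    return Counter(job.vc for job in jobs if is_waiting(job))


# Made-up sample data.
jobs = [
    Job("a", "default", completed=False, running_containers=0),
    Job("b", "default", completed=False, running_containers=2),
    Job("c", "vc1", completed=True, running_containers=0),
    Job("d", "vc1", completed=False, running_containers=0),
]
print(dict(waiting_jobs_per_vc(jobs)))
```

If the AM container should count as running, only the meaning of `running_containers` changes; the predicate stays the same.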

@xudifsd xudifsd changed the title from "display resource utility per vc/queue metrics in grafana" to "display resource utility per vc/queue metrics" Mar 12, 2019
@xudifsd
Member Author

xudifsd commented Mar 12, 2019

image

As discussed in the meeting, I think the picture above captures what users may find useful. It has three tables and one histogram. Meaning of each:

  • jobs pending: pending job count of each VC
  • containers pending: pending container count of each VC
  • free resources: resources available in each VC; this table is already shown on the virtual cluster page
  • free gpu histogram: shows the number of nodes with a given number of free GPUs, so users can tell whether a certain task can fit in the cluster. Taking the picture as an example, a task requiring 8 GPUs cannot be scheduled in the current state, since no node has 8 free GPUs.
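
A rough sketch of how the free GPU histogram answers the "can this task fit" question. The node names and free GPU counts below are made up, and the helper functions are hypothetical; they are not part of yarn-exporter.

```python
from collections import Counter

# Hypothetical snapshot: free GPU count per node (made-up values).
free_gpus_by_node = {
    "node-1": 0,
    "node-2": 2,
    "node-3": 4,
    "node-4": 4,
    "node-5": 1,
}


def free_gpu_histogram(free_by_node):
    """Map each free-GPU count to the number of nodes that have it."""
    return Counter(free_by_node.values())


def can_fit(histogram, gpus_required):
    """True if at least one node has enough free GPUs for a single task."""
    return any(
        free >= gpus_required and count > 0
        for free, count in histogram.items()
    )


hist = free_gpu_histogram(free_gpus_by_node)
for free in sorted(hist):
    print(f"{free} free GPUs: {hist[free]} node(s)")
print("4-GPU task fits:", can_fit(hist, 4))
print("8-GPU task fits:", can_fit(hist, 8))
```

The check only looks at GPUs on a single node at one point in time; memory and CPU would need the same treatment, and the answer changes as running jobs release resources.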

All required metrics have been exported by #2289

@scarlett2018 Maybe the experience team can take over this job?

@fanyangCS
Contributor

#1943

@fanyangCS
Contributor

#1989

@scarlett2018
Member

@xudifsd - sure, but I don't think the experience team has the capacity to do this in 0.11.0.
cc @weixingzhang - does your team want to take this? Or are there any comments on this design?

@scarlett2018 scarlett2018 assigned qfyin and scarlett2018 and unassigned qfyin and scarlett2018 Mar 18, 2019
@scarlett2018 scarlett2018 changed the title from "display resource utility per vc/queue metrics" to "[Close as Dup] display resource utility per vc/queue metrics" Apr 16, 2019
@scarlett2018
Member

Will address this in #2539
