/cc @jiaxuanzhou, please also share your comments here :)
/approve We need this doc for performance metrics; will add lgtm label when it's ready :)
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Jeffwan, k82cn The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
how about adding the metrics : |
I'm ok with them.
If we want to log this info, we need to check whether jobs are ready in |
doc/design/metrics.md
| e2e_scheduling_latency | histogram | | E2e scheduling latency in seconds |
| plugin_predicate_evaluation | histogram | | Schedule latency for predicate plugin |
| plugin_proportion_evaluation | histogram | | Schedule latency for proportion plugin |
| plugin_drf_evaluation | histogram | | Schedule latency for drf plugin |
Hmm... should we have a suggestion/guidance for new plugins, e.g. `plugin_<plugin name>_evaluation`? Then if we add a new plugin, we do not need to change other logic.
Makes sense. Using `<plugin_name>` as a label would be better.
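To illustrate the label suggestion, here is a minimal Go sketch using only the standard library. It is not kube-batch code and not a metrics-library API: the type `pluginLatency`, its methods, and the metric name `plugin_evaluation_latency_seconds` are all hypothetical. It only shows the idea that a single metric keyed by a plugin label accepts new plugins without defining a new metric per plugin.

```go
package main

import "sync"

// pluginLatency is a hypothetical stand-in for a labeled histogram:
// one metric name, with observations keyed by the value of a "plugin" label.
type pluginLatency struct {
	mu      sync.Mutex
	name    string
	samples map[string][]float64 // plugin label value -> latencies in seconds
}

// newPluginLatency creates the single shared metric.
func newPluginLatency(name string) *pluginLatency {
	return &pluginLatency{name: name, samples: make(map[string][]float64)}
}

// Observe records one evaluation latency (in seconds) for the given plugin.
// A plugin that has never been seen before needs no registration step.
func (p *pluginLatency) Observe(plugin string, seconds float64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.samples[plugin] = append(p.samples[plugin], seconds)
}

// Count returns how many observations were recorded for a plugin.
func (p *pluginLatency) Count(plugin string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.samples[plugin])
}
```

With this shape, a caller would write something like `m.Observe("drf", 0.002)` for an existing plugin and `m.Observe("my_new_plugin", 0.001)` for a new one, with no other changes. A real implementation would presumably use a metrics library's labeled histogram instead of this map.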
doc/design/metrics.md
| PodScheduleSuccesses | Counter | | The number of kube-batch success in scheduling a job |
| pod_preemption_victims | Counter | | Number of selected preemption victims |
| total_preemption_attempts | Counter | | Total preemption attempts in the cluster till now |
| gang_unschedule_task | Counter | | The number of tasks failed to schedule |
Maybe `unschedulable_job` is better :)
I am wondering whether we'd like to measure both? Not that sure now. `unschedulable_job` means how many jobs kube-batch scheduled, and `unschedule_task` means how many tasks lack resources and cannot be scheduled?
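To make the job-level versus task-level distinction concrete, here is a small Go sketch under stated assumptions. The `task` and `job` types, the `minAvailable` gang threshold, and the rule that a job counts as unschedulable when fewer than `minAvailable` of its tasks are placed are all hypothetical choices for illustration, not kube-batch's types or its actual accounting.

```go
package main

// task and job are hypothetical types for illustration only.
type task struct {
	name      string
	scheduled bool // whether the task was placed on a node
}

type job struct {
	name         string
	minAvailable int // assumed gang threshold: tasks that must be placed together
	tasks        []task
}

// countUnscheduled walks all jobs once and returns two separate counts:
// jobs that did not reach their assumed gang threshold, and individual
// tasks that were not placed.
func countUnscheduled(jobs []job) (unschedulableJobs, unscheduledTasks int) {
	for _, j := range jobs {
		placed := 0
		for _, t := range j.tasks {
			if t.scheduled {
				placed++
			} else {
				unscheduledTasks++
			}
		}
		if placed < j.minAvailable {
			unschedulableJobs++
		}
	}
	return unschedulableJobs, unscheduledTasks
}
```

Under these assumptions the two numbers can diverge: a job that meets its threshold may still have unplaced tasks, so a task counter and a job counter would answer different questions.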
### kube-batch operations
This metrics describe internal state of kube-batch.
| Metric name | Metric type | Labels | Description |
Seems we need an empty line above to make it render as a table :)
Yeah. Will add extra line here.
I will hold the doc until the implementation is merged, so as not to bring in any unclear information.
Let's address the comments and get it merged first; if there are any changes, we can open a separate PR according to our implementation :)
Got you. No problem. Will address comments and push a revision for review.
/hold cancel
@k82cn @jiaxuanzhou Updated the docs to address comments. Please have another look.
@Jeffwan lgtm, thanks
/lgtm Thanks for your contribution :)
Add metrics support documentation
What this PR does / why we need it:
We'd like to add metrics support for kube-batch in order to better monitor its performance.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): part of #487
Special notes for your reviewer:
This is a work in progress. I have listed just a few metrics here; please have a look and see whether they make sense. Please also leave comments on additional key metrics we want to add. The code change will come soon.
Release note: