Skip to content
This repository has been archived by the owner on May 25, 2023. It is now read-only.

Add metrics support documentation #579

Merged
merged 1 commit into from
Jan 31, 2019

Conversation

Jeffwan
Copy link
Contributor

@Jeffwan Jeffwan commented Jan 29, 2019

What this PR does / why we need it:
We'd like to add metrics support for kube-batch in order to better monitor it's performance.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
part of #487

Special notes for your reviewer:
This is working in progress. I just list few metrics here and please have look if they make sense. Please also leave comments on additional key metrics we want to add. Code change will come soon.

Release note:

Add metrics support documentation

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 29, 2019
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 29, 2019
@k82cn
Copy link
Contributor

k82cn commented Jan 29, 2019

/cc @jiaxuanzhou , please also share your comments here :)

@k82cn
Copy link
Contributor

k82cn commented Jan 29, 2019

/approve

We need this doc for performance metrics; will add lgtm label when it's ready :)

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2019
@jiaxuanzhou
Copy link
Contributor

/cc @jiaxuanzhou , please also share your comments here :)

how about adding the metrics :
counter of retry times of one job
scheduling latency of one task ?

@k82cn
Copy link
Contributor

k82cn commented Jan 29, 2019

I'm ok with them.

counter of retry times of one job

If we want to log this info, we need to check whether jobs are ready in OnSessionClose.

| e2e_scheduling_latency | histogram | | E2e scheduling latency in seconds |
| plugin_predicate_evaluation | histogram | | Schedule latency for predicate plugin |
| plugin_proportion_evaluation | histogram | | Schedule latency for proportion plugin |
| plugin_drf_evaluation | histogram | | Schedule latency for drf plugin |
Copy link
Contributor

@k82cn k82cn Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... should we have a suggestion/guidance to the new plugins, e.g. plugin_<plugin name>_evaluation? If we add a new plugin, we do not need to change other logics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. Use <plugin_name> as a label will be better.

| PodScheduleSuccesses | Counter | | The number of kube-batch success in scheduling a job |
| pod_preemption_victims | Counter | | Number of selected preemption victims |
| total_preemption_attempts | Counter | | Total preemption attempts in the cluster till now |
| gang_unschedule_task | Counter | | The number of tasks failed to schedule |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe unschedulable_job is better :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking if we'd like to measure both? not that sure now.

unschedulable_job mean how many jobs kube-batch scheduled
unschedule_task means how many tasks lack of resources and can not be scheduled?


### kube-batch operations
This metrics describe internal state of kube-batch.
| Metric name | Metric type | Labels | Description |
Copy link
Contributor

@k82cn k82cn Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we need a empty line above to make it as table :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Will add extra line here.

@Jeffwan
Copy link
Contributor Author

Jeffwan commented Jan 30, 2019

I will hold doc until implementation merged in order not to bring in any unclear information.

@k82cn
Copy link
Contributor

k82cn commented Jan 31, 2019

I will hold doc until implementation merged in order not to bring in any unclear information.

Let's address comments and get it merged firstly; if there any changes, we can open separate PR according to our implementation :)

@Jeffwan
Copy link
Contributor Author

Jeffwan commented Jan 31, 2019

I will hold doc until implementation merged in order not to bring in any unclear information.

Let's address comments and get it merged firstly; if there any changes, we can open separate PR according to our implementation :)

Got you. No problem. Will address comments and push revision for review.

@Jeffwan Jeffwan changed the title [WIP] Add metrics support documentation Add metrics support documentation Jan 31, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2019
@Jeffwan
Copy link
Contributor Author

Jeffwan commented Jan 31, 2019

/hold cancel

@Jeffwan
Copy link
Contributor Author

Jeffwan commented Jan 31, 2019

@k82cn @jiaxuanzhou Update docs to address comments. Please have another look

@jiaxuanzhou
Copy link
Contributor

@Jeffwan lgtm ,thanks

@k82cn
Copy link
Contributor

k82cn commented Jan 31, 2019

/lgtm

Thanks for your contribution :)

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2019
@k8s-ci-robot k8s-ci-robot merged commit 68c2895 into kubernetes-retired:master Jan 31, 2019
kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019
kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019
kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants