etcdserver: add learner metrics #10731
docs/metrics/latest (outdated)
etcd_server_is_learner

# name: "etcd_debugging_learner_promote_failures"
# description: "The total number of learner promote failures (likely learner not ready)."
So this counter will only get updated if the local member is the leader, right? Consider a user sending promote requests to a server which is not the leader at the moment. The user then checks the metrics and sees 0 promote failures and 0 promote successes. The description needs to be clearer to avoid user confusion. Maybe 'The total number of learner promotion failures while the server is leader'?
Also there are 3 possible reasons that a promote could fail, maybe we should not suggest it is likely to be one of them?
I get your point. Users can use a Prometheus sum query to get the total number of failures. Since this is a debugging metric, I think it is fine to add the failure reason as a metric label. @jingyih
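For illustration, the idea of one counter per failure reason plus a summed total could look roughly like the sketch below. It is a stdlib-only stand-in, not the Prometheus client and not the code in this PR: `labeledCounter` and its methods are hypothetical names, and only "learner not ready" comes from the description above; the other reason string is made up.

```go
package main

import (
	"fmt"
	"sync"
)

// labeledCounter is a hypothetical stand-in for a counter metric that
// carries a single label (here: the failure reason).
type labeledCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newLabeledCounter() *labeledCounter {
	return &labeledCounter{counts: make(map[string]uint64)}
}

// Inc adds one to the series identified by the given reason.
func (c *labeledCounter) Inc(reason string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[reason]++
}

// Get returns the count recorded for a single reason.
func (c *labeledCounter) Get(reason string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[reason]
}

// Sum returns the total across all reasons, which is what a sum query
// over the label would report.
func (c *labeledCounter) Sum() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	var total uint64
	for _, n := range c.counts {
		total += n
	}
	return total
}

func main() {
	failures := newLabeledCounter()
	failures.Inc("learner not ready")
	failures.Inc("learner not ready")
	failures.Inc("some other reason") // hypothetical reason string
	fmt.Println(failures.Get("learner not ready"), failures.Sum())
}
```

With a label in place, the description no longer has to guess which failure reason is the likely one.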
OK.
Is this still WIP? This PR only contains doc change.
This will be useful. Can we update this PR with the actual code change?
We also need a changelog entry (maybe @jingyih can collect all the changes at once in one PR).
This is blocked by #10730.
#10730 merged. Please update PR. @WIZARD-CXY
@jingyih will do
b6be3cd to 819a554
@jingyih I checked the code and found that it is hard to monitor in real time whether the member is a learner. The code is not ready for this yet. PTAL?
Can we update the Line 2236 in 819a554
Whenever there is a raft entry to add a node, add a learner node, or promote a node, and the entry's node ID is the same as the local member ID, update the learner status.
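A minimal sketch of that rule, under stated assumptions: the types and names below (`confChange`, `changeType`, `server`, `applyConfChange`) are hypothetical and do not mirror etcd's or raft's actual types. The `isLearner` field stands in for the `etcd_server_is_learner` gauge.

```go
package main

import "fmt"

// changeType is a hypothetical enumeration of membership changes.
type changeType int

const (
	addNode changeType = iota
	addLearnerNode
	promoteNode
)

// confChange is a hypothetical, simplified membership-change entry.
type confChange struct {
	Type   changeType
	NodeID uint64
}

// server holds the local member ID and a stand-in for the learner gauge.
type server struct {
	localID   uint64
	isLearner float64 // 1 if the local member is a learner, 0 otherwise
}

// applyConfChange updates the learner status only when the entry
// refers to the local member.
func (s *server) applyConfChange(cc confChange) {
	if cc.NodeID != s.localID {
		return // entry is about another member; leave the status alone
	}
	switch cc.Type {
	case addLearnerNode:
		s.isLearner = 1
	case addNode, promoteNode:
		s.isLearner = 0
	}
}

func main() {
	s := &server{localID: 4}
	s.applyConfChange(confChange{Type: addLearnerNode, NodeID: 4})
	fmt.Println("after add learner:", s.isLearner)
	s.applyConfChange(confChange{Type: promoteNode, NodeID: 4})
	fmt.Println("after promote:", s.isLearner)
}
```

Because the status changes only when an entry is applied, each member reports its own role without having to poll the membership list.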
Good idea!
@jingyih PTAL. I changed it according to your suggestions and tested it on my local machine: I set up a 3-node cluster and added one learner. Below is part of the metrics result. The test passed.
4bb39ad to 86e168f
lgtm after nits.
86e168f to 7d3cfba
@jingyih updated, PTAL
7d3cfba to 0b8727b
lgtm
Can you send another PR to add the new metrics to the 3.4 changelog?
@jingyih consider it done
@jingyih @xiang90 Add learner metrics.
I suppose we will get etcd_debugging_learner_promote_failures and etcd_debugging_learner_promote_success from the leader.