Add saturation metric for the runFSM and run (main) goroutines. #488
boxofrad added a commit that referenced this issue on Feb 2, 2022:
Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits. We keep 256 samples in memory for each metric, and update gauges (at most) once a second, possibly less if the goroutines are idle.
boxofrad added a commit that referenced this issue on Feb 2, 2022:
Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits. We keep 256 samples in memory for each metric, and update gauges (at most) once a second, possibly less if the goroutines are idle. This should be ok because it's unlikely that a goroutine would go from very high saturation to being completely idle (so at worst we'll leave the gauge on the previous low value).
boxofrad added a commit that referenced this issue on Apr 27, 2022:
Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits.
The two primary goroutines used by Raft, runFSM and run (main), are single-threaded. They can saturate (need more than 100% of the available time to handle the incoming workload) before the CPU of the system reaches 100% utilization.
When this happens, it may be possible to observe the problem using some existing metrics (e.g. `fsm.apply` time), but properly interpreting those metrics requires deep knowledge of how Raft works. It may also be a challenge to present the data on a dashboard, because doing so requires summing the time and knowing the aggregation period of the metrics to interpret the summed result. The existing metrics may also not fully capture the time, because they only measure specific operations done by those goroutines, not the full work vs idle time.
This issue proposes adding two new metrics (one for each goroutine) which measure the amount of time those goroutines spent doing work. When compared to the wall clock time, this gives us a clear signal about the saturation of these operations, and how much buffer there is before the incoming work starts to cause a backlog.