Add saturation metric for the runFSM and run (main) goroutines. #488

dnephin · 2022-01-31T16:48:46Z

The two primary goroutines used by Raft, runFSM and run (main), each process their work on a single thread of execution. Either one can saturate (need more than 100% of the available time to handle the incoming workload) before the system's CPU reaches 100% utilization.

When this happens, it may be possible to observe the problem using some existing metrics (e.g. fsm.apply time), but properly interpreting those metrics requires deep knowledge of how Raft works. Presenting the data on a dashboard can also be a challenge, because it requires summing the time and knowing the aggregation period of the metrics in order to interpret the summed result.

The existing metrics may also not fully capture the time, because they only measure specific operations done by those goroutines, not the full work vs idle time.

This issue proposes adding two new metrics (one for each goroutine) which measure the amount of time those goroutines spent doing work. When compared to the wall clock time, this gives us a clear signal about the saturation of these operations, and how much buffer there is before the incoming work starts to cause a backlog.

Adds metrics suggested in #488, to record the percentage of time the main and FSM goroutines are busy with work vs available to accept new work, to give operators an idea of how close they are to hitting capacity limits. We keep 256 samples in memory for each metric, and update the gauges at most once a second, possibly less often if the goroutines are idle. This should be OK because it's unlikely that a goroutine would go from very high saturation to being completely idle (so at worst we'll leave the gauge on the previous low value).

dnephin added the enhancement label Jan 31, 2022

boxofrad mentioned this issue Feb 2, 2022

Thread saturation metrics 📈 #489

Merged
