Add a gauge to hold the last restore time that #454

banks · 2021-04-08T12:25:49Z

This is updated on every restore including startup from disk as well as RPC (leader/user initiated restore).

It's a gauge so that it provides a constant reading of the last restore time. In a healthy server that restore may have happened days or weeks ago, but the value is still useful to graph alongside the trailing logs age to check whether restores are in danger of taking longer than the trailing logs can cover.

It differs from fsm.restore both because it is also emitted at startup and because it is a gauge rather than a single point-in-time sample.

This is a follow-up to #452 (and relates to hashicorp/consul#9609) after realising the current restore metrics are not really suitable for easily monitoring whether the last restore took dangerously long or not!

Having this gauge means operators only need to compare two gauge values over time to get a good indicator of whether their cluster is in danger of becoming unrecoverable.
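As a rough illustration of that comparison, here is a minimal sketch. It assumes both gauges are read in milliseconds; the function name `restoreAtRisk` and the `margin` parameter are hypothetical and not part of this PR or of any alerting tool.

```go
package main

import "fmt"

// restoreAtRisk is a hypothetical helper (not part of this PR). It compares
// two gauge readings, both assumed to be in milliseconds:
//   - lastRestoreMs:  how long the most recent restore took
//   - oldestLogAgeMs: how far back the trailing logs currently reach
// margin is the fraction of the trailing-log window a restore may use
// before we call it risky (for example 0.5 for half the window).
func restoreAtRisk(lastRestoreMs, oldestLogAgeMs, margin float64) bool {
	if oldestLogAgeMs <= 0 {
		// No trailing-log reading yet, so there is nothing to compare against.
		return false
	}
	return lastRestoreMs >= oldestLogAgeMs*margin
}

func main() {
	// A 30s restore against one hour of trailing logs: comfortable.
	fmt.Println(restoreAtRisk(30_000, 3_600_000, 0.5)) // false
	// A 40min restore against one hour of trailing logs: too close.
	fmt.Println(restoreAtRisk(2_400_000, 3_600_000, 0.5)) // true
}
```

In practice this check would more likely live in a dashboard or alert rule than in Go code; the point is only that two gauge values and a threshold are enough.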

dnephin

Nice! Some thoughts/questions on specifics, but generally I think adding a metric like this is a really good idea.

banks · 2021-04-09T14:33:48Z

Thanks @dnephin I think this is a cleaner diff now!

dnephin

Nice, LGTM!

I was thinking of only moving the one metric to a function, but moving both metrics and the restore call works quite nicely as well.
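For readers without the diff to hand, the shape of that refactor can be sketched roughly as follows. This is not the merged code: the `recorder` type is a hypothetical in-memory stand-in for a metrics sink, the metric key `fsm.lastRestoreDuration` is illustrative, and skipping the metrics when the restore fails is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// recorder is a hypothetical in-memory stand-in for a metrics sink.
type recorder struct {
	samples map[string][]float32 // point-in-time samples, one per event
	gauges  map[string]float32   // last-value gauges, overwritten on each set
}

func newRecorder() *recorder {
	return &recorder{
		samples: map[string][]float32{},
		gauges:  map[string]float32{},
	}
}

// errSimulated is used below to demonstrate the failure path.
var errSimulated = errors.New("simulated restore failure")

// restoreAndMeasure runs restore and, on success, records its duration in
// milliseconds twice: as a sample and as a gauge. Routing every restore
// (startup or RPC) through one function keeps the two metrics consistent.
func restoreAndMeasure(r *recorder, restore func() error) error {
	start := time.Now()
	if err := restore(); err != nil {
		// Assumption of this sketch: failed restores record nothing.
		return err
	}
	ms := float32(time.Since(start).Milliseconds())
	r.samples["fsm.restore"] = append(r.samples["fsm.restore"], ms)
	r.gauges["fsm.lastRestoreDuration"] = ms
	return nil
}

func main() {
	r := newRecorder()

	// Successful restore: both metrics are recorded.
	_ = restoreAndMeasure(r, func() error {
		time.Sleep(5 * time.Millisecond)
		return nil
	})

	// Failed restore: the gauge keeps its previous value.
	err := restoreAndMeasure(r, func() error { return errSimulated })

	fmt.Println("samples recorded:", len(r.samples["fsm.restore"]))
	fmt.Println("failed restore error:", err)
}
```

The gauge keeps reporting the last successful restore duration until the next restore overwrites it, which is what makes it practical to graph next to the trailing logs age.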

dnephin reviewed Apr 8, 2021

banks added 2 commits April 9, 2021 15:03

Refactor Restore into a function that records metrics consistently

f7d39f5

Always Be Closing

b619536

dnephin approved these changes Apr 9, 2021

banks merged commit 7fa243a into master Apr 12, 2021

banks deleted the restore-metric branch April 12, 2021 15:44

banks mentioned this pull request Apr 14, 2021

Don't expire Prometheus metrics that have been explicitly defined hashicorp/go-metrics#123

Merged
