Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a gauge to hold the last restore time that #454

Merged
merged 3 commits into from
Apr 12, 2021
Merged

Conversation

banks
Copy link
Member

@banks banks commented Apr 8, 2021

This is updated on every restore including startup from disk as well as RPC (leader/user initiated restore).

It's a gauge to provide a constant reading of the last restore time which may have been days or weeks ago in a healthy server but is useful to be able to graph along side the trailing logs age to check if there is danger of restores taking longer than trailing logs can cover.

It differs from fsm.restore both because it is omitted at startup and because it is a gauge rather than a single point in time sample.

This is a follow up to #452 (and relates to hashicorp/consul#9609) after realising the current restore metrics are not really suitable to easily monitor whether the last restore too dangerously long or not!

Having this gauge means operators only need to compare two gauge values over time to get a good indicator of if their cluster is in danger of becoming unrecoverable.

This is updated on every restore including startup from disk as well as RPC (leader/user initiated restore).

It's a gauge to provide a constant reading of the last restore time which may have been days or weeks ago in a healthy server but is useful to be able to graph along side the trailing logs age to check if there is danger of restores taking longer than trailing logs can cover.

It differes from fsm.restore both because it is omitted at startup and because it is a gauge rather than a single point in time sample.
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Some thoughts/questions on specifics, but generally I think adding a metric like this is a really good idea.

api.go Outdated Show resolved Hide resolved
api.go Outdated Show resolved Hide resolved
@banks
Copy link
Member Author

banks commented Apr 9, 2021

Thanks @dnephin I think this is a cleaner diff now!

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM!

I was thinking of only moving the one metric to a function, but moving both metrics and the restore call works quite nicely as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants