Add a gauge to hold the last restore time that #454
Merged
This is updated on every restore including startup from disk as well as RPC (leader/user initiated restore).
It's a gauge so that it provides a constant reading of the last restore time, which in a healthy server may have been days or weeks ago. That makes it useful to graph alongside the trailing logs age, to check whether restores are in danger of taking longer than the trailing logs can cover.
It differs from fsm.restore both because it is emitted at startup and because it is a gauge rather than a single point-in-time sample.
This is a follow-up to #452 (and relates to hashicorp/consul#9609) after realising the current restore metrics are not really suitable for easily monitoring whether the last restore took dangerously long or not!
Having this gauge means operators only need to compare two gauge values over time to get a good indicator of whether their cluster is in danger of becoming unrecoverable.