Add oldest log metric to leader #452

Merged
merged 3 commits into from
Apr 6, 2021
@banks (Member) commented Mar 26, 2021

This adds a raft.leader.oldestLogAge metric that reports, in wall-clock time, how far back the leader's retained logs go. This matters for detecting when a raft cluster is approaching an unrecoverable state, where restoring a snapshot takes longer than the time span of logs still available on the leader. Once in this state, followers can never get healthy.

See more info in hashicorp/consul#9609. This also relates to #444 which provides a mechanism to recover from the state.

Review Notes

This adds a new timestamp field to the Log struct. While it's not part of the contract, all existing LogStore implementations I know of will serialise the new field correctly, as they use msgpack or other self-describing encodings. If an implementation did not, it would ignore the new field and not preserve it when the log is read back through GetLog. In that case the new metric would just always read 0, which seems like reasonable graceful degradation. The BoltDB log store that all HashiCorp products currently use should preserve the new field without further changes (needs verifying, but I'm pretty sure).

Existing raft clusters will also have logs on disk with no timestamp set, so until the oldest retained log is one written after this change was deployed, the emitted age will also be 0s.

This seems like reasonable degradation behaviour.

We choose not to emit warnings in logs on read errors because they would be unnecessarily noisy or would not be real errors (e.g. no logs because we just restored a snapshot and truncated). If there is a legitimate error reading logs, raft's actual replication mechanisms will already surface it clearly when they try to read logs, without an extra metrics-only goroutine adding periodic error messages too.

Question: Have I missed any other possible negative effects of adding a timestamp field to Log?

raft.go Outdated
@@ -1080,6 +1087,7 @@ func (r *Raft) dispatchLogs(applyLogs []*logFuture) {
 		lastIndex++
 		applyLog.log.Index = lastIndex
 		applyLog.log.Term = term
+		applyLog.log.AppendedAt = time.Now()
@banks (Member, Author) commented:

Pull this out of the loop.

@schristoff (Contributor) left a comment

One minor nit; if it's too much work, feel free to merge without it :)
