Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This adds
raft.leader.oldestLogAge
metric that can be used to monitor how long in wall-clock time the leader has logs for. This is important to be able to understand when a raft cluster is getting close to an unrecoverable state where snapshot restore takes longer than the amount of time we still have logs available for on the leader. Once in this state followers can never get healthy.See more info in hashicorp/consul#9609. This also relates to #444 which provides a mechanism to recover from the state.
Review Notes
This adds a timestamp to the
Log
struct which is new. While it's not part of the contract, all existingLogStore
implementations I know of will serialise the new field correctly as they use msgPack or other self-describing encodings. If an implementation did not, it would ignore the new field and not preserve it when it is read back throughGetLogs
this case would just result in the new metric always reading0
which seems like a reasonable graceful degredation. The BoltDB log store all HashiCorp products use currently should preserve the new field without further changes (needs verifying but I'm pretty sure).Existing raft clusters will also have current logs on disk that don't have timestamps set, so until the oldest log preserved is a new one that was written after this change existed in the system, the age emitted will be 0s too.
This seems like reasonable degredation behaviour.
We choose not to emit warnings in logs on any read errors because they would be unnecessarily noisy or not real errors (e.g. no logs because we just restored a snapshot and truncated). In case of a legitimate error reading logs, the actual replication mechanisms in raft would already make those clearly known when trying to read logs etc. without an extra metrics-only goroutine adding periodic error messages to logs too.
Question: Have I missed any other possible negative effects of adding a timestamp field to
Log
?