Add oldest log metric to leader #452

banks · 2021-03-26T23:03:42Z

This adds a raft.leader.oldestLogAge metric that reports, in wall-clock time, how far back the leader's logs go. It matters because it shows when a raft cluster is getting close to an unrecoverable state, where restoring a snapshot takes longer than the span of time the leader still has logs for. Once in this state, followers can never get healthy.

See more info in hashicorp/consul#9609. This also relates to #444 which provides a mechanism to recover from the state.

Review Notes

This adds a new timestamp field to the Log struct. While it's not part of the contract, all existing LogStore implementations I know of will serialise the new field correctly, as they use msgpack or other self-describing encodings. If an implementation did not, it would ignore the new field and not preserve it when the log is read back through GetLogs. That case would just result in the new metric always reading 0, which seems like reasonable graceful degradation. The BoltDB log store that all HashiCorp products currently use should preserve the new field without further changes (needs verifying, but I'm pretty sure).

Existing raft clusters will also have logs on disk that don't have timestamps set, so the age emitted will be 0s until the oldest log still retained is one that was written after this change was running in the system.

This seems like reasonable degradation behaviour.

We choose not to emit warnings in logs on any read errors because they would be unnecessarily noisy or not real errors (e.g. no logs because we just restored a snapshot and truncated). If there is a legitimate error reading logs, the actual replication mechanisms in raft will already make it clearly known when they try to read logs, without an extra metrics-only goroutine adding periodic error messages to the logs too.

Question: Have I missed any other possible negative effects of adding a timestamp field to Log?

banks · 2021-03-29T19:24:32Z

raft.go

@@ -1080,6 +1087,7 @@ func (r *Raft) dispatchLogs(applyLogs []*logFuture) {
 		lastIndex++
 		applyLog.log.Index = lastIndex
 		applyLog.log.Term = term
+		applyLog.log.AppendedAt = time.Now()


Pull this out of the loop.

raft.go

schristoff

One minor nit, if it's too much work feel free to merge without :)

log_test.go

Add oldest log metric to leader

6897763

banks commented Mar 29, 2021

Pull time.Now out of a tight loop

feaa535

mkeeler reviewed Mar 30, 2021

raft.go

schristoff approved these changes Mar 31, 2021

log_test.go (outdated)

Move test helpers

e5de142

briankassouf approved these changes Apr 2, 2021

banks merged commit f3ecdb6 into master Apr 6, 2021

banks deleted the log-metrics branch April 6, 2021 11:28

This was referenced Apr 8, 2021

Add a gauge to hold the last restore time that #454

Merged

Don't expire Prometheus metrics that have been explicitly defined hashicorp/go-metrics#123

Merged
