
[FIXED] Clustering: channel first/last sequence may fall to zero #840

Merged: 2 commits from fix_833 into master on May 19, 2019

Conversation

kozlovic (Member)

This could happen if the leader took a snapshot while messages were
not yet expired, then a node is started without state and tries
to restore from this snapshot. If the messages have expired by then,
no message would be stored. If that node later took a snapshot itself,
it would persist the first/last sequences as zero. If no messages are
published and this node becomes leader, it would start storing
messages at the wrong sequence and would also send the bad snapshot
to other nodes.

Resolves #833

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
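For readers skimming the fix, here is a minimal, self-contained sketch of the guard described above. The type and function names (channelState, applySnapshot) are illustrative assumptions, not the actual nats-streaming-server code; the point is only that when a restore ends with zero messages because everything expired, the channel must carry over the snapshot's last sequence instead of letting first/last fall to zero.

package main

import "fmt"

// channelState is a stand-in (assumption) for the channel's message
// store bounds; not the real nats-streaming-server structure.
type channelState struct {
	first, last uint64
}

// applySnapshot sketches the guard: if no message could be restored
// (all expired), keep the snapshot's last sequence so that the next
// published message is stored at snapLast+1 rather than at 1.
func applySnapshot(c *channelState, snapFirst, snapLast uint64, restored int) {
	if restored == 0 && snapLast > 0 {
		c.first, c.last = snapLast+1, snapLast
		return
	}
	c.first, c.last = snapFirst, snapLast
}

func main() {
	var c channelState
	// All messages 5..10 expired before the restore completed.
	applySnapshot(&c, 5, 10, 0)
	fmt.Println(c.first, c.last) // 11 10: next message gets sequence 11
}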

@coveralls commented May 18, 2019

Coverage Status

Coverage increased (+0.007%) to 92.076% when pulling cff1dfe on fix_833 into 51d6217 on master.

@derekcollison (Member) left a comment

LGTM, some minor comments.

server/clustering.go (review comment resolved)
server/snapshot.go (review comment resolved, outdated)
(First commit: same message as the PR description above.)
@kozlovic (Member Author)

@derekcollison I force-pushed because I did a rebase from master to pick up the raft shutdown changes, but I kept a separate commit to ease the review. I can squash prior to merge. The commit to review would be this one:

- bail out after a number of failed attempts to restore msgs
- create snapshot on success if restore msgs indicates that
  first in channel has moved.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
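As a rough sketch of those two commit bullets (the constant and helper signatures below are assumptions for illustration, not the actual implementation): restore attempts are bounded instead of retried forever, and a successful restore reports whether a new snapshot should be taken because the channel's first sequence moved.

package sketch

// maxRestoreAttempts is an assumed bound for illustration; the real
// limit in the server may differ.
const maxRestoreAttempts = 10

// restoreMsgs bails out after a number of failed attempts; on success
// it reports whether the first sequence in the channel has moved, in
// which case the caller should create a fresh snapshot.
func restoreMsgs(restoreOnce func() error, firstMoved func() bool) (snapshotNeeded bool, err error) {
	for attempts := 0; ; attempts++ {
		if err = restoreOnce(); err == nil {
			break
		}
		if attempts+1 >= maxRestoreAttempts {
			return false, err // give up instead of retrying forever
		}
	}
	return firstMoved(), nil
}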
@derekcollison (Member) left a comment

Some questions..

In server/snapshot.go:
s.mu.Lock()
// Also don't use startGoRoutine() here since it needs the
// server lock, which we already have.
go func() {
	s.raft.Snapshot().Error()
}()
@derekcollison (Member)

Can't this be done in place, correct?

@kozlovic (Member Author)

Meaning without a go routine? If so, no. As I tried to explain, as part of the startup process we call NewRaft() (the hashicorp raft constructor), which returns a Raft object that we assign to s.raft.Raft at the end of createRaftNode(). NewRaft() is the one invoking Restore() on startup (when there are local snapshots), so I can't call s.raft.Snapshot() until the creation of the raft node has completed.
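To make that ordering concrete, here is a hedged sketch (the raftNode wrapper is an assumption; raft.NewRaft and its signature are from hashicorp/raft): Restore() runs inside NewRaft() itself, before the returned *raft.Raft is ever assigned, so a Snapshot() call issued from within Restore() would hit a nil Raft. Deferring it to a go routine that runs after node creation avoids that.

package sketch

import "github.com/hashicorp/raft"

// raftNode mirrors, as an assumption, the server's wrapper whose Raft
// field is only populated once createRaftNode() finishes.
type raftNode struct {
	*raft.Raft
}

// createRaftNode sketches the constraint: raft.NewRaft() replays any
// local snapshot through fsm.Restore() before it returns, so nothing
// reachable from Restore() may call Snapshot() on the node yet.
func createRaftNode(cfg *raft.Config, fsm raft.FSM, logs raft.LogStore,
	stable raft.StableStore, snaps raft.SnapshotStore, trans raft.Transport) (*raftNode, error) {

	r, err := raft.NewRaft(cfg, fsm, logs, stable, snaps, trans) // Restore() may run in here
	if err != nil {
		return nil, err
	}
	// Only now is it safe to call Snapshot() on the node, which is why
	// the fix requests the snapshot from a go routine started after
	// this point.
	return &raftNode{Raft: r}, nil
}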

server/snapshot.go (review comment resolved)
@derekcollison (Member) left a comment

LGTM..

@kozlovic kozlovic merged commit 9177cad into master May 19, 2019
@kozlovic kozlovic deleted the fix_833 branch May 19, 2019 23:27
Development

Successfully merging this pull request may close these issues.

Clustering: still seeing cases where store first/last falls to 0 (#833)
3 participants