[FIXED] Clustering: channel first/last sequence may fall to zero #840
LGTM, some minor comments.
This could happen if the leader took a snapshot while messages had not yet expired, and a node was then started without state and tried to restore from this snapshot. If the messages had expired by then, no message would be stored. If that node later took a snapshot itself, it would persist first/last as zero. If no messages were published and this node became leader, it would start storing messages at the wrong sequence and would also send the bad snapshot to other nodes.
Resolves #833
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
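To make the failure mode concrete, here is a minimal sketch of the scenario described above. It is a toy model, not the server's code: `channelSnapshot`, `channelStore`, and their methods are hypothetical names invented for illustration, and message expiration is modeled simply as messages being absent at restore time.

```go
package main

import "fmt"

// channelSnapshot is a hypothetical stand-in for the per-channel
// first/last sequences recorded in a snapshot.
type channelSnapshot struct {
	First, Last uint64
}

// channelStore is a hypothetical in-memory message store.
type channelStore struct {
	msgs  map[uint64]string
	first uint64
	last  uint64
}

func newChannelStore() *channelStore {
	return &channelStore{msgs: map[uint64]string{}}
}

// restore stores every message in [snap.First, snap.Last] that is still
// available. Expired messages are simply missing from available.
func (c *channelStore) restore(snap channelSnapshot, available map[uint64]string) {
	for seq := snap.First; seq <= snap.Last; seq++ {
		m, ok := available[seq]
		if !ok {
			continue // expired on the leader: nothing to store
		}
		c.msgs[seq] = m
		if c.first == 0 {
			c.first = seq
		}
		c.last = seq
	}
}

// snapshot records whatever first/last the store currently holds.
func (c *channelStore) snapshot() channelSnapshot {
	return channelSnapshot{First: c.first, Last: c.last}
}

// nextSeq is the sequence the node would assign if it became leader.
func (c *channelStore) nextSeq() uint64 { return c.last + 1 }

func main() {
	leaderSnap := channelSnapshot{First: 1, Last: 10}

	// A node without state restores after all ten messages have expired.
	follower := newChannelStore()
	follower.restore(leaderSnap, map[uint64]string{})

	bad := follower.snapshot()
	fmt.Printf("follower snapshot: first=%d last=%d, next publish seq=%d\n",
		bad.First, bad.Last, follower.nextSeq())
	// prints: follower snapshot: first=0 last=0, next publish seq=1
}
```

In this model the node would publish at sequence 1 even though the cluster had already reached 10, which is the "wrong sequence" case the description refers to.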
@derekcollison I force-pushed because I rebased onto master to pick up the raft shutdown changes, but I kept a separate commit to ease the review. I can squash prior to merge. Commit to review would be this one
- bail out after a number of failed attempts to restore msgs
- create snapshot on success if restore msgs indicates that first in channel has moved.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
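The two points in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `restoreWithRetry`, `maxRestoreAttempts`, `errLeaderUnavailable`, and the callback signature are all hypothetical and are not taken from the PR, which may structure the retry and the snapshot decision differently.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRestoreAttempts is a hypothetical bound; the PR does not state the value.
const maxRestoreAttempts = 3

// errLeaderUnavailable is a hypothetical error used to simulate failures.
var errLeaderUnavailable = errors.New("leader unavailable")

// restoreWithRetry calls restore up to maxAttempts times. The restore
// callback returns the channel's first sequence after restoring messages.
// On success it reports whether that first sequence differs from the one
// recorded in the snapshot, in which case a fresh snapshot is warranted.
func restoreWithRetry(maxAttempts int, snapshotFirst uint64,
	restore func() (uint64, error)) (bool, error) {

	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		first, err := restore()
		if err == nil {
			return first != snapshotFirst, nil
		}
		lastErr = err
	}
	// Bail out instead of retrying forever.
	return false, fmt.Errorf("restore failed after %d attempts: %w",
		maxAttempts, lastErr)
}

func main() {
	calls := 0
	needSnapshot, err := restoreWithRetry(maxRestoreAttempts, 1,
		func() (uint64, error) {
			calls++
			if calls < 3 {
				return 0, errLeaderUnavailable // first two attempts fail
			}
			return 5, nil // first moved from 1 to 5
		})
	fmt.Println(needSnapshot, err, calls)
	// prints: true <nil> 3
}
```

Returning a flag rather than taking the snapshot inside the helper keeps the decision separate from the action, which matters here because the snapshot cannot be taken until the raft node exists.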
Some questions..
// Also don't use startGoRoutine() here since it needs the
// server lock, which we already have.
s.mu.Lock()
s.raft.Snapshot().Error()
Can't do this in place, correct?
Meaning without a goroutine? If so, no. As I tried to explain, as part of the startup process we call NewRaft() (the hashicorp raft constructor), which returns a Raft object that we assign to s.raft.Raft at the end of createRaftNode(). NewRaft() is what invokes Restore() on startup (when there are local snapshots), so I can't call s.raft.Snapshot() until after the creation of the raft node has completed.
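The ordering constraint described here can be illustrated with a small self-contained sketch. It does not use the hashicorp raft library: `fakeRaft`, `newFakeRaft`, `server`, and `onRestore` are hypothetical stand-ins that only mimic the shape of the problem, namely a constructor that invokes the restore callback before the caller has a handle to the object.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeRaft is a hypothetical stand-in for the raft node.
type fakeRaft struct {
	mu        sync.Mutex
	snapshots int
}

func (r *fakeRaft) Snapshot() {
	r.mu.Lock()
	r.snapshots++
	r.mu.Unlock()
}

func (r *fakeRaft) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.snapshots
}

// newFakeRaft mimics a constructor that runs the restore callback
// before returning the object to the caller.
func newFakeRaft(restore func()) *fakeRaft {
	restore()
	return &fakeRaft{}
}

type server struct {
	mu   sync.Mutex
	raft *fakeRaft
	wg   sync.WaitGroup
}

// createRaftNode holds the server lock for the whole construction and
// assigns s.raft only at the end.
func (s *server) createRaftNode() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.raft = newFakeRaft(s.onRestore)
}

// onRestore runs while s.mu is held and s.raft is still nil, so it cannot
// snapshot in place. It defers the work to a goroutine that waits for the
// lock, which is only released once s.raft has been assigned.
func (s *server) onRestore() {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		s.mu.Lock() // blocks until createRaftNode returns
		r := s.raft
		s.mu.Unlock()
		if r != nil {
			r.Snapshot()
		}
	}()
}

func main() {
	s := &server{}
	s.createRaftNode()
	s.wg.Wait()
	fmt.Println("snapshots taken:", s.raft.count())
	// prints: snapshots taken: 1
}
```

Calling `Snapshot()` directly inside `onRestore` would dereference a nil `s.raft` in this model, which is why the work is handed to a goroutine that first waits on the server lock.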
LGTM..