
[FIXED] Clustering: channel first/last sequence may fall to zero #840

Merged: 2 commits from fix_833 into master on May 19, 2019

Conversation

kozlovic (Member)

This could happen if the leader took a snapshot while messages were
not yet expired, then a node is started without state and tries
to restore from this snapshot. If the messages have expired by then,
no message would be stored. If that node later took a snapshot itself,
it would persist the first/last sequences as zero. If no messages are
published and this node becomes leader, it would start storing
messages at the wrong sequence and would also send the bad snapshot
to other nodes.

Resolves #833

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
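For readers skimming the fix, here is a minimal, self-contained sketch of the guard described above. The type and function names (channelState, applySnapshot) are illustrative assumptions, not the actual nats-streaming-server code; the point is only that when a restore ends with zero messages because everything expired, the channel must carry over the snapshot's last sequence instead of letting first/last fall to zero.

package main

import "fmt"

// channelState is a stand-in (assumption) for the channel's message
// store bounds; not the real nats-streaming-server structure.
type channelState struct {
	first, last uint64
}

// applySnapshot sketches the guard: if no message could be restored
// (all expired), keep the snapshot's last sequence so that the next
// published message is stored at snapLast+1 rather than at 1.
func applySnapshot(c *channelState, snapFirst, snapLast uint64, restored int) {
	if restored == 0 && snapLast > 0 {
		c.first, c.last = snapLast+1, snapLast
		return
	}
	c.first, c.last = snapFirst, snapLast
}

func main() {
	var c channelState
	// All messages 5..10 expired before the restore completed.
	applySnapshot(&c, 5, 10, 0)
	fmt.Println(c.first, c.last) // 11 10: next message gets sequence 11
}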

@coveralls commented May 18, 2019

Coverage Status

Coverage increased (+0.007%) to 92.076% when pulling cff1dfe on fix_833 into 51d6217 on master.

@derekcollison (Member) left a comment

LGTM, some minor comments.

server/clustering.go (review comment resolved)
server/snapshot.go (review comment resolved, outdated)
(First commit: same message as the PR description above.)
@kozlovic (Member Author)

@derekcollison I force-pushed because I did a rebase from master to pick up the raft shutdown changes, but I kept a separate commit to ease the review. I can squash prior to merge. The commit to review would be this one:

- bail out after a number of failed attempts to restore msgs
- create snapshot on success if restore msgs indicates that
  first in channel has moved.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
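As a rough sketch of those two commit bullets (the constant and helper signatures below are assumptions for illustration, not the actual implementation): restore attempts are bounded instead of retried forever, and a successful restore reports whether a new snapshot should be taken because the channel's first sequence moved.

package sketch

// maxRestoreAttempts is an assumed bound for illustration; the real
// limit in the server may differ.
const maxRestoreAttempts = 10

// restoreMsgs bails out after a number of failed attempts; on success
// it reports whether the first sequence in the channel has moved, in
// which case the caller should create a fresh snapshot.
func restoreMsgs(restoreOnce func() error, firstMoved func() bool) (snapshotNeeded bool, err error) {
	for attempts := 0; ; attempts++ {
		if err = restoreOnce(); err == nil {
			break
		}
		if attempts+1 >= maxRestoreAttempts {
			return false, err // give up instead of retrying forever
		}
	}
	return firstMoved(), nil
}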
@derekcollison (Member) left a comment

Some questions..

In server/snapshot.go:
s.mu.Lock()
// Also don't use startGoRoutine() here since it needs the
// server lock, which we already have.
go func() {
	s.raft.Snapshot().Error()
}()
@derekcollison (Member)

Can't this be done in place, correct?

@kozlovic (Member Author)

Meaning without a go routine? If so, no. As I tried to explain, as part of the startup process we call NewRaft() (the hashicorp raft constructor), which returns a Raft object that we assign to s.raft.Raft at the end of createRaftNode(). NewRaft() is the one invoking Restore() on startup (when there are local snapshots), so I can't call s.raft.Snapshot() until the creation of the raft node has completed.
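To make that ordering concrete, here is a hedged sketch (the raftNode wrapper is an assumption; raft.NewRaft and its signature are from hashicorp/raft): Restore() runs inside NewRaft() itself, before the returned *raft.Raft is ever assigned, so a Snapshot() call issued from within Restore() would hit a nil Raft. Deferring it to a go routine that runs after node creation avoids that.

package sketch

import "github.com/hashicorp/raft"

// raftNode mirrors, as an assumption, the server's wrapper whose Raft
// field is only populated once createRaftNode() finishes.
type raftNode struct {
	*raft.Raft
}

// createRaftNode sketches the constraint: raft.NewRaft() replays any
// local snapshot through fsm.Restore() before it returns, so nothing
// reachable from Restore() may call Snapshot() on the node yet.
func createRaftNode(cfg *raft.Config, fsm raft.FSM, logs raft.LogStore,
	stable raft.StableStore, snaps raft.SnapshotStore, trans raft.Transport) (*raftNode, error) {

	r, err := raft.NewRaft(cfg, fsm, logs, stable, snaps, trans) // Restore() may run in here
	if err != nil {
		return nil, err
	}
	// Only now is it safe to call Snapshot() on the node, which is why
	// the fix requests the snapshot from a go routine started after
	// this point.
	return &raftNode{Raft: r}, nil
}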

server/snapshot.go (review comment resolved)
@derekcollison (Member) left a comment

LGTM..

@kozlovic kozlovic merged commit 9177cad into master May 19, 2019
@kozlovic kozlovic deleted the fix_833 branch May 19, 2019 23:27
Development

Successfully merging this pull request may close these issues.

Clustering: still seeing cases where store first/last falls to 0 (#833)
3 participants