server: Snapshot after cluster version downgrade #13686
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##             main   #13686      +/-   ##
==========================================
+ Coverage   72.51%   72.57%   +0.06%
==========================================
  Files         465      465
  Lines       37865    37877      +12
==========================================
+ Hits        27457    27491      +34
+ Misses       8620     8588      -32
- Partials     1788     1798      +10
```
server/etcdserver/server.go (Outdated)

```diff
@@ -291,6 +291,9 @@ type EtcdServer struct {
 	clusterVersionChanged *notify.Notifier

 	*AccessController
+	// shouldSnapshot informs if snapshot should be triggered after apply.
+	// Should only be set within apply.
+	shouldSnapshot bool
```
Probably it's better to use the `versionChanged` notification, see cluster.go#L569. You can trigger the snapshot in `monitorStorageVersion`.
We cannot do this without a major rewrite.
The problem is that running a snapshot requires access to `etcdProgress`, which is passed through the apply code. This means the signal needs to be passed by checking a variable, not by receiving a notification. Receiving a notification requires the reader to block on a channel read. Apply cannot block on a read because it does a lot of CPU computation, unlike `monitorStorageVersion`, which just reads from channels in a for loop. And if we don't block on the read, there is a chance we will miss a signal in the channel.
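For illustration, here is a minimal, runnable sketch of the flag-based approach being described. The types and names are simplified stand-ins for etcd's real `EtcdServer` and `etcdProgress`, not the PR's exact code:

```go
package main

import "fmt"

type etcdProgress struct{ appliedi, snapi uint64 }

type server struct {
	snapshotCount  uint64
	shouldSnapshot bool // set within apply; checked, never blocked on
}

// apply is CPU-heavy and cannot afford to block on a notification
// channel, so it records the downgrade as a plain flag instead.
func (s *server) apply(clusterVersionLowered bool) {
	if clusterVersionLowered {
		s.shouldSnapshot = true
	}
}

// triggerSnapshot already receives etcdProgress through the apply path,
// so checking the flag here needs no extra plumbing.
func (s *server) triggerSnapshot(ep *etcdProgress) {
	if !s.shouldSnapshot && ep.appliedi-ep.snapi <= s.snapshotCount {
		return
	}
	fmt.Printf("snapshotting at applied index %d\n", ep.appliedi)
	ep.snapi = ep.appliedi
	s.shouldSnapshot = false
}

func main() {
	s := &server{snapshotCount: 10000}
	ep := &etcdProgress{appliedi: 42, snapi: 40}
	s.apply(true) // a ClusterVersionSet entry lowered the cluster version
	s.triggerSnapshot(ep)
}
```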
Agreed on this from an implementation perspective. Please see my next comment.
server/etcdserver/server.go (Outdated)

```diff
@@ -1081,9 +1084,10 @@ func (s *EtcdServer) applyEntries(ep *etcdProgress, apply *apply) {
 }

 func (s *EtcdServer) triggerSnapshot(ep *etcdProgress) {
-	if ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
+	if !s.shouldSnapshot && ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
```
This change may introduce additional readability complexity. When other developers read this source code in the future, they may need some time to figure out when `shouldSnapshot` will be true or false.
I suggest keeping all the reasons to create a snapshot in one place, and changing the if to:

```go
if !s.shouldSnapshot {
	return
}
```
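In other words, the threshold check would move next to wherever the flag is set, so every reason to snapshot is decided in one place. A hypothetical sketch of that consolidation (not code from this PR; the helper name is made up):

```go
// All reasons to snapshot funnel into one flag; triggerSnapshot then
// only needs to check s.shouldSnapshot (hypothetical consolidation).
func (s *EtcdServer) markShouldSnapshot(ep *etcdProgress, versionLowered bool) {
	s.shouldSnapshot = versionLowered ||
		ep.appliedi-ep.snapi > s.Cfg.SnapshotCount
}
```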
Agree that `shouldSnapshot` could be confusing; however, I don't agree that we should unify everything under one variable. I think checking both the variable and `etcdProgress` makes more sense. Renamed it to `forceSnapshot`.
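After the rename, the check plausibly reads like this (a hedged reconstruction of the resulting condition, not verbatim merged code):

```go
func (s *EtcdServer) triggerSnapshot(ep *etcdProgress) {
	// Snapshot either because apply observed a cluster-version
	// downgrade (forceSnapshot) or because enough entries accumulated.
	if !s.forceSnapshot && ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
		return
	}
	// ... take the snapshot, then reset forceSnapshot ...
}
```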
LGTM
Thank you!
Part of #13168
Etcd downgrades rely on both the cluster version and the storage version being lowered. After the cluster version is lowered, peer-to-peer communication becomes backward compatible with older etcd. This is a prerequisite to downgrading the storage version, as no new incompatible entries will be added to the WAL. However, the WAL can still include older incompatible entries that make downgrading the storage version impossible. To remove those entries we need to wait for a snapshot. Waiting for an automatic snapshot would take too long (one every 10000 proposals), so I propose to snapshot immediately after the cluster version is lowered.
There is no internal Raft request for asking every cluster member to snapshot, so we do it in the raft apply step.
It's not the cleanest implementation, but I didn't find a reasonable way to send a signal from the applier to the apply function.
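Roughly, the hook sits in the apply path that handles a ClusterVersionSet request. A simplified sketch under assumptions: the function name is made up, and the version comparison uses go-semver (github.com/coreos/go-semver/semver), which etcd already depends on:

```go
// onClusterVersionSet is a simplified stand-in for the apply-path hook
// (names approximate). When the new cluster version is lower than the
// previous one, request a snapshot so WAL entries written by the newer
// version get compacted away before the storage version is lowered.
func (s *EtcdServer) onClusterVersionSet(prev, next *semver.Version) {
	if prev != nil && next.LessThan(*prev) {
		s.shouldSnapshot = true // consumed by triggerSnapshot after apply
	}
}
```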
cc @ptabor @ahrtr @spzala