server: Snapshot after cluster version downgrade #13686
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##             main   #13686      +/-   ##
==========================================
+ Coverage   72.51%   72.57%   +0.06%
==========================================
  Files         465      465
  Lines       37865    37877      +12
==========================================
+ Hits        27457    27491      +34
+ Misses       8620     8588      -32
- Partials     1788     1798      +10
```
server/etcdserver/server.go (Outdated)

```diff
@@ -291,6 +291,9 @@ type EtcdServer struct {
 	clusterVersionChanged *notify.Notifier

 	*AccessController
+	// shouldSnapshot informs if snapshot should be triggered after apply.
+	// Should only be set within apply.
+	shouldSnapshot bool
```
Probably it's better to use the `versionChanged` notification, see cluster.go#L569. You can trigger the snapshot in `monitorStorageVersion`.
We cannot do this without a major rewrite.
The problem is that running a snapshot requires access to `etcdProgress`, which is passed through the apply code. This means the signal needs to be passed by checking a variable, not by receiving a notification. Receiving a notification requires the reader to block on a channel read. Apply cannot block on a read because it does a lot of CPU computation, unlike `monitorStorageVersion`, which just reads from channels in a for loop. And if we don't block on the read, there is a chance we will miss a signal in the channel.
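For illustration, here is a minimal, runnable sketch of the flag-based approach being described. The types and names are simplified stand-ins for etcd's real `EtcdServer` and `etcdProgress`, not the PR's exact code:

```go
package main

import "fmt"

type etcdProgress struct{ appliedi, snapi uint64 }

type server struct {
	snapshotCount  uint64
	shouldSnapshot bool // set within apply; checked, never blocked on
}

// apply is CPU-heavy and cannot afford to block on a notification
// channel, so it records the downgrade as a plain flag instead.
func (s *server) apply(clusterVersionLowered bool) {
	if clusterVersionLowered {
		s.shouldSnapshot = true
	}
}

// triggerSnapshot already receives etcdProgress through the apply path,
// so checking the flag here needs no extra plumbing.
func (s *server) triggerSnapshot(ep *etcdProgress) {
	if !s.shouldSnapshot && ep.appliedi-ep.snapi <= s.snapshotCount {
		return
	}
	fmt.Printf("snapshotting at applied index %d\n", ep.appliedi)
	ep.snapi = ep.appliedi
	s.shouldSnapshot = false
}

func main() {
	s := &server{snapshotCount: 10000}
	ep := &etcdProgress{appliedi: 42, snapi: 40}
	s.apply(true) // a ClusterVersionSet entry lowered the cluster version
	s.triggerSnapshot(ep)
}
```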
Agreed on this from an implementation perspective. Please see my next comment.
server/etcdserver/server.go (Outdated)

```diff
@@ -1081,9 +1084,10 @@ func (s *EtcdServer) applyEntries(ep *etcdProgress, apply *apply) {
 }

 func (s *EtcdServer) triggerSnapshot(ep *etcdProgress) {
-	if ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
+	if !s.shouldSnapshot && ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
```
This change may introduce additional readability complexity. When other developers read this source code in the future, they may need some time to figure out when `shouldSnapshot` will be true or false.
I suggest keeping all the reasons to create a snapshot in one place, and changing the if to:

```go
if !s.shouldSnapshot {
	return
}
```
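In other words, the threshold check would move next to wherever the flag is set, so every reason to snapshot is decided in one place. A hypothetical sketch of that consolidation (not code from this PR; the helper name is made up):

```go
// All reasons to snapshot funnel into one flag; triggerSnapshot then
// only needs to check s.shouldSnapshot (hypothetical consolidation).
func (s *EtcdServer) markShouldSnapshot(ep *etcdProgress, versionLowered bool) {
	s.shouldSnapshot = versionLowered ||
		ep.appliedi-ep.snapi > s.Cfg.SnapshotCount
}
```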
Agree that `shouldSnapshot` could be confusing; however, I don't agree that we should unify everything under one variable. I think checking both the variable and `etcdProgress` makes more sense. Renamed it to `forceSnapshot`.
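After the rename, the check plausibly reads like this (a hedged reconstruction of the resulting condition, not verbatim merged code):

```go
func (s *EtcdServer) triggerSnapshot(ep *etcdProgress) {
	// Snapshot either because apply observed a cluster-version
	// downgrade (forceSnapshot) or because enough entries accumulated.
	if !s.forceSnapshot && ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
		return
	}
	// ... take the snapshot, then reset forceSnapshot ...
}
```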
LGTM
Thank you!
Part of #13168
Etcd downgrades rely on both the cluster version and the storage version being lowered. After the cluster version is lowered, peer-to-peer communication becomes backward compatible with older etcd. This is a prerequisite to downgrading the storage version, as no new incompatible entries will be added to the WAL. However, the WAL can still include older incompatible entries that make downgrading the storage version impossible. To remove those entries we need to wait for a snapshot. Waiting for an automatic snapshot would take too long (one every 10000 proposals), so I propose to snapshot immediately after the cluster version is lowered.
There is no internal Raft request for asking every cluster member to snapshot, so we do it in the raft apply step.
It's not the cleanest implementation, but I didn't find a reasonable way to send a signal from the applier to the apply function.
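Roughly, the hook sits in the apply path that handles a ClusterVersionSet request. A simplified sketch under assumptions: the function name is made up, and the version comparison uses go-semver (github.com/coreos/go-semver/semver), which etcd already depends on:

```go
// onClusterVersionSet is a simplified stand-in for the apply-path hook
// (names approximate). When the new cluster version is lower than the
// previous one, request a snapshot so WAL entries written by the newer
// version get compacted away before the storage version is lowered.
func (s *EtcdServer) onClusterVersionSet(prev, next *semver.Version) {
	if prev != nil && next.LessThan(*prev) {
		s.shouldSnapshot = true // consumed by triggerSnapshot after apply
	}
}
```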
cc @ptabor @ahrtr @spzala