Watch response traveling back in time when reconnecting member downloads snapshot from the leader #15271
cc @logicalhan
Looks like it reproduces on #15273. The problem is not caused by a specific golang failpoint injection, but simply by blackholing traffic long enough for a snapshot to be sent. Managed to reproduce it on v3.5. Important note: as @lavacat pointed out, blackholing has some issues, as it doesn't drop all traffic between members. This should also be investigated, as it might not be a failure that can happen under normal conditions.
OK, managed to reproduce it without the linearizability tests. The issue I found: after a split brain, if enough revisions have passed (configured via --experimental-snapshot-catchup-entries), the reconnecting member will download a snapshot from the leader. During the recovery, watches on the reconnecting member will go back in time. This can be seen in both:
The repro is pretty complicated, but it boils down to running a 3 node cluster configured like in the Procfile with [...]. When running [...]
Logs from the reconnecting member:
Busy with some personal stuff today, will have a deep dive next week.
Please do not panic, it's a test issue. FYI: #15288
This was not confirmed to be fixed. Still reproduces on #15273.
Still looking into this issue. Current hypothesis is that after the snapshot is restored, watches are added as [...]. Going to work on a fix to prove this.
Thanks for looking into this.
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced. minRev will be behind since it's not updated while the watcher stays synced. Solution: update minRev. Fixes: etcd-io#15271. Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
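To make the failure mode easier to follow, here is a minimal, self-contained Go sketch of the behaviour described in the commit message above. It is not etcd's actual mvcc code: the watcher/store types are invented for illustration, and the sketch advances minRev at delivery time, which may not be exactly where the merged fix places the update. The point is only that a synced watcher whose minRev is never advanced will, once a snapshot restore moves it to unsynced, be re-synced from a stale revision and replay events it already delivered.

```go
package main

import "fmt"

// watcher tracks the next revision it still needs to receive.
type watcher struct {
	id     int
	minRev int64 // next revision this watcher expects; used to resync from history
}

// store keeps a trivial revision history plus synced/unsynced watcher sets.
type store struct {
	rev      int64
	history  map[int64]string // revision -> event payload
	synced   map[*watcher]bool
	unsynced map[*watcher]bool
}

// notify delivers a new event to all synced watchers. The buggy variant
// never advances minRev, which is harmless while the watcher stays synced
// but becomes visible after a restore.
func (s *store) notify(payload string, fixMinRev bool) {
	s.rev++
	s.history[s.rev] = payload
	for w := range s.synced {
		fmt.Printf("watcher %d got rev %d (%s)\n", w.id, s.rev, payload)
		if fixMinRev {
			w.minRev = s.rev + 1 // keep minRev current while synced
		}
	}
}

// restore simulates recovering from a leader snapshot: every synced watcher
// is moved to unsynced and later re-synced starting from its minRev.
func (s *store) restore() {
	for w := range s.synced {
		s.unsynced[w] = true
		delete(s.synced, w)
	}
}

// resync replays history from each unsynced watcher's minRev. With a stale
// minRev the watcher re-receives revisions it already saw ("back in time").
func (s *store) resync() {
	for w := range s.unsynced {
		for rev := w.minRev; rev <= s.rev; rev++ {
			fmt.Printf("watcher %d re-sent rev %d (%s)\n", w.id, rev, s.history[rev])
		}
		w.minRev = s.rev + 1
		s.synced[w] = true
		delete(s.unsynced, w)
	}
}

func main() {
	s := &store{history: map[int64]string{}, synced: map[*watcher]bool{}, unsynced: map[*watcher]bool{}}
	w := &watcher{id: 1, minRev: 1}
	s.synced[w] = true

	s.notify("a", false) // rev 1 delivered, but minRev stays at 1
	s.notify("b", false) // rev 2 delivered, minRev still 1
	s.restore()          // snapshot recovery moves the watcher to unsynced
	s.resync()           // replays revs 1 and 2 again: the watch went back in time
}
```

Running the buggy variant prints revisions 1 and 2 twice, which is exactly the "watch traveling back in time" symptom; with fixMinRev set to true the resync after restore has nothing to replay.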
After two weeks, I decided to look into it myself.
The fix was backported to both v3.4 and v3.5. Will update the changelog before the release.
Prevent picking a failpoint that waits till snapshot on clusters that don't support lower snapshot catchup entries, but allow reproducing issue #15271.
What happened?
Recently added failpoints started showing issues related to etcd recovery from the leader's snapshot. #15104 (comment)
In the report uploaded by @lavacat we see the revision jumping by 10, which resembles previous data inconsistencies where, during restore, a member applied the same entries multiple times.
I was not able to reproduce the issue with the linearizability model; however, in my case the recently introduced watch verification is triggered.
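For context, the invariant that the watch verification enforces boils down to: revisions delivered on a single watch must never decrease. The sketch below is not the actual verification code from the linearizability tests; it is a hedged example of how a client could assert the same property with clientv3, with the endpoint, key prefix, and starting revision as placeholder assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; point this at the reconnecting member under test.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch a placeholder prefix from revision 1 and assert that the
	// ModRevisions observed on this single watch never go backwards.
	var lastRev int64
	for resp := range cli.Watch(context.Background(), "/test/", clientv3.WithPrefix(), clientv3.WithRev(1)) {
		if err := resp.Err(); err != nil {
			log.Fatalf("watch error: %v", err)
		}
		for _, ev := range resp.Events {
			if ev.Kv.ModRevision < lastRev {
				log.Fatalf("watch went back in time: got revision %d after %d", ev.Kv.ModRevision, lastRev)
			}
			lastRev = ev.Kv.ModRevision
		}
	}
}
```

During the snapshot recovery described in this issue, a check like this would fail on the reconnecting member, because the revisions delivered on its watch jump backwards after the member restores from the leader's snapshot.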
What did you expect to happen?
As #15104 (comment) only introduced a failpoint and didn't make any changes to traffic, I would not expect Snapshot failpoints to cause problems. This is a strong sign of an etcd issue.
How can we reproduce it (as minimally and precisely as possible)?
make gofail-enable && make && make gofail-disable
GO_TEST_FLAGS='-v --count=100 --failfast --run TestLinearizability/Snapshot ' make test-linearizability
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response