
Watch response traveling back in time when reconnecting member downloads snapshot from the leader #15271

Closed
serathius opened this issue Feb 9, 2023 · 11 comments · Fixed by #15288 or #15492
Assignees
Labels
area/testing priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release/v3.5 release/v3.6 type/bug

Comments

@serathius
Member

What happened?

Recently added failpoints started showing issues related to etcd recovery from the leader's snapshot. #15104 (comment)

In the report uploaded by @lavacat, we see the revision jumping by 10, which resembles previous data inconsistencies where, during restore, a member applied the same entries multiple times.
[screenshot from the report showing the revision jump]

I was not able to reproduce the issue with the linearizability model; however, in my case the recently introduced watch verification is triggered.

    watch.go:104: Expect response revision equal last event mod revision
    watch.go:99: Expect revision to grow by 1, last: 3924, mod: 2171
    watch.go:104: Expect response revision equal last event mod revision
    watch.go:99: Expect revision to grow by 1, last: 3979, mod: 3171
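
For context, these two assertions amount to a monotonicity check over the watch stream. Below is a minimal illustrative sketch of such a check; the types and function are assumptions made for illustration, not the actual tests/linearizability code:

```go
package main

import "fmt"

// Illustrative only: these types and this check are assumptions sketching
// what the quoted assertions verify, not the actual test code.
type event struct {
	ModRevision int64
}

type watchResponse struct {
	Revision int64 // header revision of the response
	Events   []event
}

// checkWatch reports the two violations quoted above: event revisions must
// grow by exactly 1, and a response's header revision must equal the mod
// revision of its last event.
func checkWatch(responses []watchResponse) (violations []string) {
	var last int64
	for _, resp := range responses {
		for _, ev := range resp.Events {
			if last != 0 && ev.ModRevision != last+1 {
				violations = append(violations, fmt.Sprintf(
					"Expect revision to grow by 1, last: %d, mod: %d", last, ev.ModRevision))
			}
			last = ev.ModRevision
		}
		if len(resp.Events) != 0 && resp.Revision != last {
			violations = append(violations,
				"Expect response revision equal last event mod revision")
		}
	}
	return violations
}

func main() {
	// A watch stream that jumps back in time, as in the reported failure.
	responses := []watchResponse{
		{Revision: 3924, Events: []event{{ModRevision: 3923}, {ModRevision: 3924}}},
		{Revision: 2171, Events: []event{{ModRevision: 2171}}},
	}
	for _, v := range checkWatch(responses) {
		fmt.Println(v)
	}
}
```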

What did you expect to happen?

As #15104 (comment) only introduced a failpoint and didn't make any changes to traffic, I would not expect Snapshot failpoints to cause problems. This is a strong sign of an etcd issue.

How can we reproduce it (as minimally and precisely as possible)?

  1. Clone latest main branch
  2. Run make gofail-enable && make && make gofail-disable
  3. Uncomment Snapshot scenario from tests/linearizability/linearizability_test.go
  4. Run GO_TEST_FLAGS='-v --count=100 --failfast --run TestLinearizability/Snapshot ' make test-linearizability

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.6.0-alpha.0
Git SHA: 3c64ae4b7
Go Version: go1.19
Go OS/Arch: linux/amd64

Etcd configuration (command line flags or environment variables)

One from Snapshot scenario

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

Not relevant

Relevant log output

No response

@serathius
Member Author

cc @ahrtr @ptabor

@serathius
Member Author

cc @logicalhan

@serathius serathius added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release/v3.6 release/v3.5 labels Feb 9, 2023
serathius added a commit to serathius/etcd that referenced this issue Feb 9, 2023
serathius added a commit to serathius/etcd that referenced this issue Feb 9, 2023
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius
Member Author

serathius commented Feb 9, 2023

Looks like it reproduces #15273

The problem is not caused by a specific golang failpoint injection, but simply by blackholing traffic long enough for a snapshot to be sent. Managed to reproduce it on v3.5.

Important note: as @lavacat pointed out, blackholing has some issues, as it doesn't drop all traffic between members. This should also be investigated, as it might not be a failure that can normally happen.

	// Blackholing will cause peers to not be able to use streamWriters registered with member
	// but peer traffic is still possible because member has 'pipeline' with peers
	// TODO: find a way to stop all traffic

@serathius
Member Author

serathius commented Feb 10, 2023

Ok, managed to reproduce it without the linearizability tests. The issue I found is that after a split brain, if enough revisions have passed (configured via --experimental-snapshot-catchup-entries), the reconnecting member will download a snapshot from the leader. During the recovery, the watch on the reconnecting member will go back in time.

This can be seen in both:

Repro is pretty complicated, but it boils down to running a 3 node cluster configured like in Procfile with --snapshot-count 100 --experimental-snapshot-catchup-entries 100, watching revisions on all members, and using sudo /sbin/iptables -A INPUT -p tcp --destination-port 12380 -j DROP && sudo conntrack --flush and sudo /sbin/iptables -D INPUT -p tcp --destination-port 12380 -j DROP && sudo conntrack --flush to simulate the member being disconnected (at least dropping communication to the member). Might require running sudo conntrack --flush again to ensure connections are broken/restored.
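
For convenience, the commands from the paragraph above collected in one place (the port and flags follow the Procfile setup described there; adjust for your environment):

```sh
# 3-node cluster started as in Procfile, each member with:
#   --snapshot-count 100 --experimental-snapshot-catchup-entries 100

# Watch revisions on a member to observe them going back in time.
./bin/etcdctl watch -w json '' --prefix | jq '.Events[].kv.mod_revision'

# Disconnect the member listening on peer port 12380.
sudo /sbin/iptables -A INPUT -p tcp --destination-port 12380 -j DROP && sudo conntrack --flush

# Wait until enough revisions pass for the leader to compact its Raft log,
# then reconnect so the member downloads a snapshot from the leader.
sudo /sbin/iptables -D INPUT -p tcp --destination-port 12380 -j DROP && sudo conntrack --flush

# Running `sudo conntrack --flush` again may be needed to ensure connections
# are actually broken/restored.
```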

When running ./bin/etcdctl watch -w json '' --prefix | jq '.Events[].kv.mod_revision' I got revisions like the ones below:

...
23822
23823
23824
23825
23826
23143
23144
23145
23146
23147
23148
23149
23150
...

Logs from reconnecting member

{"level":"info","ts":"2023-02-10T11:35:46.664192+0100","caller":"etcdserver/server.go:2092","msg":"saved snapshot","snapshot-index":23366}
{"level":"info","ts":"2023-02-10T11:35:46.664217+0100","caller":"etcdserver/server.go:2122","msg":"compacted Raft logs","compact-index":23266}
{"level":"warn","ts":"2023-02-10T11:35:46.73507+0100","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":3632582161898019144,"retry-timeout":"500ms"}
...
{"level":"warn","ts":"2023-02-10T11:36:32.310554+0100","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":3632582161898019150,"retry-timeout":"500ms"}
{"level":"info","ts":"2023-02-10T11:36:32.682188+0100","caller":"fileutil/purge.go:85","msg":"purged","path":"infra1.etcd/member/snap/0000000000000004-00000000000058b8.snap"}
{"level":"info","ts":"2023-02-10T11:36:32.682276+0100","caller":"fileutil/purge.go:85","msg":"purged","path":"infra1.etcd/member/snap/0000000000000004-0000000000005a7c.snap"}
...
{"level":"warn","ts":"2023-02-10T11:37:29.904742+0100","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":3632582161898019158,"retry-timeout":"500ms"}
{"level":"warn","ts":"2023-02-10T11:37:29.991459+0100","caller":"etcdserver/v3_server.go:808","msg":"ignored out-of-date read index response; local node read indexes queueing up and waiting to be in sync with leader","sent-request-id":3632582161898019158,"received-request-id":3632582161898019144}
...
{"level":"warn","ts":"2023-02-10T11:37:29.998761+0100","caller":"etcdserver/v3_server.go:808","msg":"ignored out-of-date read index response; local node read indexes queueing up and waiting to be in sync with leader","sent-request-id":3632582161898019158,"received-request-id":3632582161898019157}
{"level":"info","ts":"2023-02-10T11:37:29.999091+0100","caller":"rafthttp/http.go:257","msg":"receiving database snapshot","local-member-id":"8211f1d0f64f3269","remote-snapshot-sender-id":"91bc3c398fb3c146","incoming-snapshot-index":24248,"incoming-snapshot-message-size-bytes":13798,"incoming-snapshot-message-size":"14 kB"}
{"level":"info","ts":"2023-02-10T11:37:30.010646+0100","caller":"snap/db.go:65","msg":"saved database snapshot to disk","path":"infra1.etcd/member/snap/0000000000005eb8.snap.db","bytes":1425408,"size":"1.4 MB"}
{"level":"info","ts":"2023-02-10T11:37:30.010694+0100","caller":"rafthttp/http.go:286","msg":"received and saved database snapshot","local-member-id":"8211f1d0f64f3269","remote-snapshot-sender-id":"91bc3c398fb3c146","incoming-snapshot-index":24248,"incoming-snapshot-size-bytes":1425408,"incoming-snapshot-size":"1.4 MB","download-took":"11.691036ms"}
{"level":"info","ts":"2023-02-10T11:37:30.010782+0100","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:346","msg":"log [committed=23849, applied=23849, unstable.offset=23850, len(unstable.Entries)=0] starts to restore snapshot [index: 24248, term: 4]"}
{"level":"info","ts":"2023-02-10T11:37:30.010819+0100","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1697","msg":"8211f1d0f64f3269 switched to configuration voters=(9372538179322589801 10501334649042878790 18249187646912138824)"}
{"level":"info","ts":"2023-02-10T11:37:30.010842+0100","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1653","msg":"8211f1d0f64f3269 [commit: 24248, lastindex: 24248, lastterm: 4] restored snapshot [index: 24248, term: 4]"}
{"level":"info","ts":"2023-02-10T11:37:30.010855+0100","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1563","msg":"8211f1d0f64f3269 [commit: 24248] restored snapshot [index: 24248, term: 4]"}
{"level":"info","ts":"2023-02-10T11:37:30.010943+0100","caller":"etcdserver/server.go:950","msg":"applying snapshot","current-snapshot-index":23770,"current-applied-index":23849,"incoming-leader-snapshot-index":24248,"incoming-leader-snapshot-term":4}
{"level":"info","ts":"2023-02-10T11:37:30.020945+0100","caller":"etcdserver/raft.go:260","msg":"applied incoming Raft snapshot","snapshot-index":24248}
{"level":"info","ts":"2023-02-10T11:37:30.022605+0100","caller":"storage/backend.go:81","msg":"opened backend db","path":"infra1.etcd/member/snap/db","took":"1.478836ms"}
{"level":"info","ts":"2023-02-10T11:37:30.022637+0100","caller":"etcdserver/server.go:997","msg":"restoring lease store"}
{"level":"info","ts":"2023-02-10T11:37:30.026399+0100","caller":"etcdserver/server.go:1001","msg":"restored lease store"}
{"level":"info","ts":"2023-02-10T11:37:30.026416+0100","caller":"etcdserver/server.go:1004","msg":"restoring mvcc store"}
{"level":"info","ts":"2023-02-10T11:37:30.036145+0100","caller":"mvcc/kvstore.go:385","msg":"kvstore restored","current-rev":24225}
{"level":"info","ts":"2023-02-10T11:37:30.036209+0100","caller":"etcdserver/server.go:1012","msg":"restored mvcc store","consistent-index":24248}
{"level":"info","ts":"2023-02-10T11:37:30.036226+0100","caller":"etcdserver/server.go:1032","msg":"restoring alarm store"}
{"level":"info","ts":"2023-02-10T11:37:30.036287+0100","caller":"etcdserver/server.go:1020","msg":"closing old backend file"}
{"level":"info","ts":"2023-02-10T11:37:30.041619+0100","caller":"etcdserver/server.go:1038","msg":"restored alarm store"}
{"level":"info","ts":"2023-02-10T11:37:30.041637+0100","caller":"etcdserver/server.go:1041","msg":"restoring auth store"}
{"level":"info","ts":"2023-02-10T11:37:30.041657+0100","caller":"etcdserver/server.go:1045","msg":"restored auth store"}
{"level":"info","ts":"2023-02-10T11:37:30.041668+0100","caller":"etcdserver/server.go:1048","msg":"restoring v2 store"}
{"level":"info","ts":"2023-02-10T11:37:30.042075+0100","caller":"etcdserver/server.go:1057","msg":"restored v2 store"}
{"level":"info","ts":"2023-02-10T11:37:30.042096+0100","caller":"etcdserver/server.go:1061","msg":"restoring cluster configuration"}
{"level":"info","ts":"2023-02-10T11:37:30.042154+0100","caller":"membership/cluster.go:288","msg":"recovered/added member from store","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","recovered-remote-peer-id":"8211f1d0f64f3269","recovered-remote-peer-urls":["http://127.0.0.1:12380"],"recovered-remote-peer-is-learner":false}
{"level":"info","ts":"2023-02-10T11:37:30.042173+0100","caller":"membership/cluster.go:288","msg":"recovered/added member from store","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","recovered-remote-peer-id":"91bc3c398fb3c146","recovered-remote-peer-urls":["http://127.0.0.1:22380"],"recovered-remote-peer-is-learner":false}
{"level":"info","ts":"2023-02-10T11:37:30.042186+0100","caller":"membership/cluster.go:288","msg":"recovered/added member from store","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","recovered-remote-peer-id":"fd422379fda50e48","recovered-remote-peer-urls":["http://127.0.0.1:32380"],"recovered-remote-peer-is-learner":false}
{"level":"info","ts":"2023-02-10T11:37:30.042205+0100","caller":"membership/cluster.go:298","msg":"set cluster version from store","cluster-version":"3.6"}
{"level":"info","ts":"2023-02-10T11:37:30.042216+0100","caller":"etcdserver/server.go:1065","msg":"restored cluster configuration"}
{"level":"info","ts":"2023-02-10T11:37:30.042228+0100","caller":"etcdserver/server.go:1066","msg":"removing old peers from network"}
{"level":"info","ts":"2023-02-10T11:37:30.042232+0100","caller":"etcdserver/server.go:1022","msg":"closed old backend file"}
{"level":"info","ts":"2023-02-10T11:37:30.04224+0100","caller":"rafthttp/peer.go:316","msg":"stopping remote peer","remote-peer-id":"fd422379fda50e48"}
{"level":"warn","ts":"2023-02-10T11:37:30.042329+0100","caller":"rafthttp/stream.go:286","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.042352+0100","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"fd422379fda50e48"}
{"level":"warn","ts":"2023-02-10T11:37:30.04245+0100","caller":"rafthttp/stream.go:286","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.042471+0100","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.042505+0100","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"warn","ts":"2023-02-10T11:37:30.042602+0100","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48","error":"context canceled"}
{"level":"warn","ts":"2023-02-10T11:37:30.042631+0100","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"fd422379fda50e48","error":"failed to read fd422379fda50e48 on stream MsgApp v2 (context canceled)"}
{"level":"info","ts":"2023-02-10T11:37:30.042651+0100","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"warn","ts":"2023-02-10T11:37:30.042734+0100","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48","error":"context canceled"}
{"level":"info","ts":"2023-02-10T11:37:30.042757+0100","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.042775+0100","caller":"rafthttp/peer.go:321","msg":"stopped remote peer","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.042792+0100","caller":"rafthttp/transport.go:354","msg":"removed remote peer","local-member-id":"8211f1d0f64f3269","removed-remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.042804+0100","caller":"rafthttp/peer.go:316","msg":"stopping remote peer","remote-peer-id":"91bc3c398fb3c146"}
{"level":"warn","ts":"2023-02-10T11:37:30.043113+0100","caller":"rafthttp/stream.go:286","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.04314+0100","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"91bc3c398fb3c146"}
{"level":"warn","ts":"2023-02-10T11:37:30.043498+0100","caller":"rafthttp/stream.go:286","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.043525+0100","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.043569+0100","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"warn","ts":"2023-02-10T11:37:30.043632+0100","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146","error":"context canceled"}
{"level":"warn","ts":"2023-02-10T11:37:30.043658+0100","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"91bc3c398fb3c146","error":"failed to read 91bc3c398fb3c146 on stream MsgApp v2 (context canceled)"}
{"level":"info","ts":"2023-02-10T11:37:30.043679+0100","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"warn","ts":"2023-02-10T11:37:30.043753+0100","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146","error":"context canceled"}
{"level":"info","ts":"2023-02-10T11:37:30.043778+0100","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.0438+0100","caller":"rafthttp/peer.go:321","msg":"stopped remote peer","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.043817+0100","caller":"rafthttp/transport.go:354","msg":"removed remote peer","local-member-id":"8211f1d0f64f3269","removed-remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.043832+0100","caller":"etcdserver/server.go:1071","msg":"removed old peers from network"}
{"level":"info","ts":"2023-02-10T11:37:30.043847+0100","caller":"etcdserver/server.go:1072","msg":"adding peers from new cluster configuration"}
{"level":"info","ts":"2023-02-10T11:37:30.043879+0100","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.043903+0100","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044326+0100","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044406+0100","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044452+0100","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044456+0100","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044477+0100","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146","remote-peer-urls":["http://127.0.0.1:22380"]}
{"level":"info","ts":"2023-02-10T11:37:30.044498+0100","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.044514+0100","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.044517+0100","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044656+0100","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.044744+0100","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.044864+0100","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044862+0100","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.044889+0100","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.044888+0100","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.04492+0100","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044878+0100","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:30.044937+0100","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48","remote-peer-urls":["http://127.0.0.1:32380"]}
{"level":"info","ts":"2023-02-10T11:37:30.044999+0100","caller":"etcdserver/server.go:1081","msg":"added peers from new cluster configuration"}
{"level":"info","ts":"2023-02-10T11:37:30.045017+0100","caller":"etcdserver/server.go:958","msg":"applied snapshot","current-snapshot-index":24248,"current-applied-index":24248,"incoming-leader-snapshot-index":24248,"incoming-leader-snapshot-term":4}
{"level":"warn","ts":"2023-02-10T11:37:30.045036+0100","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"8211f1d0f64f3269","remote-peer-id-stream-handler":"8211f1d0f64f3269","remote-peer-id-from":"91bc3c398fb3c146","cluster-id":"ef37ad9dc622a7c4"}
{"level":"info","ts":"2023-02-10T11:37:30.045068+0100","caller":"traceutil/trace.go:171","msg":"trace[1982849715] linearizableReadLoop","detail":"{readStateIndex:24248; appliedIndex:23849; }","duration":"6.648345669s","start":"2023-02-10T11:37:23.396691+0100","end":"2023-02-10T11:37:30.045036+0100","steps":["trace[1982849715] 'read index received'  (duration: 6.60229637s)","trace[1982849715] 'applied index is now lower than readState.Index'  (duration: 46.048157ms)"],"step_count":2}
{"level":"info","ts":"2023-02-10T11:37:30.045278+0100","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.045292+0100","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.045307+0100","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:30.144238+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"8211f1d0f64f3269","to":"91bc3c398fb3c146","stream-type":"stream Message"}
{"level":"info","ts":"2023-02-10T11:37:30.144275+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"warn","ts":"2023-02-10T11:37:30.407694+0100","caller":"etcdserver/v3_server.go:808","msg":"ignored out-of-date read index response; local node read indexes queueing up and waiting to be in sync with leader","sent-request-id":3632582161898019159,"received-request-id":3632582161898019158}
{"level":"info","ts":"2023-02-10T11:37:32.035451+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"8211f1d0f64f3269","to":"91bc3c398fb3c146","stream-type":"stream MsgApp v2"}
{"level":"info","ts":"2023-02-10T11:37:32.035462+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"8211f1d0f64f3269","to":"fd422379fda50e48","stream-type":"stream Message"}
{"level":"info","ts":"2023-02-10T11:37:32.035487+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"91bc3c398fb3c146"}
{"level":"info","ts":"2023-02-10T11:37:32.0355+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:32.035675+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"8211f1d0f64f3269","to":"fd422379fda50e48","stream-type":"stream MsgApp v2"}
{"level":"info","ts":"2023-02-10T11:37:32.035706+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"8211f1d0f64f3269","remote-peer-id":"fd422379fda50e48"}
{"level":"info","ts":"2023-02-10T11:37:32.684715+0100","caller":"fileutil/purge.go:85","msg":"purged","path":"infra1.etcd/member/snap/0000000000000004-0000000000005b46.snap"}
{"level":"info","ts":"2023-02-10T11:37:41.781729+0100","caller":"etcdserver/server.go:1141","msg":"triggering snapshot","local-member-id":"8211f1d0f64f3269","local-member-applied-index":24349,"local-member-snapshot-index":24248,"local-member-snapshot-count":100,"snapshot-forced":false}
{"level":"info","ts":"2023-02-10T11:37:41.794112+0100","caller":"etcdserver/server.go:2092","msg":"saved snapshot","snapshot-index":24349}
{"level":"info","ts":"2023-02-10T11:37:41.794133+0100","caller":"etcdserver/server.go:2122","msg":"compacted Raft logs","compact-index":24249}

@serathius serathius changed the title Investigate potential linearizability issue caused by Snapshot failpoints Watch response traveling back in time during reconnecting member snapshot download Feb 10, 2023
@serathius serathius changed the title Watch response traveling back in time during reconnecting member snapshot download Watch response traveling back in time when reconnecting member downlaods snapshot from leader Feb 10, 2023
@serathius serathius changed the title Watch response traveling back in time when reconnecting member downlaods snapshot from leader Watch response traveling back in time when reconnecting member downloads snapshot from leader Feb 10, 2023
@serathius serathius changed the title Watch response traveling back in time when reconnecting member downloads snapshot from leader Watch response traveling back in time when reconnecting member downloads snapshot from the leader Feb 10, 2023
@ahrtr
Member

ahrtr commented Feb 10, 2023

Busy with some personal stuff today, will have a deep dive next week.

@ptabor ptabor pinned this issue Feb 10, 2023
serathius added a commit to serathius/etcd that referenced this issue Feb 10, 2023
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@ahrtr
Member

ahrtr commented Feb 12, 2023

Please do not panic, it's a test issue. FYI. #15288

@serathius
Member Author

serathius commented Feb 13, 2023

This was not confirmed to be fixed. Still reproduces on #15273

serathius added a commit to serathius/etcd that referenced this issue Feb 13, 2023
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit to serathius/etcd that referenced this issue Feb 13, 2023
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit to serathius/etcd that referenced this issue Feb 14, 2023
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit to serathius/etcd that referenced this issue Feb 14, 2023
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@lavacat

lavacat commented Feb 28, 2023

Still looking into this issue. The current hypothesis is that after the snapshot is restored, watchers are added as unsynced.
But the minRev of a newly added watcher can be very low, because it was always synced and minRev isn't updated.
Next, chooseAll picks this low minRev and syncWatchers sends events again.

Going to work on a fix to prove this.
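
To make the hypothesis concrete, here is a small self-contained model of the mechanism (a sketch only; the names and structure are simplified assumptions, not etcd's actual watchableStore code). A watcher that stays synced never advances its minRev; after a snapshot restore moves it to unsynced, the catch-up scan replays from that stale minRev:

```go
package main

import "fmt"

// Simplified model; names are assumptions, not etcd's actual implementation.
type watcher struct {
	id     int
	minRev int64 // oldest revision this watcher still needs
}

type store struct {
	currentRev int64
	synced     map[*watcher]bool // watchers caught up with currentRev
	unsynced   map[*watcher]bool // watchers that need a catch-up scan
}

// notify delivers a new event to synced watchers. Modelling the bug:
// minRev is NOT advanced while a watcher stays synced.
func (s *store) notify(rev int64, bumpMinRev bool) {
	s.currentRev = rev
	for w := range s.synced {
		fmt.Printf("watcher %d: event at rev %d\n", w.id, rev)
		if bumpMinRev { // the proposed fix: keep minRev current while synced
			w.minRev = rev + 1
		}
	}
}

// restore models applying a leader snapshot: every synced watcher is moved
// to unsynced so that it is re-checked against the restored store.
func (s *store) restore(snapshotRev int64) {
	s.currentRev = snapshotRev
	for w := range s.synced {
		s.unsynced[w] = true
		delete(s.synced, w)
	}
}

// syncWatchers replays history for each unsynced watcher starting from its
// minRev (the real code picks the smallest minRev to bound a single scan),
// then marks it synced again.
func (s *store) syncWatchers() {
	for w := range s.unsynced {
		for rev := w.minRev; rev <= s.currentRev; rev++ {
			fmt.Printf("watcher %d: REPLAYED event at rev %d\n", w.id, rev)
		}
		w.minRev = s.currentRev + 1
		s.synced[w] = true
		delete(s.unsynced, w)
	}
}

func main() {
	w := &watcher{id: 1, minRev: 1}
	s := &store{synced: map[*watcher]bool{w: true}, unsynced: map[*watcher]bool{}}

	// The watcher stays synced through revisions 1..3; minRev stays at 1 (the bug).
	for rev := int64(1); rev <= 3; rev++ {
		s.notify(rev, false)
	}

	// A leader snapshot at rev 5 triggers restore; the watcher moves to unsynced.
	s.restore(5)

	// The catch-up scan replays from minRev=1, so revisions 1..3 are delivered
	// a second time: the watch appears to travel back in time.
	s.syncWatchers()
}
```

Calling notify with bumpMinRev set to true models the fix direction described above: keeping minRev current while the watcher is synced means the post-restore scan starts at the snapshot revision and nothing is re-sent.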

@serathius
Member Author

serathius commented Feb 28, 2023

Thanks for looking into this.

lavacat pushed a commit to lavacat/etcd that referenced this issue Mar 9, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
serathius pushed a commit to serathius/etcd that referenced this issue Mar 16, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius serathius mentioned this issue Mar 16, 2023
serathius pushed a commit to serathius/etcd that referenced this issue Mar 16, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius serathius assigned serathius and unassigned lavacat Mar 16, 2023
@serathius
Member Author

After two weeks, I decided to look into it myself.

serathius pushed a commit to serathius/etcd that referenced this issue Mar 20, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius pushed a commit to serathius/etcd that referenced this issue Mar 20, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius pushed a commit to serathius/etcd that referenced this issue Mar 20, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius
Copy link
Member Author

The fix was backported to both v3.4 and v3.5. Will update the changelog before the release.

@serathius serathius unpinned this issue Mar 21, 2023
tjungblu pushed a commit to tjungblu/etcd that referenced this issue Jul 26, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
tjungblu pushed a commit to tjungblu/etcd that referenced this issue Jul 26, 2023
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind since it's not updated when watcher stays synced.

Solution: update minRev

fixes: etcd-io#15271
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit to serathius/etcd that referenced this issue May 13, 2024
…upport lower snapshot catchup entries but allow reproducing issue etcd-io#15271
serathius added a commit to serathius/etcd that referenced this issue May 13, 2024
…upport lower snapshot catchup entries but allow reproducing issue etcd-io#15271

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit to serathius/etcd that referenced this issue May 13, 2024
…upport lower snapshot catchup entries but allow reproducing issue etcd-io#15271

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit to serathius/etcd that referenced this issue May 13, 2024
…upport lower snapshot catchup entries but allow reproducing issue etcd-io#15271

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit to serathius/etcd that referenced this issue May 13, 2024
…upport lower snapshot catchup entries but allow reproducing issue etcd-io#15271

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
serathius added a commit that referenced this issue May 13, 2024
Prevent picking a failpoint that waiting till snapshot that doesn't support lower snapshot catchup entries but allow reproducing issue #15271
fykaa pushed a commit to fykaa/etcd that referenced this issue May 19, 2024
…upport lower snapshot catchup entries but allow reproducing issue etcd-io#15271

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>