Watch response traveling back in time when reconnecting member downloads snapshot from the leader #15271
cc @logicalhan
Looks like it reproduces on #15273. The problem is not caused by a specific golang failpoint injection, but simply by blackholing traffic long enough for a snapshot to be sent. Managed to reproduce it on v3.5. Important note: as @lavacat pointed out, blackholing has some issues, as it doesn't drop all traffic between members. This should also be investigated, as it might not be a failure that can happen under normal conditions.
OK, managed to reproduce it without the linearizability tests. The issue I found: after a split brain, if enough revisions have passed (configured via --experimental-snapshot-catchup-entries), the reconnecting member will download a snapshot from the leader. During the recovery, watches on the reconnecting member will go back in time. This can be seen in both:
The repro is pretty complicated, but it boils down to running a 3 node cluster configured like in the Procfile with [...]. When running [...]
Logs from the reconnecting member:
Busy with some personal stuff today, will have a deep dive next week.
Please do not panic, it's a test issue. FYI: #15288
This was not confirmed to be fixed. Still reproduces on #15273.
Still looking into this issue. Current hypothesis is that after the snapshot is restored, watches are added as [...]. Going to work on a fix to prove this.
Thanks for looking into this.
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced. minRev will be behind since it's not updated while the watcher stays synced. Solution: update minRev. Fixes: etcd-io#15271. Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
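To make the failure mode easier to follow, here is a minimal, self-contained Go sketch of the behaviour described in the commit message above. It is not etcd's actual mvcc code: the watcher/store types are invented for illustration, and the sketch advances minRev at delivery time, which may not be exactly where the merged fix places the update. The point is only that a synced watcher whose minRev is never advanced will, once a snapshot restore moves it to unsynced, be re-synced from a stale revision and replay events it already delivered.

```go
package main

import "fmt"

// watcher tracks the next revision it still needs to receive.
type watcher struct {
	id     int
	minRev int64 // next revision this watcher expects; used to resync from history
}

// store keeps a trivial revision history plus synced/unsynced watcher sets.
type store struct {
	rev      int64
	history  map[int64]string // revision -> event payload
	synced   map[*watcher]bool
	unsynced map[*watcher]bool
}

// notify delivers a new event to all synced watchers. The buggy variant
// never advances minRev, which is harmless while the watcher stays synced
// but becomes visible after a restore.
func (s *store) notify(payload string, fixMinRev bool) {
	s.rev++
	s.history[s.rev] = payload
	for w := range s.synced {
		fmt.Printf("watcher %d got rev %d (%s)\n", w.id, s.rev, payload)
		if fixMinRev {
			w.minRev = s.rev + 1 // keep minRev current while synced
		}
	}
}

// restore simulates recovering from a leader snapshot: every synced watcher
// is moved to unsynced and later re-synced starting from its minRev.
func (s *store) restore() {
	for w := range s.synced {
		s.unsynced[w] = true
		delete(s.synced, w)
	}
}

// resync replays history from each unsynced watcher's minRev. With a stale
// minRev the watcher re-receives revisions it already saw ("back in time").
func (s *store) resync() {
	for w := range s.unsynced {
		for rev := w.minRev; rev <= s.rev; rev++ {
			fmt.Printf("watcher %d re-sent rev %d (%s)\n", w.id, rev, s.history[rev])
		}
		w.minRev = s.rev + 1
		s.synced[w] = true
		delete(s.unsynced, w)
	}
}

func main() {
	s := &store{history: map[int64]string{}, synced: map[*watcher]bool{}, unsynced: map[*watcher]bool{}}
	w := &watcher{id: 1, minRev: 1}
	s.synced[w] = true

	s.notify("a", false) // rev 1 delivered, but minRev stays at 1
	s.notify("b", false) // rev 2 delivered, minRev still 1
	s.restore()          // snapshot recovery moves the watcher to unsynced
	s.resync()           // replays revs 1 and 2 again: the watch went back in time
}
```

Running the buggy variant prints revisions 1 and 2 twice, which is exactly the "watch traveling back in time" symptom; with fixMinRev set to true the resync after restore has nothing to replay.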
After two weeks, I decided to look into it myself.
The fix was backported to both v3.4 and v3.5. Will update the changelog before the release.
Prevent picking a failpoint that waits till snapshot on clusters that don't support lower snapshot catchup entries, but allow reproducing issue #15271.
What happened?
Recently added failpoints started showing issues related to etcd recovery from the leader's snapshot. #15104 (comment)
In the report uploaded by @lavacat we see the revision jumping by 10, which resembles previous data inconsistencies where, during restore, a member applied the same entries multiple times.
I was not able to reproduce the issue with the linearizability model; however, in my case the recently introduced watch verification is triggered.
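For context, the invariant that the watch verification enforces boils down to: revisions delivered on a single watch must never decrease. The sketch below is not the actual verification code from the linearizability tests; it is a hedged example of how a client could assert the same property with clientv3, with the endpoint, key prefix, and starting revision as placeholder assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; point this at the reconnecting member under test.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch a placeholder prefix from revision 1 and assert that the
	// ModRevisions observed on this single watch never go backwards.
	var lastRev int64
	for resp := range cli.Watch(context.Background(), "/test/", clientv3.WithPrefix(), clientv3.WithRev(1)) {
		if err := resp.Err(); err != nil {
			log.Fatalf("watch error: %v", err)
		}
		for _, ev := range resp.Events {
			if ev.Kv.ModRevision < lastRev {
				log.Fatalf("watch went back in time: got revision %d after %d", ev.Kv.ModRevision, lastRev)
			}
			lastRev = ev.Kv.ModRevision
		}
	}
}
```

During the snapshot recovery described in this issue, a check like this would fail on the reconnecting member, because the revisions delivered on its watch jump backwards after the member restores from the leader's snapshot.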
What did you expect to happen?
As #15104 (comment) only introduced a failpoint and didn't make any changes to traffic, I would not expect Snapshot failpoints to cause problems. This is a strong sign of an etcd issue.
How can we reproduce it (as minimally and precisely as possible)?
make gofail-enable && make && make gofail-disable
GO_TEST_FLAGS='-v --count=100 --failfast --run TestLinearizability/Snapshot ' make test-linearizability
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response