Fix the potential data loss for clusters with only one member #14394
Conversation
Codecov Report
@@ Coverage Diff @@
## main #14394 +/- ##
==========================================
+ Coverage 75.34% 75.38% +0.04%
==========================================
Files 457 457
Lines 37185 37208 +23
==========================================
+ Hits 28016 28049 +33
+ Misses 7405 7394 -11
- Partials 1764 1765 +1
cc @ptabor @serathius @spzala This might be an important fix, please take a look, thx. Note that clusters with only one member aren't recommended for production usage on the existing official releases, including 3.5.[0-4] and 3.4.x, because they may lose data when etcd crashes under high load. cc @dims @liggitt
server/etcdserver/raft.go (outdated diff)
// It further means the data must have been committed.
// Note: for clusters with multiple members, the raft will never send identical
// unstable entries and committed entries to etcdserver.
func shouldWaitWALSync(unstableEntries []raftpb.Entry, committedEntries []raftpb.Entry) bool {
Is there a way to have a single code path that is safe regardless of whether we're in multi-server or single server mode?
I do not get your point. For a multi-member cluster, there is no need to wait for the WAL sync, and this function will always return false.
There were two solutions in my mind before delivering this PR.
The first solution is to enhance the existing raft protocol. For a cluster with only one member, the existing raft workflow commits each log immediately when it receives a proposal, because it doesn't need to get confirmation from itself. Accordingly, it sends identical unstable logs and committed logs to etcdserver. The solution is to send a message to etcdserver and wait for the confirmation, no matter whether it's a single-server or multi-server cluster. The good side of this solution is that it looks elegant. The bad side is that it has some impact on performance, and it also requires changing the stable raft package. It might be what you mean by "a single code path".
The second solution is what this PR delivers. The good side is that it has little performance impact, and no impact on multi-server clusters at all. The bad side is that it complicates the applying workflow, but that should be acceptable.
Eventually I followed the second solution for now.
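For context, here is a minimal sketch of what an overlap check with the signature shown in the diff excerpt above could look like. It is an illustration rather than the exact code in this PR; the package name and raftpb import path are assumptions.

```go
package etcdserver

import "go.etcd.io/etcd/raft/v3/raftpb"

// shouldWaitWALSync reports whether the committed entries overlap with the
// still-unstable (not yet fsynced) entries. In a multi-member cluster an
// entry is only committed after a quorum has persisted it, so the two slices
// never overlap and this always returns false; only a single-member cluster
// can commit an entry that is still unstable locally.
func shouldWaitWALSync(unstableEntries []raftpb.Entry, committedEntries []raftpb.Entry) bool {
	if len(unstableEntries) == 0 || len(committedEntries) == 0 {
		return false
	}
	// Terms and indexes only increase, so comparing the last committed entry
	// against the first unstable entry is enough to detect an overlap.
	lastCommitted := committedEntries[len(committedEntries)-1]
	firstUnstable := unstableEntries[0]
	if lastCommitted.Term != firstUnstable.Term {
		return lastCommitted.Term > firstUnstable.Term
	}
	return lastCommitted.Index >= firstUnstable.Index
}
```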
The proposed fix by @ahrtr makes sense to me.
Force-pushed from 61dca0c to db01837.
Force-pushed from db01837 to 2b2bb3e.
Performance comparison

Linux server configuration, benchmark commands, and result tables for the one-server and three-server clusters, plus summary points (collapsed in the original comment; not recoverable here). Note that I tried multiple times on the one-server cluster, and got stable results.
Makes sense, it looks like the performance regression is mostly visible in the 10th percentile of the latency distribution. I would expect that the much lower 10th percentile previously benefited from the lack of durability. I think it's reasonable to trade latency for durability for those requests. I support backporting this change as it is to v3.4 and v3.5.
LGTM, however let's wait for more maintainers to have a look
Force-pushed from 2b2bb3e to 3243706.
Thanks @serathius for the quick review. @ptabor and @spzala, please take a look, thx.
ack. One more point: the faster the disk I/O, the smaller the performance downgrade. It means that on a fast enough disk, the performance downgrade should be even smaller.
I will work with the K8s Scalability folks to validate this change for K8s.
@ahrtr thanks for the great work and benchmark results! We need to add an entry to the changelog for this, but that can be done separately. Also, the backport approach sounds good, thanks @serathius
Thanks @spzala. We need to decide whether to merge this one or #14400. Either way, I will update the changelog in a separate PR, after the backport is done.
Closing this PR because we eventually merged #14400.
I ran this PR against its main merge-base twice (on my 2021 Mac M1 pro), and in both cases this PR was slightly faster, using the benchmark invocation from [^1].

2819.6 vs 2808.4
2873.1 vs 2835

Full output below.

----

Script:

```
killall etcd
rm -rf default.etcd
scripts/build.sh
nohup ./bin/etcd --quota-backend-bytes=4300000000 &
sleep 10
f=bench-$(git log -1 --pretty=%s | sed -E 's/[^A-Za-z0-9]+/_/g').txt
go run ./tools/benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=4000000000 --key-size=128 --val-size=10240 --total=200000 --rate=40000 | tee "${f}"
```

PR:

```
Summary:
  Total:        70.9320 secs.
  Slowest:      0.3003 secs.
  Fastest:      0.0044 secs.
  Average:      0.0707 secs.
  Stddev:       0.0437 secs.
  Requests/sec: 2819.6030 (second run: 2873.0935)

Response time histogram:
  0.0044 [1]      |
  0.0340 [2877]   |
  0.0636 [119485] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0932 [17436]  |∎∎∎∎∎
  0.1228 [27364]  |∎∎∎∎∎∎∎∎∎
  0.1524 [20349]  |∎∎∎∎∎∎
  0.1820 [10214]  |∎∎∎
  0.2116 [1248]   |
  0.2412 [564]    |
  0.2707 [318]    |
  0.3003 [144]    |

Latency distribution:
  10% in 0.0368 secs.
  25% in 0.0381 secs.
  50% in 0.0416 secs.
  75% in 0.0998 secs.
  90% in 0.1375 secs.
  95% in 0.1571 secs.
  99% in 0.1850 secs.
  99.9% in 0.2650 secs.
```

main:

```
Summary:
  Total:        71.2152 secs.
  Slowest:      0.6926 secs.
  Fastest:      0.0040 secs.
  Average:      0.0710 secs.
  Stddev:       0.0461 secs.
  Requests/sec: 2808.3903 (second run: 2834.98)

Response time histogram:
  0.0040 [1]      |
  0.0728 [125816] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1417 [59127]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.2105 [13476]  |∎∎∎∎
  0.2794 [1125]   |
  0.3483 [137]    |
  0.4171 [93]     |
  0.4860 [193]    |
  0.5549 [4]      |
  0.6237 [16]     |
  0.6926 [12]     |

Latency distribution:
  10% in 0.0367 secs.
  25% in 0.0379 secs.
  50% in 0.0417 secs.
  75% in 0.0993 secs.
  90% in 0.1367 secs.
  95% in 0.1567 secs.
  99% in 0.1957 secs.
  99.9% in 0.4361 secs.
```

[^1]: etcd-io#14394 (comment)

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Fix #14370
For a cluster with only one member, the raft always sends identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually only partially) the
applying workflow.
When the client receives the response, it doesn't mean etcd has already
successfully saved the data, including BoltDB and WAL, because:
1. etcd commits the boltDB transaction periodically instead of on each request;
2. etcd saves WAL entries in parallel with applying the committed entries.
Accordingly, it may run into a situation of data loss when etcd crashes
immediately after responding to the client and before BoltDB and the WAL
have successfully saved the data to disk.
Note that this issue can only happen for clusters with only one member.
For clusters with multiple members, it isn't an issue, because etcd will
not commit & apply the data before it has been replicated to a majority of members.
When the client receives the response, it means the data must have been applied.
It further means the data must have been committed.
Note: for clusters with multiple members, the raft will never send identical
unstable entries and committed entries to etcdserver.
Signed-off-by: Benjamin Wang wachao@vmware.com
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
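To illustrate the ordering described in this PR, below is a small, self-contained Go sketch (purely hypothetical names, not etcd's actual code) of the race the fix closes: in a single-member cluster the apply path must not respond to the client until the WAL path has fsynced the same entries.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	walSynced := make(chan struct{}) // hypothetical notifier, closed after the WAL save + fsync

	// WAL goroutine: persist the unstable entries, fsync, then notify.
	go func() {
		time.Sleep(10 * time.Millisecond) // stands in for the WAL write + fsync
		close(walSynced)
	}()

	// Apply path: in a single-member cluster the committed entries may still be
	// unstable, so (per this fix) wait for the WAL sync before applying them and
	// responding to the client. Without this wait, a crash right after the
	// response could lose the entries.
	<-walSynced
	fmt.Println("entries are durable; safe to apply and respond to the client")
}
```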