Fix the potential data loss for clusters with only one member #14394

ahrtr · 2022-08-27T22:13:40Z

For a cluster with only one member, raft always sends identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually, only partially
finishes) the apply workflow.

When the client receives the response, it doesn't mean etcd has already
successfully saved the data to both BoltDB and the WAL, because:

1. etcd commits the BoltDB transaction periodically instead of on each request;
2. etcd saves WAL entries in parallel with applying the committed entries.

Accordingly, data may be lost if etcd crashes immediately after responding
to the client and before BoltDB and the WAL have successfully saved the
data to disk.
Note that this issue can only happen in clusters with only one member.

For clusters with multiple members, this isn't an issue, because etcd will
not commit and apply the data before it has been replicated to a majority of members.
When the client receives the response, the data must have been applied,
which in turn means it must have been committed.
Note: for clusters with multiple members, raft will never send identical
unstable entries and committed entries to etcdserver.

Signed-off-by: Benjamin Wang wachao@vmware.com

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
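The overlap condition described above can be sketched in Go. This is a minimal illustration, not the PR's actual code: the function name and parameter order follow the signature quoted in the review thread, but the `entry` type is a simplified stand-in for `raftpb.Entry` and the body is an assumption about how "committed entries that are still unstable" could be detected.

```go
package main

import "fmt"

// entry is a simplified stand-in for raftpb.Entry (assumption: only Term and
// Index are needed for this check).
type entry struct {
	Term  uint64
	Index uint64
}

// shouldWaitWALSync reports whether the last committed entry is at or beyond
// the first unstable entry, i.e. whether some committed entry has not yet
// been persisted to the WAL. In that case applying must wait for the WAL sync.
func shouldWaitWALSync(unstable, committed []entry) bool {
	if len(unstable) == 0 || len(committed) == 0 {
		return false
	}
	first := unstable[0]
	last := committed[len(committed)-1]
	if last.Term != first.Term {
		return last.Term > first.Term
	}
	return last.Index >= first.Index
}

func main() {
	// Single-member cluster: the same entry is both unstable and committed.
	single := []entry{{Term: 2, Index: 10}}
	fmt.Println(shouldWaitWALSync(single, single)) // true

	// Multi-member cluster: committed entries were persisted in an earlier batch.
	unstable := []entry{{Term: 2, Index: 11}}
	committed := []entry{{Term: 2, Index: 10}}
	fmt.Println(shouldWaitWALSync(unstable, committed)) // false
}
```

Comparing only the boundary entries is enough here because both slices are assumed to be ordered by index, as raft log entries are.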

codecov-commenter · 2022-08-27T22:27:19Z

Codecov Report

Merging #14394 (3243706) into main (f56e0d0) will increase coverage by 0.04%.
The diff coverage is 95.55%.

@@            Coverage Diff             @@
##             main   #14394      +/-   ##
==========================================
+ Coverage   75.34%   75.38%   +0.04%     
==========================================
  Files         457      457              
  Lines       37185    37208      +23     
==========================================
+ Hits        28016    28049      +33     
+ Misses       7405     7394      -11     
- Partials     1764     1765       +1

| Flag | Coverage Δ | |
|---|---|---|
| all | `75.38% <95.55%> (+0.04%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| server/etcdserver/server.go | `85.59% <91.66%> (+0.18%)` | ⬆️ |
| server/etcdserver/raft.go | `89.41% <100.00%> (+0.78%)` | ⬆️ |
| client/v3/leasing/util.go | `91.66% <0.00%> (-6.67%)` | ⬇️ |
| client/v3/leasing/cache.go | `87.77% <0.00%> (-3.89%)` | ⬇️ |
| client/pkg/v3/testutil/recorder.go | `76.27% <0.00%> (-3.39%)` | ⬇️ |
| pkg/traceutil/trace.go | `96.15% <0.00%> (-1.93%)` | ⬇️ |
| server/etcdserver/api/rafthttp/msgappv2_codec.go | `69.56% <0.00%> (-1.74%)` | ⬇️ |
| client/v3/leasing/kv.go | `89.70% <0.00%> (-1.67%)` | ⬇️ |
| server/etcdserver/api/v3rpc/interceptor.go | `76.56% <0.00%> (-1.05%)` | ⬇️ |
| server/etcdserver/corrupt.go | `88.77% <0.00%> (-0.67%)` | ⬇️ |

... and 12 more

ahrtr · 2022-08-27T22:35:47Z

cc @ptabor @serathius @spzala This might be an important fix, please take a look, thx.

Note that clusters with only one member aren't recommended for production use with the existing official releases, including 3.5.[0-4] and 3.4.x, because they may lose data if etcd crashes while under high load. cc @dims @liggitt

liggitt · 2022-08-27T22:54:09Z

server/etcdserver/raft.go

+// It further means the data must have been committed.
+// Note: for clusters with multiple members, the raft will never send identical
+// unstable entries and committed entries to etcdserver.
+func shouldWaitWALSync(unstableEntries []raftpb.Entry, committedEntries []raftpb.Entry) bool {


Is there a way to have a single code path that is safe regardless of whether we're in multi-server or single server mode?

ahrtr · 2022-08-27

I do not get your point.

For a multi-member cluster, there is no need to wait for the WAL sync, and this function will always return false.

ahrtr · 2022-08-27

I had two solutions in mind before submitting this PR.

The first solution is to enhance the existing raft protocol. For clusters with only one member, the existing raft workflow commits each log entry immediately when it receives the proposal, because it doesn't wait for a confirmation from itself. Accordingly, it sends identical unstable and committed log entries to etcdserver. The solution is to send a message to etcdserver and wait for the confirmation, regardless of whether it's a single-server or multi-server cluster. The upside of this solution is that it looks elegant. The downside is that it has some impact on performance, and it also requires updating the stable raft package. It might be what you mean by a single code path.

The second solution is what this PR delivers. The upside is that it has little performance impact, and no impact on multi-server clusters at all. The downside is that it complicates the apply workflow, but that should be acceptable.

Eventually I went with the second solution for now.
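The ordering that the second solution enforces can be modeled with a small Go sketch. This is an illustrative model only, under the assumption that the applier is a separate goroutine that can be made to block on a "WAL synced" signal; `eventLog`, `processBatch`, and the step names are hypothetical and do not come from the etcd code base.

```go
package main

import (
	"fmt"
	"sync"
)

// eventLog records the order in which steps happen; safe for concurrent use.
type eventLog struct {
	mu     sync.Mutex
	events []string
}

func (l *eventLog) add(e string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, e)
}

// processBatch models one round of handling committed entries.
// When waitWALSync is true the applier blocks until the WAL save is done;
// otherwise applying and saving run in parallel, in no guaranteed order.
func processBatch(committed []string, waitWALSync bool) []string {
	log := &eventLog{}
	walSynced := make(chan struct{})
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // applier goroutine
		defer wg.Done()
		if waitWALSync {
			<-walSynced // block until the entries are durable
		}
		for _, e := range committed {
			log.add("apply " + e)
		}
		log.add("respond to client")
	}()

	log.add("WAL saved") // stands in for the real write + fsync
	close(walSynced)
	wg.Wait()
	return log.events
}

func main() {
	for _, e := range processBatch([]string{"put k1", "put k2"}, true) {
		fmt.Println(e)
	}
}
```

With `waitWALSync` set to true, "WAL saved" is always recorded before any apply step and before the response, which is the durability ordering the fix is after. With it set to false, the order between the WAL save and the apply steps is left to the scheduler, which mirrors the parallel behavior that is safe only when the committed entries are already persisted.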

serathius · 2022-08-28

The fix proposed by @ahrtr makes sense to me.

dims · 2022-08-28T22:39:32Z

cc @chaochn47 @geetasg

ahrtr · 2022-08-29T00:48:24Z

Performance comparison

Linux server configuration

MemTotal:       16423532 kB
16 CPUs, each an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz

Commands:

./etcd  --quota-backend-bytes=4300000000  
./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=4000000000 --key-size=128 --val-size=10240  --total=200000 --rate=40000

Result on one-server cluster

Note that I ran the benchmark multiple times and got stable results.

Result on `main`

Summary:
  Total:	56.1029 secs.
  Slowest:	0.1439 secs.
  Fastest:	0.0021 secs.
  Average:	0.0559 secs.
  Stddev:	0.0289 secs.
  Requests/sec:	3564.8803

Response time histogram:
  0.0021 [1]	|
  0.0163 [20555]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0304 [14212]	|∎∎∎∎∎∎∎∎∎
  0.0446 [38302]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0588 [56887]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0730 [12060]	|∎∎∎∎∎∎∎∎
  0.0872 [16407]	|∎∎∎∎∎∎∎∎∎∎∎
  0.1013 [25805]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1155 [11993]	|∎∎∎∎∎∎∎∎
  0.1297 [3452]	|∎∎
  0.1439 [326]	|

Latency distribution:
  10% in 0.0161 secs.
  25% in 0.0361 secs.
  50% in 0.0506 secs.
  75% in 0.0785 secs.
  90% in 0.0981 secs.
  95% in 0.1087 secs.
  99% in 0.1203 secs.
  99.9% in 0.1360 secs.

Result on branch `one_member_data_loss`

Summary:
  Total:	59.1221 secs.
  Slowest:	0.1435 secs.
  Fastest:	0.0128 secs.
  Average:	0.0590 secs.
  Stddev:	0.0213 secs.
  Requests/sec:	3382.8273

Response time histogram:
  0.0128 [1]	|
  0.0259 [1029]	|
  0.0390 [36291]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0520 [64848]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0651 [25016]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0782 [26937]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0912 [25053]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1043 [16544]	|∎∎∎∎∎∎∎∎∎∎
  0.1174 [3387]	|∎∎
  0.1305 [764]	|
  0.1435 [130]	|

Latency distribution:
  10% in 0.0361 secs.
  25% in 0.0417 secs.
  50% in 0.0516 secs.
  75% in 0.0754 secs.
  90% in 0.0920 secs.
  95% in 0.0965 secs.
  99% in 0.1116 secs.
  99.9% in 0.1267 secs.

Summary

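Overall, throughput on the single-member cluster drops from about 3564 to about 3382 requests/sec: `main` is about 5.38% faster than the branch ((3564 - 3382) / 3382), which is a decrease of about 5.1% relative to `main` ((3564 - 3382) / 3564).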
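The percentage depends on which run is taken as the baseline. A small Go sketch of the arithmetic, using the Requests/sec figures reported above (the helper name `relativeChange` is made up for illustration):

```go
package main

import "fmt"

// relativeChange returns (after - before) / before as a percentage.
func relativeChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Requests/sec from the one-server runs above.
	const mainRPS, branchRPS = 3564.8803, 3382.8273

	// Branch measured against main as the baseline.
	fmt.Printf("branch vs main: %.2f%%\n", relativeChange(mainRPS, branchRPS)) // -5.11%
	// Main measured against the branch as the baseline.
	fmt.Printf("main vs branch: %.2f%%\n", relativeChange(branchRPS, mainRPS)) // 5.38%
}
```

Both figures describe the same gap of roughly 182 requests/sec; only the denominator differs.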

ahrtr · 2022-08-29T02:23:54Z

Result on three-server cluster

Commands:

$ goreman start  

$ ./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=1000000000 --key-size=128 --val-size=10240  --total=100000 --rate=40000

Result on `main`

Summary:
  Total:	58.5007 secs.
  Slowest:	0.2334 secs.
  Fastest:	0.0135 secs.
  Average:	0.1168 secs.
  Stddev:	0.0349 secs.
  Requests/sec:	1709.3817

Response time histogram:
  0.0135 [1]	|
  0.0355 [166]	|
  0.0575 [2216]	|∎∎∎
  0.0795 [16128]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1015 [19326]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1235 [14502]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1455 [25602]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1674 [15279]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1894 [5343]	|∎∎∎∎∎∎∎∎
  0.2114 [1140]	|∎
  0.2334 [297]	|

Latency distribution:
  10% in 0.0719 secs.
  25% in 0.0857 secs.
  50% in 0.1208 secs.
  75% in 0.1420 secs.
  90% in 0.1606 secs.
  95% in 0.1727 secs.
  99% in 0.1920 secs.
  99.9% in 0.2294 secs.

Result on branch `one_member_data_loss`

Summary:
  Total:	57.8285 secs.
  Slowest:	0.2478 secs.
  Fastest:	0.0114 secs.
  Average:	0.1155 secs.
  Stddev:	0.0338 secs.
  Requests/sec:	1729.2500

Response time histogram:
  0.0114 [1]	|
  0.0350 [48]	|
  0.0587 [2551]	|∎∎∎
  0.0823 [20336]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1060 [16035]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1296 [21643]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1532 [26055]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1769 [11141]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.2005 [1703]	|∎∎
  0.2241 [465]	|
  0.2478 [22]	|

Latency distribution:
  10% in 0.0714 secs.
  25% in 0.0839 secs.
  50% in 0.1199 secs.
  75% in 0.1410 secs.
  90% in 0.1594 secs.
  95% in 0.1664 secs.
  99% in 0.1883 secs.
  99.9% in 0.2166 secs.

Summary

Overall, the performance results are essentially the same (1709.38 requests/sec on `main` vs. 1729.25 on the branch).

ahrtr · 2022-08-29T02:28:29Z

Points:

1. There is no performance impact on multi-server clusters.
2. There is a slight performance degradation (about 5.38%) for single-server clusters. Correctness takes precedence over performance, so I think the PR is acceptable and should be cherry-picked to 3.5, and probably 3.4.

serathius · 2022-08-29T07:13:07Z

Makes sense. It looks like the performance regression is mostly visible at the 10th percentile of the latency distribution. I would expect that the much lower 10th-percentile latency on `main` benefited from the lack of durability. I think it's reasonable to trade latency for durability for those requests.

I support backporting this change as is to v3.4 and v3.5.
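To make the percentile observation concrete, the sketch below compares the one-server latency distributions reported earlier in this thread and finds where the branch is slower by the largest margin. The numbers are copied from those benchmark outputs; the `point` type and the `largestRegression` helper are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// point holds one row of the latency distribution for both runs.
type point struct {
	percentile string
	mainSecs   float64
	branchSecs float64
}

// oneServer is the one-server latency distribution from the benchmark output.
var oneServer = []point{
	{"10%", 0.0161, 0.0361},
	{"25%", 0.0361, 0.0417},
	{"50%", 0.0506, 0.0516},
	{"75%", 0.0785, 0.0754},
	{"90%", 0.0981, 0.0920},
	{"95%", 0.1087, 0.0965},
	{"99%", 0.1203, 0.1116},
	{"99.9%", 0.1360, 0.1267},
}

// largestRegression returns the percentile whose latency grew the most on
// the branch, together with the increase in seconds.
func largestRegression(points []point) (string, float64) {
	worst, delta := "", 0.0
	for _, p := range points {
		if d := p.branchSecs - p.mainSecs; d > delta {
			worst, delta = p.percentile, d
		}
	}
	return worst, delta
}

func main() {
	for _, p := range oneServer {
		fmt.Printf("%-6s main=%.4fs branch=%.4fs delta=%+.4fs\n",
			p.percentile, p.mainSecs, p.branchSecs, p.branchSecs-p.mainSecs)
	}
	worst, delta := largestRegression(oneServer)
	fmt.Printf("largest regression at %s: +%.4fs\n", worst, delta)
}
```

In this data set the branch is slower only at the 10th, 25th, and 50th percentiles, and from the 75th percentile upward it is slightly faster than `main`.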

serathius

LGTM, however let's wait for more maintainers to have a look

ahrtr · 2022-08-29T07:52:52Z

> LGTM, however let's wait for more maintainers to have a look

Thanks @serathius for the quick review. @ptabor and @spzala, please take a look, thx.

ahrtr · 2022-08-29T07:58:52Z

> Makes sense, looks like the performance regression is mostly visible in 10 percentile of latency distribution. I would expect that much lower 10%ile benefited from lack of durability. I think it's reasonable to trade latency for durability for those requests.
>
> I support backporting this change as it is for v3.4 and v3.5

ack. One more point: the faster the disk I/O, the smaller the performance degradation, so with sufficiently fast disks the impact should be even smaller.

serathius · 2022-08-29T08:07:09Z

I will work with the K8s Scalability folks to validate this change for K8s.

spzala

@ahrtr thanks for the great work and benchmark results! We need to add an entry to changelog for this, but that can be done separately. Also, the backport approach sounds good, thanks @serathius

ahrtr · 2022-08-30T07:46:13Z

> @ahrtr thanks for the great work and benchmark results! We need to add an entry to changelog for this, but that can be done separately. Also, the backport approach sounds good, thanks @serathius

Thanks @spzala. We need to decide whether to merge this one or #14400. Either way, I will update the changelog in a separate PR after backporting.

ahrtr · 2022-09-05T21:06:02Z

Closing this PR because we eventually merged #14400.

I ran this PR against its main merge-base twice (on my 2021 Mac M1 pro), and in both cases this PR was slightly faster, using the benchmark invocation from [^1].

2819.6 vs 2808.4
2873.1 vs 2835

Full output below.

----

Script:

```
killall etcd
rm -rf default.etcd
scripts/build.sh
nohup ./bin/etcd --quota-backend-bytes=4300000000 &
sleep 10
f=bench-$(git log -1 --pretty=%s | sed -E 's/[^A-Za-z0-9]+/_/g').txt
go run ./tools/benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=4000000000 --key-size=128 --val-size=10240 --total=200000 --rate=40000 | tee "${f}"
```

PR:

```
Summary:
  Total:	70.9320 secs.
  Slowest:	0.3003 secs.
  Fastest:	0.0044 secs.
  Average:	0.0707 secs.
  Stddev:	0.0437 secs.
  Requests/sec:	2819.6030 (second run: 2873.0935)

Response time histogram:
  0.0044 [1]	|
  0.0340 [2877]	|
  0.0636 [119485]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0932 [17436]	|∎∎∎∎∎
  0.1228 [27364]	|∎∎∎∎∎∎∎∎∎
  0.1524 [20349]	|∎∎∎∎∎∎
  0.1820 [10214]	|∎∎∎
  0.2116 [1248]	|
  0.2412 [564]	|
  0.2707 [318]	|
  0.3003 [144]	|

Latency distribution:
  10% in 0.0368 secs.
  25% in 0.0381 secs.
  50% in 0.0416 secs.
  75% in 0.0998 secs.
  90% in 0.1375 secs.
  95% in 0.1571 secs.
  99% in 0.1850 secs.
  99.9% in 0.2650 secs.
```

main:

```
Summary:
  Total:	71.2152 secs.
  Slowest:	0.6926 secs.
  Fastest:	0.0040 secs.
  Average:	0.0710 secs.
  Stddev:	0.0461 secs.
  Requests/sec:	2808.3903 (second run: 2834.98)

Response time histogram:
  0.0040 [1]	|
  0.0728 [125816]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1417 [59127]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.2105 [13476]	|∎∎∎∎
  0.2794 [1125]	|
  0.3483 [137]	|
  0.4171 [93]	|
  0.4860 [193]	|
  0.5549 [4]	|
  0.6237 [16]	|
  0.6926 [12]	|

Latency distribution:
  10% in 0.0367 secs.
  25% in 0.0379 secs.
  50% in 0.0417 secs.
  75% in 0.0993 secs.
  90% in 0.1367 secs.
  95% in 0.1567 secs.
  99% in 0.1957 secs.
  99.9% in 0.4361 secs.
```

[^1]: etcd-io#14394 (comment)

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>

ahrtr mentioned this pull request Aug 27, 2022: Durability API guarantee broken in single node cluster #14370 (Closed)

ahrtr requested review from ptabor, serathius and spzala August 27, 2022 22:47

liggitt reviewed Aug 27, 2022 (server/etcdserver/raft.go)

ahrtr force-pushed the one_member_data_loss branch 3 times, most recently from 61dca0c to db01837, August 28, 2022 06:36

serathius reviewed Aug 28, 2022 (server/etcdserver/raft.go, server/etcdserver/raft_test.go)

lavacat reviewed Aug 28, 2022 (server/etcdserver/raft.go)

ahrtr force-pushed the one_member_data_loss branch from db01837 to 2b2bb3e, August 28, 2022 21:49

ahrtr added the priority/important-soon label Aug 29, 2022

serathius reviewed Aug 29, 2022 (server/etcdserver/raft.go)

serathius approved these changes Aug 29, 2022

ahrtr force-pushed the one_member_data_loss branch from 2b2bb3e to 3243706, August 29, 2022 07:51

spzala approved these changes Aug 29, 2022

ahrtr mentioned this pull request Aug 30, 2022: [Second Solution] Fix the potential data loss for clusters with only one member (simpler solution) #14400 (Merged)

ahrtr mentioned this pull request Aug 31, 2022: test pr ahrtr/gocontainer#2 (Closed)

ahrtr closed this Sep 5, 2022

ahrtr mentioned this pull request Sep 9, 2022: raft: don't emit unstable CommittedEntries #14413 (Merged)

ahrtr mentioned this pull request Oct 17, 2022: raft: make Message.Snapshot nullable, halve struct size #14592 (Merged)

ahrtr mentioned this pull request Aug 31, 2023: server: optimizing memory overhead of copy operation in ConcurrentReadTxn #16508 (Merged)

ahrtr mentioned this pull request Jan 26, 2024: Duplicated watch event detected in robustness test #17247 (Closed)
