Durability API guarantee broken in single node cluster #14370
Comments
The explanation and steps below are quite tedious to follow, so here is a shorter version instead.
//1. Incorporate this code diff and run etcd-server: https://github.com/etcd-io/etcd/compare/release-3.5...hasethuraman:release-3.5?expand=1
//2. Add a key value. Allow etcdserver to acknowledge and exit immediately (just a sleep and exit to simulate the explanation)
//3. Remove the control flag file and restart the etcd server
//4. Check if the key is present
// We can see no key-value
It seems that you raised a duplicate of #14364. I have already answered it in #14364 (comment) and #14364 (comment).
@ahrtr it is not about HA. I am saying there appears to be data loss.
@hasethuraman I haven't tried to repro, but I agree with @ahrtr: you should try a 3-node cluster. The Raft protocol doesn't really work on 1 node; I bet you won't be able to reproduce this. etcd allows launching 1-node clusters only to make it easy to experiment with the API, not for any production use.
Thanks @lavacat. I considered that, but isn't the raft message RTT just covering up this observation? I am posting the screenshot here, which also shows what I think: basically, the wal write should happen before the database/raft apply for strong consistency.
Hey @hasethuraman, can you please not use screenshots? It's much easier to just link the code: etcd/server/etcdserver/raft.go, lines 212 to 239 at 4f0e92d
I don't think you correctly identified the order of operations. Please remember that committing to the db is done after the HardState and the WAL entries are added to the WAL. So the invocations you highlighted ... EDIT: In a single node cluster ...
I agree that it's a little confusing that etcd returns success/OK but the data is actually lost, even though there is only one member in the cluster. In theory, we could make everything a synchronous call and not respond to the client until everything (including boltDB and the WAL) is successfully persisted. Obviously that would cause a huge reduction in performance, and such a design wouldn't make sense at all. To prevent data loss, etcd has the WAL, which is similar to MySQL's redo log. To protect against one member going down entirely for whatever reason (e.g. disk corruption), we recommend setting up a cluster with at least 3 members. There is no perfect solution, but this is a good solution for now. But you intentionally fail both the WAL persistence and the boltDB write, and with only one member. So it's expected behavior by design.
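To make the trade-off described above concrete, here is a minimal Go sketch (not etcd code; the file handling and function names are made up for illustration) contrasting the fully synchronous ordering with the ack-before-fsync ordering:

```go
// Minimal sketch, assuming a plain file stands in for the log; not etcd code.
package walsketch

import "os"

// ackAfterFsync is the fully synchronous ordering: durable, but every
// request pays for an fsync before the client sees OK.
func ackAfterFsync(wal *os.File, rec []byte, ack func()) error {
	if _, err := wal.Write(rec); err != nil {
		return err
	}
	if err := wal.Sync(); err != nil { // persist before acknowledging
		return err
	}
	ack() // the client only sees OK for data that is already on disk
	return nil
}

// ackBeforeFsync acknowledges first and syncs later; a crash between the
// ack and the Sync loses a write the client already saw succeed.
func ackBeforeFsync(wal *os.File, rec []byte, ack func()) error {
	if _, err := wal.Write(rec); err != nil {
		return err
	}
	ack()             // client sees OK here...
	return wal.Sync() // ...but the data is not durable until this returns
}
```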
I'm just flabbergasted by this conclusion. It means that a single node cluster doesn't provide a durability guarantee. The documentation about API guarantees does not state this anywhere: https://etcd.io/docs/v3.5/learning/api_guarantees/#durability
I will take care of this, either updating the doc or enhancing the existing etcdserver/raft workflow. Please note that it can only happen for a cluster with only one member.
Kube was designed with the assumption that PUT was durable and etcd was crash consistent for accepted writes, regardless of quorum size. Maybe @xiang90 or @philips could weigh in on whether this was intentional - it is certainly not expected, and I'd say my read of the ecosystem over the last 10 years has been that all accepted writes should be crash safe regardless of cluster size.
Agreed. The only discussion around relaxing single node safety guarantees I was aware of was #11930, and that reinforced my expectation that even single-node etcd servers default to being crash safe w.r.t. successfully persisting to disk prior to treating a write request as a success, and that overriding that default requires the admin explicitly opting into something unsafe.
I will deliver a PR to fix this for clusters with only one member.
@ahrtr Since I did this locally and it worked as expected, I thought of sharing it. Please let me know if the fix you are planning is going to be different.
I don't think that change is what we expect, because it will definitely cause a big reduction in performance.
I just delivered PR #14394 to fix this issue. Please let me know whether you can still reproduce the issue with the PR. cc @hasethuraman
The question is how we want to roll out this fix. It's not a breaking change, as it restores etcd durability, which matches user expectations; however, it is expected to come with a performance regression. For v3.6 we should make single etcd instances durable, but we should avoid the backport being too disruptive. I think it's crucial to do benchmarking to confirm how impactful the change is. If the impact is small, I would backport it as is. If it could disrupt larger clusters, I would want to leave an escape-hatch flag so users can return to the previous behavior. If the regression is very big, we could consider making the backported change off by default.
Minimal repro, using pre-existing failpoints in etcdserver:
Please see my comments #14394 (comment), #14394 (comment) and #14394 (comment). In short, I think we should backport the fix to 3.5, and probably 3.4.
These steps don't look correct to me. Although you can reproduce this issue easily using them, it isn't stable: we need to make sure that both boltDB and the WAL fail to save the data, but with your steps boltDB may save the data successfully, or the client might get an error response. The correct steps are:
For anyone's reference: https://github.com/etcd-io/gofail/blob/master/doc/design.md
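For illustration only, this is roughly how a gofail failpoint is declared and armed; the package, function, failpoint placement, and the activation syntax in the comments are assumptions based on the gofail design doc linked above, not a verified etcd recipe:

```go
// Illustrative sketch of the gofail mechanism; not actual etcd code.
package storagesketch

import "os"

func saveToWAL(path string, data []byte) error {
	// gofail: var raftBeforeSave struct{}
	// When the code is built with gofail enabled, the comment above becomes a
	// real failpoint that can be armed at runtime, e.g. (assumed syntax):
	//   GOFAIL_FAILPOINTS='<pkg-path>/raftBeforeSave=panic' ./bin/etcd
	// which crashes the process right before the write below.
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}
```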
I think we should fix etcd's Raft implementation rather than the etcd "apply" layer. The Raft protocol is pretty explicit about this:
It seems that etcd's Raft implementation skips the rules needed for 'committing' an entry... So in the case of a 1-member Raft group, the leader needs to write and flush first, before considering the entry to be "committed".
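A toy sketch of the commit rule under discussion (an assumption-laden illustration, not etcd/raft code): an entry is committed once it is durably stored on a quorum of voters, and with a single voter that quorum is the leader itself, so the commit index may only advance after the leader's own fsync.

```go
// Toy sketch of the commit rule, assuming matchIndex[i] only reflects
// entries that voter i has durably persisted; not etcd/raft code.
package raftsketch

import "sort"

// commitIndex returns the highest log index stored on a quorum of voters.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(matchIndex)/2 + 1
	return sorted[quorum-1]
}

// With a single voter, commitIndex([]uint64{n}) == n, and n must only be
// reported (and the client acked) after the entry at index n is fsynced.
```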
Thanks @ptabor for the feedback. The 10.2.1 optimization is a performance optimization for multi-member clusters, and I don't think it is the root cause of this issue. The optimization is based on the key point that a follower must sync the WAL entries before responding to the leader; in any case, it applies to multi-member clusters. For the case of a 1-member cluster, I believe the solution "...
Anyway, I agree that we should enhance the raft protocol for a one-member cluster, but we'd better do it in the future instead of now, because we have higher-priority things to do. WDYT? @ptabor @serathius @spzala
Thank you @ahrtr. My intuition is that 1. and 2. are premature optimisations.
If the Raft change is an O(10-20) line change and does not show the (more-significant) performance degradation, I would prefer that over introducing another 'case' (walNatifyc) on the etcd side. I think it will:
If applications want to do 'hedging' (apply early but do not commit to bolt when the probability of commit is high), such an optimisation can be added in the etcd layer on top of a proper raft protocol (although I don't think it's worth it). The Raft change would need to be config-gated to keep the behavior stable, with the previous behavior as the default in 3.4 & 3.5 (overridden in etcd).
tests: Add linearizability tests scenario for #14370
…equire first and last request to be persisted. This assumption is not true during durability issues like etcd-io#14370. In reality we want to avoid situations where the WAL was truncated; for that it's enough to ensure that the first and last operations are present. Found it when running `make test-robustness-issue14370`: instead of getting `Model is not linearizable`, I got that the assumptions were broken. Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Update the robustness README and fix the #14370 reproduction case
I observed the possibility of data loss and I would like the community to comment / correct me if I'm wrong.
Before explaining that, I would like to walk through the happy path when a user does a PUT <key, value>. I have tried to include only the steps necessary to focus on this issue, and I considered a single etcd instance.
====================================================================================
----------api thread --------------
1. User calls etcdctl PUT k v.
2. It lands in v3_server.go::put with the message about k,v.
3. The call delegates through a series of function calls and enters v3_server.go::processInternalRaftRequestOnce.
4. It registers for a signal with the wait utility against this key id.
5. The call delegates further through a series of function calls and enters raft/node.go::stepWithWaitOption(..message..).
6. It wraps this message in a msgResult channel and updates its result channel; then it sends the message to the propc channel.
7. After sending, it waits on msgResult.channel.
----------api thread waiting --------------
8. On seeing a message in the propc channel, raft/node.go::run() wakes up and a sequence of calls adds the message.Entries to the raftLog.
9. It notifies msgResult.channel.
----------api thread wakes--------------
10. Upon seeing the msgResult.channel signal, the api thread wakes, returns down the stack to v3_server.go::processInternalRaftRequestOnce, and waits for the signal that it registered at step 4.
----------api thread waiting --------------
11. In the next iteration of raft/node.go::run(), it gets the entry from the raftLog and adds it to readyc.
12. etcdserver/raft.go::start wakes up on seeing this entry in readyc and adds it to the applyc channel,
13. and then synchronously writes to the wal log ---------------------> wal log
14. etcdserver/server.go wakes up on seeing the entry in the applyc channel (added in step 12).
15. From step 14, the call goes through a series of calls and lands in server.go::applyEntryNormal.
16. applyEntryNormal calls applyV3.apply, which eventually puts the KV into the mvcc kvstore txn kvindex.
17. applyEntryNormal now sends the signal for this key, which basically wakes up the api thread that has been waiting since step 10 (on the signal registered at step 4).
----------api thread wakes--------------
18. The user thread wakes here and sends back the acknowledgement.
----------user sees ok--------------
19. The batcher flushes the entries added to the kvstore txn kvindex to the database file (this can also happen before step 18, based on its timer).
Here, if the goroutine that performs step 13 is pre-empted by the underlying operating system and only rescheduled after step 18 completes, and there is a power failure right at the end of step 18 (after the user has already seen ok), then the kv is written neither to the wal nor to the database file.
I think this is not seen today because the window is small: the server has to stop immediately after step 18, and immediately after step 12 the underlying OS must have pre-empted the etcdserver/raft.go::start goroutine and put it at the end of the runnable queue. Given these multiple conditions, we normally don't see data loss.
But from the code it appears to be possible. To simulate it, I added a sleep (and an exit) after step 12 and after step 19; see the toy Go sketch below. I was able to see ok, but the data is in neither the wal nor the db.
If I am not correct, my apologies; please correct my understanding.
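To make the window above concrete, here is a toy Go model of the described ordering (not etcd code; the channel names only mirror the walkthrough): the entry reaches the apply path and the client is acknowledged before the WAL write of step 13 ever runs.

```go
// Toy model of the ordering described in the walkthrough above; not etcd code.
package main

import (
	"fmt"
	"time"
)

func main() {
	applyc := make(chan string, 1) // stands in for etcdserver's applyc
	acked := make(chan struct{})   // stands in for the wait-utility signal

	// Apply goroutine: applies the entry in memory and triggers the waiter
	// (steps 14-18 in the walkthrough).
	go func() {
		<-applyc     // receive the committed entry
		close(acked) // the waiting api thread is released and acks the client
	}()

	// Raft loop: hand the entry to the applier first, persist the WAL later
	// (steps 12 and 13).
	applyc <- "put k1 v1"
	<-acked
	fmt.Println("client got OK")

	// Crash window: if the process dies here, the entry is neither in the WAL
	// nor flushed to the db file, yet the client already saw OK.
	time.Sleep(10 * time.Millisecond) // simulated pre-emption before step 13
	fmt.Println("WAL write and fsync happen only now (step 13)")
}
```

Running it prints "client got OK" before the simulated WAL fsync line, which is exactly the window that the repro below widens with a sleep and an exit.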
Before the repro, please make these changes:
2. Do the code changes in tx.go
Now follow these steps to repro:
//1. Start etcd server with changes
//2. Add a key value. Allow etcdserver to acknowledge and exit immediately (with just sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK
//3. Remove this control flag file and restart the etcd server
$ rm /tmp/exitnow
//4. Check if key present
$ ./bin/etcdctl get /k --prefix
$
// We can see no key-value