Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible aborted reads with process crashes #14092

Closed
aphyr opened this issue Jun 6, 2022 · 11 comments
Closed

Possible aborted reads with process crashes #14092

aphyr opened this issue Jun 6, 2022 · 11 comments

Comments

@aphyr
Copy link

aphyr commented Jun 6, 2022

What happened?

With etcd 3.5.3 and jetcd 4.1.74.Final, I suspect that the effects of failed transactions can be visible to later readers and writers when processes are killed with kill -9. For example, take this pair of operations from one Jepsen run, in a five-node cluster running on Debian 11 LXC nodes:

            :anomalies {:G1a ({:op {:type :ok,
                                    :f :txn,
                                    :value [[:append 4325 21]
                                            [:r
                                             4325
                                             [5 7 10 11 14 21]]
                                            [:r
                                             4325
                                             [5 7 10 11 14 21]]],
                                    :time 224493456205,
                                    :process 143,
                                    :index 224833},
                               :mop [:r 4325 [5 7 10 11 14 21]],
                               :writer {:type :fail,
                                        :f :txn,
                                        :value [[:append 4325 14]
                                                [:r 4325 nil]
                                                [:r 4324 nil]
                                                [:append 4324 12]],
                                        :time 226412418904,
                                        :process 145,
                                        :error :didn't-succeed,
                                        :index 227733},
                               :element 14}

The :writer transaction began by appending (via CaS) the element 14 to a list stored in key 4325. That transaction returned a TxnResponse to the client rather than throwing, but response.isSucceeded() returned false--we assume that since it did not succeed, it must have aborted. However, the :op transaction performed a read of key 4325 and observed element 14! If the writer did in fact abort, this op would constitute an aborted read.

Here's the full logs from this test run.

What did you expect to happen?

I expect that if a transaction does not succeed, and also does not throw an exception (as jetcd typically does for indefinite errors), it would not be visible to later readers.

How can we reproduce it (as minimally and precisely as possible)?

Clone https://github.com/jepsen-io/etcd at a1bf380a1c09d62bf6bf2e7b97bd02a35902ed36, and run:

lein run test-all -w append --concurrency 2n --time-limit 1000 --rate 1000 --test-count 5 --nemesis kill

This is basically the same test suite we used for the etcd testing back in late 2019--I've just updated it to the latest jetcd, bumped dependencies, and pointed it at 3.5.3 instead.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

Each node is started without a config file, using CLI flags like ``` etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster ```

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started |   n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 |      false |
| 4824313a421b2502 | started |   n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 |      false |
| 4d6e27d122507e9c | started |   n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 |      false |
| a1ffd5acd6a88a6a | started |   n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 |      false |
| afa39e55dee6dc2e | started |   n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 |      false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+

root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 |   3.5.3 |  4.2 MB |      true |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.103:2379 | 1153c9690d2b2284 |   3.5.3 |  4.2 MB |     false |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.104:2379 | 4d6e27d122507e9c |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.101:2379 | afa39e55dee6dc2e |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

Full logs are in the attached zip file.
@aphyr aphyr added the type/bug label Jun 6, 2022
@serathius
Copy link
Member

serathius commented Jun 7, 2022

From brief read this looks like bug in jetcd and not etcd. Please file it to jetcd repo or provide explanation what is the error in Etcd.

@aphyr
Copy link
Author

aphyr commented Jun 7, 2022

Could you tell me a bit more about why you suspect jetcd is to blame here, rather than etcd itself?

Here's a protocol-level view of the same kind of anomaly. We make a request to conditionally set key 693 to a new value ending in ... 29 iff its revision is still 7191:

Screenshot from 2022-06-07 09-51-12

Four milliseconds later, the server responded with an HTTP2 200 OK response, with grpc-status: 0, containing a mostly-blank Txn,Response message--notably without a succeeded field. I'm not sure, but since I can't find any examples of responses which have succeeded = 0, I suspect that the absence of the succeeded field entirely might be how etcd's protocol encodes a transaction guard failure.

Screenshot from 2022-06-07 10-10-20

Nevertheless, this write of 29 appeared in later reads, which suggests this is an aborted read.

                                    :f :txn,
                                    :value [[:r 697 [4 5 11 12 13 16 17 18 20 22 24 25 27 28 30]]
                                            [:r 693 [1 2 3 4 6 8 11 12 13 15 16 18 20 21 22 23 25 27 29]]
                                            [:r 698 [5 6 7 4 19 20 24 30]]
                                            [:append 697 9]],
                                    :time 28156744830,
                                    :process 8,
                                    :index 36577},

To confirm this, here's a transaction which definitely did fail--again, I believe, because its precondition guard failed to match. Again, HTTP 200, GRPC status 0, a Txn,response message without a succeeded field:

Screenshot from 2022-06-07 10-40-40

Compare this to what happens during a successful request: we get back a Txn,Response which does have a succeeded field, as well as kvs structures containing the results of that transaction:

Screenshot from 2022-06-07 10-13-47

Also compare this to how the protocol represents an exceptional response--for instance, here's a malformed request which caused etcd to return an HTTP 200 with gprc-status: 3 (INVALID_ARGUMENT) and a grpc-message of etcdserver: duplicate key given in txn request:

Screenshot from 2022-06-07 10-24-56

Here's a zip file with tcpdump captures of the client-server traffic on all nodes, if you'd like to see for yourself. The write causing an aborted read is in n5/tcpdump.pcap.

20220607T093354.000-0400.zip

@ahrtr
Copy link
Member

ahrtr commented Jun 12, 2022

response.isSucceeded() returned false--we assume that since it did not succeed, it must have aborted

It isn't correct. It only means whether the compare is evaluated to true or false. See rpc.pb.go#L1371-L1372

@aphyr
Copy link
Author

aphyr commented Jun 13, 2022

Begging your pardon, but are you saying that a transaction whose guard clause evaluates to false should go on to evaluate the transaction's success block? That seems to be at odds with both the transaction documentation and etcd's behavior in, well, pretty much every other response in these tests...

@ahrtr
Copy link
Member

ahrtr commented Jun 13, 2022

Two comments/points:

  1. The TxnResponse.Succeeded only means the result of compare (The If statement);
  2. When the compare is evaluated to false, the Failure block is processed. Otherwise, the Success block is processed.

@aphyr
Copy link
Author

aphyr commented Jun 13, 2022

OK, so... to repeat, we have multiple cases where a transaction's compare failed, but etcd apparently executed the success block anyway--or if it didn't "really" execute, other transactions are able to observe state which could only have arisen from executing the success block. You're saying this is not a bug?

@ahrtr
Copy link
Member

ahrtr commented Jun 13, 2022

If TxnResponse has the header (cluster_id, member_id, revision and raft_term), then it must contain the field succeeded, because its type is bool and it always has a value no matter a program explicitly assigns a value to it or not.

where a transaction's compare failed, but etcd apparently executed the success block anyway

I still am still not convinced. Could you show me the evidence? How did you know transaction's compare failed? Did you get the conclusion because you did not see the field succeeded ?

@aphyr
Copy link
Author

aphyr commented Jun 13, 2022

If TxnResponse has the header (cluster_id, member_id, revision and raft_term), then it must contain the field succeeded, because its type is bool and it always has a value no matter a program explicitly assigns a value to it or not.

Hmm. At the JVM level, in the client, this value is false. That's because it's not present in the protocol-buffer message, and the protocol buffer spec states that the default value for bools is false.

As for why there's no succeeded field in the protobuf-encoded messages... Ah, I've got some speculation there, but I don't think it's actually on-topic. The core problem remains: etcd returns responses without a succeeded field set to true even when they apparently did succeed.

Could you show me the evidence?

I've told you what the Java client is doing, provided full logs of test runs, given you a reproducible automated test suite, shown you Wireshark disassembly of the request and response, and provided pcap files for you to confirm yourself. I'm honestly at a loss: what other kind of evidence are you looking for?

@ahrtr
Copy link
Member

ahrtr commented Jun 14, 2022

OK, based on the proto3 doc, if a scalar message field is set to its default, the value will not be serialized on the wire. It could be the reason why you did not see the field "succeeded" in the TxnResponse.

We have both integration and e2e test cases for transaction, and I have never seen such issue so far. I suspect it's client side issue, including jetcd or the test case.

@stale
Copy link

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 21, 2022
@serathius
Copy link
Member

Closing as a jetcd issue. Feel free to reopen if you find evidence that issue is on etcd server side.

aphyr added a commit to jepsen-io/etcd that referenced this issue Oct 3, 2022
This client shells out to etcdctl and supports the list-append workload
only--I haven't implemented any other features. I'm only doing this
because the etcd team refuse to believe a bug I reported could actually
be in etcd; they insist it must be in jetcd, and the jetcd maintainers
haven't found anything, and <sigh> there goes my entire Sunday

etcd-io/etcd#14092 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants