Possible aborted reads with process crashes #14092
Comments
From a brief read, this looks like a bug in jetcd and not etcd. Please file it in the jetcd repo, or explain what the error in etcd is.
Could you tell me a bit more about why you suspect jetcd is to blame here, rather than etcd itself? Here's a protocol-level view of the same kind of anomaly. We make a request to conditionally set key 693 to a new value ending in ... 29 iff its revision is still 7191:

Four milliseconds later, the server responded with an HTTP2 200 OK response, with

Nevertheless, this write of 29 appeared in later reads, which suggests this is an aborted read.
To confirm this, here's a transaction which definitely did fail--again, I believe, because its precondition guard failed to match. Again, HTTP 200, GRPC status 0, a TxnResponse message without a

Compare this to what happens during a successful request: we get back a TxnResponse which does have a

Also compare this to how the protocol represents an exceptional response--for instance, here's a malformed request which caused etcd to return an HTTP 200 with

Here's a zip file with tcpdump captures of the client-server traffic on all nodes, if you'd like to see for yourself. The write causing an aborted read is in n5/tcpdump.pcap.
It isn't correct. It only means whether the
Begging your pardon, but are you saying that a transaction whose guard clause evaluates to
Two comments/points:
OK, so... to repeat, we have multiple cases where a transaction's compare failed, but etcd apparently executed the success block anyway--or if it didn't "really" execute, other transactions are able to observe state which could only have arisen from executing the success block. You're saying this is not a bug?
If
I am still not convinced. Could you show me the evidence? How did you know
Hmm. At the JVM level, in the client, this value is As for why there's no
I've told you what the Java client is doing, provided full logs of test runs, given you a reproducible automated test suite, shown you Wireshark disassembly of the request and response, and provided pcap files for you to confirm yourself. I'm honestly at a loss: what other kind of evidence are you looking for?
OK, based on the proto3 doc, We have both integration and e2e test cases for transactions, and I have never seen such an issue so far. I suspect it's a client-side issue, in either jetcd or the test case.
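For readers following the proto3 point, here is a minimal sketch of why a boolean field can be missing from a captured response while still reading as `false` in the client. It hand-encodes a single bool field the way proto3 does for fields without explicit presence: a default value (`false`) is simply not written to the wire. The helper names are hypothetical, and the field number 2 for `succeeded` is an assumption about etcd's `TxnResponse` definition, not something shown in this thread.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_bool_field(field_number: int, value: bool) -> bytes:
    """Encode a proto3 bool field; the default (False) is omitted entirely."""
    if not value:
        return b""
    tag = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(tag) + encode_varint(1)


def decode_bool_field(payload: bytes, field_number: int) -> bool:
    """Read a bool field from a payload of varint fields, defaulting to False."""
    value = False
    pos = 0
    while pos < len(payload):
        tag, pos = decode_varint(payload, pos)
        if tag & 0x7 != 0:
            raise ValueError("this sketch only handles varint fields")
        raw, pos = decode_varint(payload, pos)
        if tag >> 3 == field_number:
            value = bool(raw)
    return value


SUCCEEDED = 2  # assumed field number, for illustration only

true_bytes = encode_bool_field(SUCCEEDED, True)
false_bytes = encode_bool_field(SUCCEEDED, False)
```

Under these assumptions, a response whose guard failed carries no bytes at all for the field, so a packet dissector shows nothing and the client-side getter returns `false`. That explains the wire representation only; it says nothing about whether the success branch took effect on the server.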
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Closing as a jetcd issue. Feel free to reopen if you find evidence that the issue is on the etcd server side.
This client shells out to etcdctl and supports the list-append workload only--I haven't implemented any other features. I'm only doing this because the etcd team refuse to believe a bug I reported could actually be in etcd; they insist it must be in jetcd, and the jetcd maintainers haven't found anything, and <sigh> there goes my entire Sunday etcd-io/etcd#14092 (comment)
What happened?
With etcd 3.5.3 and jetcd 4.1.74.Final, I suspect that the effects of failed transactions can be visible to later readers and writers when processes are killed with `kill -9`. For example, take this pair of operations from one Jepsen run, in a five-node cluster running on Debian 11 LXC nodes:

The `:writer` transaction began by appending (via CaS) the element 14 to a list stored in key 4325. That transaction returned a TxnResponse to the client rather than throwing, but `response.isSucceeded()` returned `false`--we assume that since it did not succeed, it must have aborted. However, the `:op` transaction performed a read of key 4325 and observed element 14! If the writer did in fact abort, this op would constitute an aborted read.

Here are the full logs from this test run.
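The reasoning above can be sketched as a small history check. This is not Jepsen's checker; it is a simplified illustration with a hypothetical `Op` shape, and it assumes each appended element is unique per key, so a read that contains an element identifies the transaction that wrote it. The observed list in the example is illustrative, since the real read may have contained other elements too.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    """One completed transaction in a simplified history (hypothetical shape)."""
    outcome: str  # "ok", "fail" (definitely aborted), or "info" (indeterminate)
    appends: Tuple[Tuple[int, int], ...] = ()            # (key, element) pairs
    reads: Tuple[Tuple[int, Tuple[int, ...]], ...] = ()  # (key, observed list)


def aborted_reads(history: List[Op]) -> List[Tuple[int, int]]:
    """Return (key, element) pairs that a committed read observed even though
    the transaction that appended them was reported as failed."""
    failed = {
        (key, element)
        for op in history
        if op.outcome == "fail"
        for key, element in op.appends
    }
    found = []
    for op in history:
        if op.outcome != "ok":
            continue
        for key, observed in op.reads:
            for element in observed:
                if (key, element) in failed:
                    found.append((key, element))
    return found


# The pair of operations described above, reduced to this toy shape.
writer = Op("fail", appends=((4325, 14),))
reader = Op("ok", reads=((4325, (14,)),))
anomalies = aborted_reads([writer, reader])
```

Indeterminate (`"info"`) writes are deliberately ignored: only a write the client was told had failed can produce an aborted read under this definition.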
What did you expect to happen?
I expect that if a transaction does not succeed, and also does not throw an exception (as jetcd typically does for indefinite errors), it would not be visible to later readers.
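As a concrete statement of that expectation, here is a toy in-memory model of a guarded transaction. It is a sketch only: `ToyStore` and its methods are hypothetical, it models a single key guard on modification revision with an empty failure branch, and it does not reflect how etcd is implemented.

```python
class ToyStore:
    """A single-node, in-memory stand-in for a revisioned key-value store."""

    def __init__(self):
        self.revision = 0
        self.data = {}  # key -> (value, mod_revision)

    def mod_revision(self, key):
        """Revision of the last write to key, or 0 if it was never written."""
        return self.data.get(key, (None, 0))[1]

    def get(self, key):
        return self.data.get(key, (None, 0))[0]

    def put(self, key, value):
        self.revision += 1
        self.data[key] = (value, self.revision)

    def txn(self, key, expected_mod_revision, new_value):
        """If the key's mod revision matches, write new_value and return True.
        Otherwise change nothing and return False."""
        if self.mod_revision(key) != expected_mod_revision:
            return False
        self.put(key, new_value)
        return True


store = ToyStore()
store.put(4325, [1])                 # key now at mod revision 1
first = store.txn(4325, 1, [1, 2])   # guard matches: append 2
stale = store.txn(4325, 1, [1, 14])  # guard is stale: must have no effect
final_value = store.get(4325)
```

In this model a transaction that reports failure leaves the store untouched, so no later read can contain 14. The report is that the real system appeared to violate this property.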
How can we reproduce it (as minimally and precisely as possible)?
Clone https://github.com/jepsen-io/etcd at a1bf380a1c09d62bf6bf2e7b97bd02a35902ed36, and run:
This is basically the same test suite we used for the etcd testing back in late 2019--I've just updated it to the latest jetcd, bumped dependencies, and pointed it at 3.5.3 instead.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
Full logs are in the attached zip file.