_version does not uniquely identify a particular version of a document #19269
Comments
@seut @aphyr the issue you're observing is due to dirty reads (which can happen in all current ES versions). ES does not provide any stronger read guarantees at the moment. I've quickly hacked an integration test to illustrate why dirty reads are at play here (https://gist.github.com/ywelsch/8a5334cd59d922f5c48074fec578e71c). Note that we have made some major improvements in ES v5.0.0 to ensure that all replicas have the same data once the cluster is healed and that we don't lose acknowledged writes. We have also ported the published Jepsen scenarios to our testing infrastructure (successfully passing). To verify that we are properly modeling the original Jepsen tests, we are also spending some effort to update the original tests so that they compile against current ES versions. While we’re constantly improving the resiliency of the system, we are also spending some effort on documenting the above read/write guarantees and illustrating them with test cases under simulated conditions (see section “Documentation of guarantees” in our resiliency docs: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html).
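To make the dirty-read explanation concrete, here is a deliberately simplified toy model. It is not Elasticsearch code and not the linked integration test. The `Primary` class and the scenario are assumptions for illustration only: an isolated old primary applies a write locally and serves a read before it notices it has been cut off, while a newly promoted primary accepts a different write. Both reads report the same `_version` for different contents.

```python
from dataclasses import dataclass


@dataclass
class Primary:
    """Hypothetical toy primary shard copy holding a single document."""
    name: str
    value: str = "a"
    version: int = 1

    def index(self, value):
        # Each primary bumps the version locally when it applies a write.
        self.value = value
        self.version += 1
        return self.version

    def get(self):
        return {"_version": self.version, "value": self.value, "served_by": self.name}


# Before the partition both copies agree: value "a" at version 1.
old_primary = Primary("old-primary")
new_primary = Primary("new-primary")

# The old primary is isolated but has not noticed yet. It applies a write
# locally and serves a read before replication fails (a dirty read).
old_primary.index("b")
dirty = old_primary.get()

# The other side of the partition promotes a new primary, which accepts
# a different write for the same document.
new_primary.index("c")
clean = new_primary.get()

print(dirty)
print(clean)
```

In this sketch a client that only compares `_version` cannot tell the two reads apart, which matches the symptom in the issue title.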
If your Jepsen tests are passing, you... may want to revisit them. One of the original Jepsen tests was for linearizable operations, and since ES is clearly not linearizable, your version of the tests probably shouldn't pass.
We have switched optimistic concurrency control (OCC) from the |
Which Elasticsearch version introduced the switch to _seq_no and _primary_term? We are planning to use 6.7.
@vptech20nn ES 6.6 already provides optimistic concurrency control based on the |
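As a rough illustration of compare-and-set on the `_seq_no` and `_primary_term` pair named in the question above, the sketch below uses an in-memory stand-in. `ToyShard` and `VersionConflict` are hypothetical names, not Elasticsearch classes; the `if_seq_no` and `if_primary_term` argument names mirror the request parameters Elasticsearch accepts on index requests from 6.6 onward, but the checking logic here is only an assumption about the general idea.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected (seq_no, primary_term) is stale."""


class ToyShard:
    """Hypothetical in-memory stand-in for a primary shard."""

    def __init__(self, primary_term=1):
        self.primary_term = primary_term
        self.next_seq_no = 0
        self.docs = {}

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None or if_primary_term is not None:
            # Conditional write: succeed only if the stored document still
            # carries exactly the pair the caller read earlier.
            expected = (if_seq_no, if_primary_term)
            if current is None or (current["_seq_no"], current["_primary_term"]) != expected:
                raise VersionConflict(doc_id)
        entry = {
            "_seq_no": self.next_seq_no,
            "_primary_term": self.primary_term,
            "_source": source,
        }
        self.next_seq_no += 1
        self.docs[doc_id] = entry
        return entry


shard = ToyShard()
first = shard.index("1", {"n": 0})

# A write conditioned on the pair we just read succeeds.
second = shard.index("1", {"n": 1},
                     if_seq_no=first["_seq_no"],
                     if_primary_term=first["_primary_term"])

# A second writer still holding the old pair is rejected.
conflict_raised = False
try:
    shard.index("1", {"n": 99},
                if_seq_no=first["_seq_no"],
                if_primary_term=first["_primary_term"])
except VersionConflict:
    conflict_raised = True

print(second, conflict_raised)
```

Because a newly promoted primary operates under a higher primary term, the pair distinguishes writes that a bare counter such as `_version` would conflate.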
@aphyr recently discovered this resilience issue [https://github.com/crate/crate/issues/3711] while running the Jepsen test suite against Crate.
After I created an integration test (based on current ES master) [https://github.com/crate/elasticsearch/commit/41ed5ebe7304710fda4de4e69479e17081042c38] out of the relevant jepsen code using your nice network partition simulation helper, I was able to reproduce this error not only using Crate but also using plain Elasticsearch.
I've reproduced this issue on ES 2.3, 5.0-alpha3 & master.
The longer the test runs, the more often it fails. With the current default runtime of 180 seconds it fails almost always on my machine (the relevant Jepsen test runs for 360 seconds).
Currently I have no real idea why this is happening. My guess is that some reads are returning a stale version value, but I have not yet figured out how or why.
I've also run this scenario on a single node with one shard, because my first guess was that this might not be related to network partitions, but that test never failed.
I've read the current ES resilience issues and I couldn't see anything which could be related to this issue, but I'm also not completely sure.