A VM pause (due to GC, high IO load, etc) can cause the loss of inserted documents #10426
Thx @aphyr. In general I can give a quick answer to one point while we research the rest:

This is indeed the current plan.
We have made some effort to reproduce this failure. In general, we see GC pauses as just another disruption that can happen, the same way we view network issues and file corruption. If anyone is interested in the work we do there, the org.elasticsearch.test.disruption package and DiscoveryWithServiceDisruptionsTests are a good place to look.

In the Jepsen runs that failed for us, Jepsen created an index and then paused the JVM of the master node, to which the primary of one of the index shards happened to be allocated. At the time the JVM was paused, no other replica of this shard had fully initialized after the initial creation. Because the master's JVM was paused, the other nodes elected a new master, but that cluster had no copies left for that specific shard, leaving the cluster in a red state. When the node is unpaused it rejoins the cluster, but the shard is not re-allocated because we require a quorum of copies to assign a primary (in order to make sure we do not reuse a dated copy). As such the cluster stays red and all the data previously indexed into this shard is not available for searches. When we changed Jepsen to wait for all replicas to be assigned before starting the nemesis, the failure no longer happened. This change, and some other improvements, are part of this PR to Jepsen.

That said, because of the similar nature of a GC pause and an unresponsive network, there is still a small window in which documents can be lost; this is captured by #7572 and documented on the resiliency status page. @aphyr, can you confirm that the changes in the PR produce the same behavior for you?
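For readers who want to reproduce the "wait for all replicas to be assigned before disrupting" precondition outside of Jepsen, here is a minimal sketch that polls the standard _cluster/health endpoint with wait_for_status=green before any disruption is started. The host, port, and timeout are assumptions, and this is only an illustration of the idea, not the actual change made to the Jepsen test.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WaitForGreen {
    public static void main(String[] args) throws Exception {
        // wait_for_status=green asks Elasticsearch to hold the request until every shard
        // copy is assigned, or until the timeout elapses (assumed address: localhost:9200).
        URL url = new URL("http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        int status = conn.getResponseCode();
        InputStream stream = status < 400 ? conn.getInputStream() : conn.getErrorStream();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }

        // The JSON response carries a "timed_out" flag; a test harness should start its
        // disruption (GC pause, partition, etc.) only if green was reached in time.
        System.out.println(status + " " + body);
    }
}
```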
Thanks for this, @bleskes! I have been super busy with a few other issues, but this is the last one I have to clear before my talks. I'll take a look tomorrow morning. :)
I've merged your PR, and can confirm that ES still drops documents when a primary process is paused:

```
{:valid? false,
 :lost "#{1761}",
 :recovered
 "#{0 2..3 8 30 51 73 97 119 141 165 187 211 233 257 279 302 324 348 371 394 436 457 482 504 527 550 572 597 619 642 664 688 711 734 758 781 804 827 850 894 911 934 957 979 1003 1025 1049 1071 1092 1117 1138 1163 1185 1208 1230 1253 1277 1299 1342 1344 1350 1372 1415 1439 1462 1485 1508 1553 1576 1599 1623 1645 1667 1690 1714 1736 1779 1803 1825 1848 1871 1893 1917 1939 1964 1985 2010 2031 2054 2077 2100 2123 2146 2169 2192}",
 :ok "#{0..1344 1346..1392 1394..1530 1532..1760 1762..2203}",
 :recovered-frac 24/551,
 :unexpected-frac 0,
 :unexpected "#{}",
 :lost-frac 1/2204,
 :ok-frac 550/551}
```
@aphyr thanks for running it! I think the PR removes one cause of document loss, namely the index not being in a green state before the test starts (though not the only cause). I will keep running the test with additional logging to try to reproduce the failure you see.
Pinging @elastic/es-distributed
Any updates on this?
The issues found here were caused by problems in both the data replication subsystem and the cluster coordination subsystem, on which data replication also relies for correctness. All known issues in this area relating to this problem have since been fixed. As part of the sequence numbers effort, we've introduced primary terms that allow rejecting invalid requests from the logical past. With the new cluster coordination subsystem introduced in ES 7 (#32006), the remaining known coordination-level issues ("Repeated network partitions can cause cluster state updates to be lost") have been fixed as well.
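As a purely illustrative sketch (not Elasticsearch's actual implementation), the primary-term idea boils down to a replica remembering the highest term it has seen and refusing writes stamped with an older one:

```java
// Illustration only: a replica that tracks the highest primary term it has observed and
// rejects writes from a primary that was superseded while it was paused.
public class TermCheckedReplica {
    private long currentTerm = 0;

    // Called when the replica learns that a new primary was elected under a higher term.
    public synchronized void observeTerm(long term) {
        if (term > currentTerm) {
            currentTerm = term;
        }
    }

    // A write is applied only if it carries the current (or a newer) term; anything older
    // comes from the "logical past" and must not be acknowledged.
    public synchronized boolean applyWrite(long opTerm, String docId) {
        if (opTerm < currentTerm) {
            return false; // stale primary, e.g. one that just woke up from a long GC pause
        }
        observeTerm(opTerm);
        // ... index docId here ...
        return true;
    }
}
```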
Following up on #7572 and #10407, I've found that Elasticsearch will lose inserted documents even in the event of a node hiccup due to garbage collection, swapping, disk failure, IO panic, virtual machine pauses, VM migration, etc. https://gist.github.com/aphyr/b8c98e6149bc66a2d839 shows a log where we pause an Elasticsearch primary via SIGSTOP and SIGCONT. Even though no operations can take place against the suspended node during this time, and a new primary for the cluster comes to power, it looks like the old primary is still capable of acking inserts that are not replicated to the new primary, somewhere right before or right after the pause. The result is the loss of ~10% of acknowledged inserts.
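For reference, the pause itself can be simulated with plain POSIX signals; the sketch below is a hypothetical helper (not part of Jepsen or Elasticsearch) that freezes a JVM by PID with SIGSTOP and later resumes it with SIGCONT, which is essentially what the linked log shows.

```java
// Hypothetical helper: freeze and later resume a process, simulating a long GC/VM pause.
public class PauseProcess {
    public static void main(String[] args) throws Exception {
        String pid = args[0];                        // PID of the Elasticsearch primary's JVM
        long pauseMillis = Long.parseLong(args[1]);  // how long to keep it frozen

        // SIGSTOP: the node can no longer serve, replicate, or acknowledge anything,
        // but it keeps whatever cluster state it had, including "I am the primary".
        new ProcessBuilder("kill", "-STOP", pid).inheritIO().start().waitFor();
        Thread.sleep(pauseMillis);
        // SIGCONT: the node resumes and rejoins a cluster that has meanwhile elected
        // a new primary for its shard.
        new ProcessBuilder("kill", "-CONT", pid).inheritIO().start().waitFor();
    }
}
```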
You can replicate these results with Jepsen (commit e331ff3578) by running

```
lein test :only elasticsearch.core-test/create-pause
```

in the elasticsearch directory.

Looking through the Elasticsearch cluster state code (which I am by no means qualified to understand or evaluate), I get the... really vague, probably incorrect impression that Elasticsearch might make a couple of assumptions:
Are these at all correct? Have you considered looking into an epoch/term/generation scheme? If primaries are elected uniquely for a certain epoch, you can tag each operation with that epoch and use it to reject invalid requests from the logical past; invariants around advancing the epoch, in turn, can enforce the logical monotonicity of operations. It might make it easier to tamp down race conditions like this.