Node crashes can cause data loss #10933
This may not be helpful - and certainly not a technical answer - but personally I've never expected ES to have ACID-like features. It's an indexer, not a db; to that end I never use it as a primary store, simply as an upstream service providing search over data I've ingested from elsewhere.
The thought process we had is that the most common deployments of ES, when it's introduced, are next to a database, a replay-able upstream, or typically a logging/metrics use case, where the default expected behavior of ES is to be semi-geared towards a faster insertion rate by not fsync'ing each operation (though one can configure ES to do so). I find it similar to the (very early) decision to have by default 2 copies of the data (1 replica), compared to 3 (2 replicas). Most common use cases of ES won't expect a 3x increase in storage. Years ago, even 2x was a pain to explain as a default, since most people didn't expect such a system to keep even a single additional copy of the data (and it was the cause of some blows thrown ES's way that it doesn't use Lucene correctly and that's why there is extra storage cost, yay). Also, as you mentioned, other systems don't all default to fsync on each write, or block the result until an fsync happened. (On the other hand, there are known issues that need to be fixed regardless of the default, like #7572.) @dakrone / @bleskes let's make sure we run these tests regardless and see whether anything else was uncovered here, and update this issue.
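(For reference, the replica default mentioned above can be raised per index; a minimal sketch, assuming a hypothetical index named `my_index`, with the caveat that each extra replica costs proportionally more storage.)

```
# Illustrative only: raise the default of 1 replica (2 copies of the data)
# to 2 replicas (3 copies) for a hypothetical index.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'
```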
Why? :) I personally think it's a valid discussion to have, and one worth reopening every once in a while to verify that the defaults chosen (sometimes at inception :) ) still make sense.
I don't think ES promises "ACID-like" features; on the other hand, they seem not to deliver on what they do promise.
It doesn't feel to me that we made a false promise? We didn't claim to have 0 data loss, and we are very open about what works and what doesn't on our resiliency status page: http://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
The full quote from the ES website is:
I guess that it's up to the reader to infer from the wording that
"minimize the chance of any data loss" doesn't sound like a 100% promise to me |
Everyone can choose what is to be highlighted, the following is semantically equivalent:
A question about this test: Does the data loss occur while the shards are being moved/replicated or a new master is being elected? If you replicate the shards across all active nodes before the test starts, does the data loss still occur?
It's clearly a problem to acknowledge writes which can go on to fail, regardless of whether you expect data loss (something I find comical) or not. Though I suppose the schedules required to guarantee something like that could prove problematic in some general cases. That said, even in a search indexing environment, would you want to lose track of even one document in your index? That sounds like a disaster, at least for the next person who has to find it. To jump on the lexical analysis bandwagon, putting data safety first would mean only acknowledging completed, synced writes; any other situation means that you're putting something other than data safety first, in this case probably write acknowledgement speed. That would be fine if we were frank about it, and even better if we had a strategy to avoid these silent failure cases, at least when we're loading up the database without any particular rush.
Thanks @aphyr. ACID is beside the point. If you lose 10% of important data that you'd like your users to be able to search for, that's a breakdown of the contract implicit in an acknowledged insert. If you're, say, an e-retailer, are you cool with 10% of your inventory (a random set at any given time) being unsearchable?
(I mean, keep in mind that the actual fraction lost is gonna depend on your failure schedule; jepsen tests are intentionally pathological haha)
@sdekock you may be surprised, but there are also circumstances where fsync outperforms directio. In general though, most people align fsync/directio/memmap sync in language even though they are different system calls.
FYI, the link to the translog docs is for version 1.3. Here are the Elasticsearch docs for version 1.5. The one difference seems to be that the transaction log flush threshold has increased.
Btw, I'd like to ask (as a total noob in this area): assuming the same setup as in the original post, is there any mechanism after crash recovery that notifies you about possible data loss? Does every crash (of one node, let's say) mean potential data loss? Is there any standard way to recover from the ES logs alone afterwards, or do I need to correlate app logs with the time of the crash? I know it's many questions - I said I am a noob :)
I get why fsync-on-commit isn't the default, and I am extremely naive about ES, but I don't get why the defaults are 512 MB and 5 seconds and not something smaller, like committing at least once per second.
What performance hit would ES take if it was changed to do it "the right way"? If the performance hit is small/acceptable, make it the default.
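(For anyone wanting to experiment with that trade-off: a minimal sketch of lowering the translog fsync interval, using the `index.translog.sync_interval` setting as it exists in 2.x. The index name is a placeholder, and depending on the version this setting may only be accepted at index-creation time.)

```
# Illustrative only: create a hypothetical index whose translog is fsynced
# every second instead of the 5-second default.
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "index.translog.sync_interval": "1s"
  }
}'
```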
Does the test replicate all shards to all data nodes, or only to a subset of nodes in the cluster (for example exactly the subset which will randomly crash)? In my understanding the index operation is synchronous and only returns with an OK if all shards received the document. Therefore, if there is no network partition, the whole cluster has to crash to lose documents this way if all nodes have a replica. Did I miss something?
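(A hedged illustration of the knob that assumption rests on: in 1.x the write consistency level is passed as a query parameter, but note it is a pre-write check that enough shard copies are active, not a post-write confirmation that every replica durably applied the operation. The index, type and id below are placeholders.)

```
# Illustrative 1.x-style request: require all shard copies to be active
# before the write proceeds (the default consistency level is "quorum").
curl -XPUT 'localhost:9200/my_index/my_type/1?consistency=all' -d '{
  "field": "value"
}'
```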
This commit makes create, update and delete operations on an index durable by default. The user has the option to opt out and use async translog flushes on a per-request basis by setting `durable=false` on the REST request. Initial benchmarks running on SSDs have shown that bulk indexing is about 7% - 10% slower compared to async translog flushes. This change is orthogonal to the transaction log sync interval and will only sync the transaction log if the operation has not yet been concurrently synced. I.e. if multiple indexing requests are submitted and one operation's sync call already persists the operations of the others, only one sync call is executed. Relates to elastic#10933
This commit makes create, update and delete operations on an index durable by default. The user has the option to opt out and use async translog flushes on a per-index basis by setting `index.translog.durability=async`. Initial benchmarks running on SSDs have shown that bulk indexing is about 7% - 10% slower compared to async translog flushes. This change is orthogonal to the transaction log sync interval and will only sync the transaction log if the operation has not yet been concurrently synced. I.e. if multiple indexing requests are submitted and one operation's sync call already persists the operations of the others, only one sync call is executed. Relates to elastic#10933
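(For readers landing here later, a minimal sketch of flipping that per-index setting; the index name is a placeholder. As described in the commit message above, `request` fsyncs the translog on every operation, while `async` restores the old interval-based behaviour.)

```
# Illustrative only: opt a hypothetical index out of per-request translog fsyncs.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.translog.durability": "async"
}'
```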
Today we almost intentionally corrupt the translog if we lose a node due to power loss or similar disasters. In the translog reading code we simply read until we hit an EOF exception, ignoring the rest of the translog file once it is hit. There is no information stored about how many records we are expecting or what the last written offset was. This commit restructures the translog to add checkpoints that are written with every sync operation, recording the number of synced operations as well as the last synced offset. These checkpoints are also used to identify the actual transaction log file to open instead of relying on directory traversal. This change adds a significant amount of additional checks and pickiness to the translog code. For instance, the translog is now associated with a specific engine via a UUID that is written to each translog file as part of its header. If an engine opens a translog file it is not associated with, the operation will fail. Closes elastic#10933 Relates to elastic#11011
If Elasticsearch is going to have random data loss which cannot be detected, that poses huge doubt on Elasticsearch as a viable technology for any major business use case. Even if we have a primary store and replicate data into Elasticsearch, not having a way to know that there is data loss would still cause major concerns, as we would never get to know when to reindex the data. Could more details about the tests be shared? Are we talking about an exceptional stress-test scenario, or should we take it that there is not really any guarantee around retaining the most recent 5 seconds of data? Also, this document seems to be saying otherwise: https://www.elastic.co/guide/en/elasticsearch/guide/master/translog.html#img-xlog-pre-refresh

> The purpose of the translog is to ensure that operations are not lost. This begs the question: how safe is the translog? Writes to a file will not survive a reboot until the file has been fsync'ed to disk. By default, the translog is fsync'ed every 5 seconds. Potentially, we could lose 5 seconds worth of data—if the translog were the only mechanism that we had for dealing with failure. Fortunately, the translog is only part of a much bigger system. Remember that an indexing request is considered successful only after it has completed on both the primary shard and all replica shards. Even if the node holding the primary shard were to suffer catastrophic failure, it would be unlikely to affect the nodes holding the replica shards at the same time. While we could force the translog to fsync more frequently (at the cost of indexing performance), it is unlikely to provide more reliability.

If we are willing to take a performance loss, what settings need to be tweaked to fsync each operation?
@yogirackspace Check the linked PR above, which addresses exactly your problem. It's labelled for Elasticsearch 2.0.
@mycrEEpy Sorry, the link is not obvious. Could you repost the link? In case you are referring to the link I gave: as per that document, it is part of 1.4.0 https://www.elastic.co/guide/en/elasticsearch/guide/master/_elasticsearch_version.html
@yogirackspace It's this PR #11011
Thanks @mycrEEpy. Could it also be confirmed that, on a real setup, if the node holding the primary shard were to suffer catastrophic failure and get restarted, and the replicas don't go down, the data would not be lost? Just want to be clear about the data loss occurrence pattern.
@yogirackspace re:
This is correct. An indexing request goes through the following process:
So as long as the replicas remain alive, the change will be persisted on the replica.
Hey @aphyr, I decided not to hate you for this but instead to overhaul a bit how our translog works as well as how we use it. Apparently there are different expectations around the durability aspects of elasticsearch, as well as unclear understanding of what an async commit / fsync means in terms of durability guarantees. Long story short, here are the two problems that cause the data loss:
Both fixes are only in master and targeted for |
I think that what I'd like to see is ES only return OK when things are indeed OK. If things are not OK then allow that to push back to the client. I'm left wondering if part of this issue is related to ES accepting requests when it is not in a position to process them. WDYT?
Let's say you have configured an index to have 2 replicas per shard. The write

However, let's say that the document is indexed on the primary, then the only live replica dies before it can be indexed on the replica. In 1.x, the indexing request is returned as 200 OK (even though potentially the write on the primary could be lost if e.g. the primary is isolated). Any further indexing requests would time out until at least one replica shard is available again.

What has changed in 2.x is that indexing requests now return the number of shards that successfully indexed the document (and how many failed), which at least gives you the information to decide whether you want the write to be retried or not.

Note: the default number of replicas per shard is 1, and the quorum requirement is ignored if only 1 replica is configured. Having just two copies of your data (i.e. the primary and the replica) is sufficient for most use cases, but to make things safer you need at least three (i.e. two replicas).
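(To make the 2.x behaviour concrete, a sketch of the kind of response shape being described; index, type, id and the numbers are illustrative, assuming an index with 2 replicas so that 3 copies are expected in total. Fewer successful copies than the total is the signal that a retry may be worthwhile.)

```
# Illustrative 2.x index request against a hypothetical index.
curl -XPUT 'localhost:9200/my_index/my_type/1' -d '{"field": "value"}'
# Example response shape (values illustrative):
# {"_index":"my_index","_type":"my_type","_id":"1","_version":1,"created":true,
#  "_shards":{"total":3,"successful":2,"failed":0}}
```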
I'm still experiencing lost documents when restarting nodes in 2.0.0-beta1.
On version 1.5.2 I also tried running with

Test code: https://gist.github.com/henrikno/e0ebd6804cb62491343c

Might there be other things than fsyncing that cause this issue?
@henrikno Thank you for bringing this issue to our attention. We approach these matters with the utmost seriousness. We are currently taking steps to reproduce, diagnose and resolve the issue that you report. If you have any additional information that you think will be helpful, please send it our way.
@henrikno another quick question to help us go in the same direction you did. How long did you wait after starting a node and before killing the next one? Was the cluster fully recovered from the previous restart? (i.e., green)
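(For reference, one way to wait for full recovery between restarts is to block on cluster health; a minimal sketch, with the timeout value chosen arbitrarily.)

```
# Illustrative: returns when the cluster reaches green status, or after 60s.
curl -XGET 'localhost:9200/_cluster/health?wait_for_status=green&timeout=60s'
```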
@jasontedor: I first assumed that with WriteConsistencyLevel.ALL I would need to check

@bleskes: I did test both (wait on green, wait on join) on 1.5.2 and lost documents with both. With 2.0.0-beta1 I can only remember having reproduced it by only waiting for the node to join (i.e. red cluster), and not checking successful != total. I do want to test more but I have limited time this week. If checking successful == total is a recommended solution for achieving "write consistency all" then that's good, but it should be easier/better documented.
@henrikno thanks for the extra info. At the moment we are still trying to understand what you exactly did - so we can try to reproduce. Can you describe the procedure? I'm looking for something like:
You mention your cluster was red - which is interesting. At what point did it get red? (When one node is down, it should be yellow, since you have 1 replica in your index.)
I've argued this elsewhere, the |
@shikhar we are still trying to figure out what exactly happened here. I'm all for discussing WriteConsistencyLevel but let's please do it on the other ticket. This ticket is already complex.
You're gonna hate me for this one. I apologize in advance.
On Elasticsearch 1.5.0 (Jepsen ecab97547123a0c88bb39ddd3ba3db873dccf251),
`lein test :only elasticsearch.core-test/create-crash`
does not affect the network in any way, but instead kills a randomly selected subset of elasticsearch nodes and immediately restarts them, waits for all nodes to recover, then kills a new subset, and so on. As usual, we wait for all nodes to restart and for the cluster status to return green, plus some extra buffer time, before issuing a final read. This failure pattern induces the loss of inserted documents throughout the test: in one particular case, 10% of acknowledged inserts did not appear in the final set.

Is this actually a bug? It's not entirely clear to me what kind of crash schedules Elasticsearch should actually tolerate, and the docs seem pretty thin. https://www.elastic.co/products/elasticsearch says
But http://www.elastic.co/guide/en/elasticsearch/reference/1.3/index-modules-translog.html suggests that ES only fsyncs the transaction log every five seconds, so maybe there aren't supposed to be any guarantees around retaining the most recent 5 seconds of data? In that case, why advertise transaction logs as a persistence feature?
Maybe this is super naive of me, but I kinda envisioned "putting data safety first" meaning, well, flushing the transaction log before acknowledging a write to the client. That's the Postgres default, and MySQL's default appears to be write and flush the transaction log for every commit as well. Zookeeper blocks writes for fsync, which makes sense, but Riak's leveldb backend and bitcask backend default to not syncing at all, and Cassandra fsyncs on a schedule rather than blocking writes as well.
Thoughts?