Cannot restore snapshot on new cluster #78320
I think that we should add an option to ignore system indices, perhaps by default.
Pinging @elastic/es-distributed (Team:Distributed)
Pinging @elastic/es-data-management (Team:Data Management)
There are a couple of concerns here:
This restores the cluster state (cluster settings, etc.) as well as the system indices. The second way specifies the features¹ whose state should be overwritten. The features present in a cluster depend on the installed plugins and can be viewed with the Get Features API. We can tell Elasticsearch that only
I've tested both of these locally on a 7.15.0 cluster. While we have a workaround here, it's clear that the initial user experience isn't the best. I'm not sure how best to improve it without changing the behavior in such a way that it acts dangerously by default, but at a bare minimum we can improve the error message here.
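Assuming the repository and snapshot names from the reproduction steps later in this issue, the two workarounds described above might look roughly like this (a sketch against the 7.15 restore API; exact parameters may vary by version):

```
# Option 1: restore only regular indices, skipping dot-prefixed
# (system) indices and the global cluster state.
POST /_snapshot/backup/snapshot-2021.09.23/_restore
{
  "indices": "*,-.*",
  "include_global_state": false
}

# Option 2: restore the snapshot without any feature (system index) state.
POST /_snapshot/backup/snapshot-2021.09.23/_restore
{
  "feature_states": ["none"]
}
```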
I was restoring a snapshot that contained multiple indices. Indeed, I managed to restore the snapshot by manually specifying the indices, but this is a workaround.
While I agree that this would be ideal, there are things that make this difficult, if not impossible, to achieve without sacrificing other critical qualities, especially when the cluster(s) in question already have some amount of configuration in place. At snapshot creation time we want to default to capturing as much data as possible: not just user-created indices, but system-owned indices and global cluster state as well. That way, we can be sure that if the snapshot wasn't configured more precisely, we have whatever the user wanted to save. But when we go to restore that snapshot, we need to make sure that restoring it won't have any destructive effects on data already in the cluster unless Elasticsearch has explicitly been told by an administrator that that's okay. In order to make that happen without any arguments at all, we'd have to choose between two undesirable defaults.
Both options lead to obscure problems where the data in the system isn't what one would expect. If we encounter a situation where the only way forward is to drop data, we've found that it's best to raise an error and ask a human rather than guess at the best thing to do. Unfortunately, while this frequently averts disaster, it does mean that some of our APIs are picky. All of this to say: I don't necessarily disagree with you that the current situation isn't very user-friendly and should be improved, but this is a hard problem, and figuring out how to improve it is likely to be challenging. We could simply omit this system index, but this error will happen any time you try to restore a snapshot that contains an index already present in the restoring cluster, so that would be a band-aid fix for one very particular instance of this problem. A similar situation can occur with the
From a user perspective, I think the most sensible behavior would be to merge the existing data with the snapshot data (by default).
While that might seem intuitive, what's the intuitive behavior if those indices have document IDs that conflict? Do you take the one from the cluster or from the snapshot? Or do you merge them? If so, how does that logic work: do fields from the live index or the snapshot take priority? And what about all the applications out there today that can't handle a foreign process merging documents into an index they expect to have complete control over? Regardless of whether it would be intuitive, merging two indices is not something Elasticsearch is capable of at this time, or at any time in the near future. The post you link to is effectively reindexing both indices, which takes vastly more time and resources than restoring a snapshot. Making index merging more efficient would be both challenging (see above) and consume a lot of development resources that could be spent building something else. To bring this back around to the original issue: I think we can correct this behavior in the next major version to at least be a little more intuitive. For the 7.x series of releases, there's nothing we can do without breaking our backwards compatibility policy. But in 8.0 and later, I believe we can change the behavior on snapshot restoration to not include system indices unless they're explicitly requested.
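Under that proposed 8.x behavior, a restore would leave system indices alone unless the request opts in to a feature's state; a rough sketch, where the feature name "geoip" is purely illustrative:

```
POST /_snapshot/backup/snapshot-2021.09.23/_restore
{
  "feature_states": ["geoip"]
}
```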
As the user requested a snapshot restore, the snapshot takes priority, so overwrite. I'm guessing the case where only some fields differ in the same document is very rare, because docs should be immutable (add a new doc instead of updating the old one), so this should cover most use cases (a no-op when overwriting the same doc with the same fields and values).
You provide additional options to the restore command so that users have control over it. But the default behavior should be made to work for most use cases.
Given the proliferation of system indices over 6.x and 7.x, perhaps it's worth revisiting some of the longer-held defaults that snapshot and restore uses (which seem to be causing the issue here)?
Elasticsearch version (bin/elasticsearch --version): Version: 7.14.1, Build: default/docker/66b55ebfa59c92c15db3f69a335d500018b3331e/2021-08-26T09:01:05.390870785Z, JVM: 16.0.2
Plugins installed: []
JVM version (java -version): OpenJDK 64-Bit Server VM Temurin-16.0.2+7 (build 16.0.2+7, mixed mode, sharing)
OS version (uname -a if on a Unix-like system): Linux d3463a9ac7de 4.9.0-14-amd64 #1 SMP Debian 4.9.246-2 (2020-12-17) x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Trying to restore a snapshot on a new single-node cluster throws an error.
Steps to reproduce:
PUT /_snapshot/backup/%3Csnapshot-%7Bnow%2Fd%7D%3E
POST /_snapshot/backup/snapshot-2021.09.23/_restore
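For completeness, the steps above assume a filesystem snapshot repository named `backup` has already been registered, e.g. (the location path is illustrative):

```
PUT /_snapshot/backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}
```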
Provide logs (if relevant):
Discuss url: https://discuss.elastic.co/t/cannot-restore-snapshot-to-new-single-node-cluster/285025