Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple repositories in get snapshot request #41799

Closed

Conversation

andrershov
Copy link
Contributor

@andrershov andrershov commented May 3, 2019

Description

This commit adds multiple repositories support to get snapshots request.
This is the continuation of work done here #15151.

_snapshot API

Now the following requests are supported:

GET /_snapshot/[_all | * | repo* | repo1,repo2]/[_all | * | snap* | snap1,snap2]

This commit adds repository name to each SnapshotInfo returned.
Previously, snapshots in the response were sorted by start time and then by snapshot id.
This commit sorts the snapshots first by the repository name, then start time and then by snapshot id.

When particular snapshot names are specified instead of * or _all and ignoreUnavailable=false, then all these snapshots should exist in all repositories mentioned. This might be undesirable, probably reviewers can propose something better. Currently, you can specify ignoreUnvailable=true, in this case, it's not a problem.

_cat/snapshots API

Cat snapshot API also supports multiple repositories now.

GET /_cat/snapshots/[_all | * | repo* | repo1,repo2]

I'm not sure if it's needed and this change can be reverted.

GET /_cat/snapshots

is also supported and will return all snapshots for all repositories.
Interestingly enough, the previous implementation also was registering /_cat/snapshots filter but a request to this endpoint always resulted in

{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: repository is missing;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: repository is missing;"
    },
    "status": 400
}

Implementation

Technically, TransportGetSnapshotsAction calls TransportGetRepositoriesAction which resolves repo names and wildcards in request to concrete repository names. After that, we just iterate over all returned repositories and are using the already implemented algorithm to get matching snapshots in the repositories.
In terms of the transport level protocol, 2 classes are changed:

  • GetSnapshotsRequest now accepts the list of repository names (including wildcards) instead of single repo name.
  • SnapshotInfo is extended to include repo name.

In terms of REST protocol, 2 classes are changed:

  • SnapshotRequestsConverters.getSnapshots now can understand the list of repository names in GetSnapshotRequest and adds it as comma separated list to the request path.
  • RestGetSnapshotsAction can now parse the comma-separated list of the repo names.
  • SnapshotInfo XContent serialization adds repository field.

RestSnapshotAction which is responsible for handling cat API for the snapshots is extended with repository field as well.

Testing

  • SharedClusterSnapshotRestoreIT.testGetSnapshotsMultipleRepos tests that multiple repositories and wildcard work fine on transport level.
  • SnapshotRequestConvertersTests.getSnapshots tests that high-level REST client correctly sends the request when multiple repositories are used in the request.
  • SnapshotIT.testGetSnapshots tests getting 2 snapshots from 2 different repositories using high-level REST client.

Documentation

Seems that it makes sense to make adjustments to the following documentation pages:

  • cat-snapshots
  • modules-snapshots

Both of them should talk about the ability to specify multiple repositories when querying the snapshots (TBD).

Examples

_snapshot API

Consider you have 2 repositories:

PUT _snapshot/repo1
PUT _snapshot/repo2

And you have one snapshot in each repository:

PUT _snapshot/repo1/snap1
PUT _snapshot/repo2/snap2

Now the following commands

GET _snapshot/[* | _all | repo*]/[* | _all | snap*]

will give you the same output

{
    "snapshots": [
        {
            "snapshot": "snap1",
            "uuid": "RsmIPKXtTCWr-oVTr_N4Qg",
            "repository": "repo1",
            "version_id": 8000099,
            "version": "8.0.0",
            "indices": [],
            "include_global_state": true,
            "state": "SUCCESS",
            "start_time": "2019-05-06T09:40:43.689Z",
            "start_time_in_millis": 1557135643689,
            "end_time": "2019-05-06T09:40:43.723Z",
            "end_time_in_millis": 1557135643723,
            "duration_in_millis": 34,
            "failures": [],
            "shards": {
                "total": 0,
                "failed": 0,
                "successful": 0
            }
        },
        {
            "snapshot": "snap2",
            "uuid": "g7lBKBr-QDejNV-ajHwqYg",
            "repository": "repo2",
            "version_id": 8000099,
            "version": "8.0.0",
            "indices": [],
            "include_global_state": true,
            "state": "SUCCESS",
            "start_time": "2019-05-06T09:40:55.093Z",
            "start_time_in_millis": 1557135655093,
            "end_time": "2019-05-06T09:40:55.125Z",
            "end_time_in_millis": 1557135655125,
            "duration_in_millis": 32,
            "failures": [],
            "shards": {
                "total": 0,
                "failed": 0,
                "successful": 0
            }
        }
    ]
}

Note the repository field in each serialized SnapshotInfoobject.

The following request

GET _snapshot/[* | _all | repo*]/snap1

results in

{
    "error": {
        "root_cause": [
            {
                "type": "snapshot_missing_exception",
                "reason": "[repo2:snap1] is missing"
            }
        ],
        "type": "snapshot_missing_exception",
        "reason": "[repo2:snap1] is missing"
    },
    "status": 404
}

because snap1 should exist in all repositories for this request to be succesful.
However, you can specify ignore_unavailable flag:

GET _snapshot/[* | _all | repo*]/snap1?ignore_unavailable=true

And get the following response:

{
    "snapshots": [
        {
            "snapshot": "snap1",
            "uuid": "RsmIPKXtTCWr-oVTr_N4Qg",
            "repository": "repo1",
            "version_id": 8000099,
            "version": "8.0.0",
            "indices": [],
            "include_global_state": true,
            "state": "SUCCESS",
            "start_time": "2019-05-06T09:40:43.689Z",
            "start_time_in_millis": 1557135643689,
            "end_time": "2019-05-06T09:40:43.723Z",
            "end_time_in_millis": 1557135643723,
            "duration_in_millis": 34,
            "failures": [],
            "shards": {
                "total": 0,
                "failed": 0,
                "successful": 0
            }
        }
    ]
}

This works, because repo2 is still checked for snap1 presence, but it's absense does not make the whole request failure.

Of course, you still can retrieve a particular snapshot from a particular repository as usual:

GET _snapshot/repo1/snap1

And get the following response back (same as before this commit, but enriched with repository field):

{
    "snapshots": [
        {
            "snapshot": "snap1",
            "uuid": "RsmIPKXtTCWr-oVTr_N4Qg",
            "repository": "repo1",
            "version_id": 8000099,
            "version": "8.0.0",
            "indices": [],
            "include_global_state": true,
            "state": "SUCCESS",
            "start_time": "2019-05-06T09:40:43.689Z",
            "start_time_in_millis": 1557135643689,
            "end_time": "2019-05-06T09:40:43.723Z",
            "end_time_in_millis": 1557135643723,
            "duration_in_millis": 34,
            "failures": [],
            "shards": {
                "total": 0,
                "failed": 0,
                "successful": 0
            }
        }
    ]
}

_cat/snapshot API

Consider you have 2 repositories:

PUT _snapshot/repo1
PUT _snapshot/repo2

And you have one snapshot in each repository:

PUT _snapshot/repo1/snap1
PUT _snapshot/repo2/snap2

Now you can get all the snapshots using any of this API calls:

GET _cat/snapshot
GET _cat/snapshot/_all
GET _cat/snapshot/*
GET _cat/snapshot/repo*
GET _cat/snapshot/repo1,repo2

The response will contain the repository name:

snap1 repo1 SUCCESS 1557147532 12:58:52 1557147532 12:58:52 29ms 0 0 0 0
snap2 repo2 SUCCESS 1557147526 12:58:46 1557147526 12:58:46 41ms 0 0 0 0

Closes #41210

@andrershov andrershov added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label May 3, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@@ -80,7 +85,11 @@ public GetSnapshotsRequest(StreamInput in) throws IOException {
@Override
public void writeTo(StreamOutput out) throws IOException {
super.writeTo(out);
out.writeString(repository);
if (out.getVersion().onOrAfter(Version.V_7_1_0)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should fail BwC tests shouldn't it? I think we gotta first set it to V8_0_0 and then back port and adjust the constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@original-brownbear thanks, I've fixed it in 5691315.
The sequence of BwC story would be the following:

  1. Merge this PR.
  2. Create separate PR to disable snapshot related BwC tests in master and change version to 7.2.0.
  3. Backport the change to 7.2.0.
  4. Enable tests.

@cjcenizal
Copy link
Contributor

This commit adds repository name to each SnapshotInfo returned.

🎉 Thank you for making this change! I assume the repository name will always be an array of strings, even when the snapshot only belongs to a single repository, correct?

When particular snapshot names are specified instead of * or _all and ignoreUnavailable=false, then all these snapshots should exist in all repositories mentioned. This might be undesirable, probably reviewers can propose something better.

From the UI standpoint, I don't see this being problematic, because AFAIK we're only requesting either all snapshots in all repositories or a single snapshot within a single repository. @jen-huang Can you think of any issues with this?

Currently, you can specify ignoreUnvailable=true, in this case, it's not a problem.

By not a problem, you mean that snapshot info is returned if the snapshots exist in any of the specified repositories, correct?

@jen-huang
Copy link

When particular snapshot names are specified instead of * or _all and ignoreUnavailable=false, then all these snapshots should exist in all repositories mentioned. This might be undesirable, probably reviewers can propose something better.

Confirming that this is not a problem for the UI since the only time we request a particular snapshot is when we already know which repository it belongs to.

@andrershov
Copy link
Contributor Author

@cjcenizal I've extended PR description with Example section, please let me know if I should clarify something else.

Andrey Ershov added 2 commits May 6, 2019 12:43
@andrershov andrershov added v7.2.0 and removed v7.1.0 labels May 6, 2019
Andrey Ershov added 2 commits May 6, 2019 13:34
Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't make that API change to finalize. Each repository must know its own name (its in the metadata field in the blob store repo for example).

@original-brownbear
Copy link
Member

@andrershov seems BwC tests are still failing.

@andrershov
Copy link
Contributor Author

run elasticsearch-ci/1

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment about the change to SnapshotInfo, I think what we actually should adjust here is GetSnapshotsResponse to only serialize what we return over the wire, not what is stored in the repository.

SnapshotId snapshotId = snapshotInfo.snapshotId();
allSnapshotIds.put(snapshotId.getName(), snapshotId);
currentSnapshots.add(snapshotInfo);
List<SnapshotInfo> snapshotInfos = new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: We could use ActionListener#completeWith here :)

@@ -54,6 +54,7 @@
private static final DateFormatter DATE_TIME_FORMATTER = DateFormatter.forPattern("strictDateOptionalTime");
private static final String SNAPSHOT = "snapshot";
private static final String UUID = "uuid";
private static final String REPOSITORY = "repository";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add this field to this class:
We're now serializing the repository name and writing it to the repository itself (SnapshotInfo is also written to the repository).
It seems to me we're only adding it here since we're also serializing SnapshotInfo in the new transport response => we should create some new class that has all the info SnapshotInfo contains + the repository name and serialize that over the wire?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@original-brownbear I don't see how adding repository field to SnapshotInfo is harmful.
Yes, I see that BlobStoreRepository writes SnapshotInfo to disk and reads it, but it does not seem to be a problem if the repository name is immutable.
I find it natural that SnapshotInfo, which has the following JavaDoc Information about a snapshot contains information about the repository name for which this snapshot was taken.
At the same time, I'm not a proponent of adding unnecessary classes unless we can come up with a good argument why current implementation won't work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It simply seems wrong and needless to go through all the steps of adding BwC, changing the format of what we serialize in the snapshot repository (which always introduces risk) etc. when we can resolve this trivially by simply changing the response serialization.
It would even be a smaller change.
=> I really don't see an argument for changing the serialization in the repo or over the wire when we already can get all the data we need inside the transport action. All we do with this change is store redundant data in the repo, add risk and BwC issues when we can have the same change without those downsides by simply fixing the response serialization.

Not adding a new class here to keep complexity down isn't really an argument either because the whole logic to figure out the state of things is so complex precisely because we are sharing SnapshotInfo across functionalities here (sorta related to #40943 (comment)).

@andrershov
Copy link
Contributor Author

@cjcenizal I'm a bit worried about the following requirement in the original issue "In addition to providing information about which repository a snapshot belongs to, the endpoint should also return a collection of any errors that repositories respond with, associated with the name of the repository."
Can you give an example of such errors and how do you handle them today when you have separate API call to get the list of all repositories and a separate API call to get the list of the snapshots?

@andrershov
Copy link
Contributor Author

andrershov commented May 7, 2019

@cjcenizal I think I need to extend response format like this:

{
  "snapshots" : [
    {
      ...
      "repository" : "repo1"
      ...
    },
    ...
   ],
   "failures" : [
     {
       "repository" : "repo2",
       "error" : {....}
     }
   ]
}

And semantics would be changed in such a way, that if some exception happens when retrieving the snapshots from one repository this repository will appear in "errors" section and snapshots from well-functioning repositories will appear in snapshots section. There might be at most one error for one repository.
If there is a higher level error (for example, some of the repositories in the request do not exist) OR if there is a single repository name in the request it will be reported like this:

{
   "error" : {
       ...
    }
}

This format is also carefully chosen not to break BwC.

If I were to design it from scratch I would probably do it differently:

{
   “repositories” : [
      {
        “repo” : “repo1”,
        “snapshots” : []
      },
      {
        “repo”: “repo2”,
         “error” : {}
       }
   ]
}

But this clearly breaks BwC and I don't think we should create new endpoint.

Let me know what do you think.

@cjcenizal
Copy link
Contributor

cjcenizal commented May 7, 2019

@andrershov Thank you for catching that requirement! I'd forgotten about that.

For context, the errors I've seen are generally those returned by misconfigured repositories when retrieving their snapshots, and so their shape and status code varies by repository type. Here's how we're currently collecting these errors. We request all snapshots of each repository, and sometimes this request results in an error. I've seen 400, 404, and 500 errors in these cases. In the UI we aren't actually surfacing these errors yet, but it's possible we will at some point.

I think the response format you shared will work well for us. In the case there's a single top-level error, I assume that will return a 4xx status?

@jen-huang Thoughts?

@andrershov
Copy link
Contributor Author

andrershov commented May 7, 2019

With this newly proposed format, I see the following issues:

  • Consider there is a request to the single repository made by the transport client and this request fails. For example, with SnapshotMissingException. For BwC reasons, we need to throw SnapshotMissingException instead of returning GetSnapshotResponse with a single failure.
  • Consider there is a request to the single repository made by the REST client and this request fails. For example, with SnapshotMissingException. For BwC reasons, we need to return the top-level error (actually this is done automatically because we throw SnasphotMissingException as shown above) instead of failures with the array of failed repositories.
  • _cat/snapshot API. If we decide to support multiple repositories in _cat/snapshot API how do we display partial request failure? With a single repository, there would be a top-level failure that will be rendered instead of the table. But if there are multiple repositories, what do we render? Options are: support only single repo, support multiple repositories, but ignore failures and render only successful repositories.

@original-brownbear what do you think?

@original-brownbear
Copy link
Member

@andrershov @cjcenizal
Sorry for a bit of a meta response here, but:

I'm wondering if we actually need:

In addition to providing information about which repository a snapshot belongs to, the endpoint should also return a collection of any errors that repositories respond with, associated with the name of the repository."

Can't we simply throw the first error we encounter and let it bubble up to the top level like we did before? (or an aggregate error for all the repos).

It seems kind of weird to me to collect the errors separately when you could argue that the request simply failed if any single repository failed since one wanted to search through all of them.

Otherwise this situation that @andrershov points out gets really weird:

Consider there is a request to the single repository made by the transport client and this request fails. For example, with SnapshotMissingException. For BwC reasons, we need to throw SnapshotMissingException instead of returning GetSnapshotResponse with a single failure.

I'd go even further than this. If we run this request (say for a single snaptshot) for two repositories, and we don't find the snapshot in repo A and get some random error from repo B. Wouldn't we just want to return the error from repo B since it supersedes the SnapshotNotFoundException (since we wouldn't know if it's not found since we couldn't check B)?

To me it seems that since (I agree with this statement 100%):

For context, the errors I've seen are generally those returned by missconfigured repositories

it might not make too much sense to be able to just keep going in the UI when you should be fixing your repo settings in the first place?


That said, if we think there's actually a use case for this. Why not just retain the current behavior for single snapshot + repo requests and list failures only when the request targets multiple repositories as well as multiple snapshots? That seems somewhat consistent to me.

@cjcenizal
Copy link
Contributor

cjcenizal commented May 7, 2019

It seems kind of weird to me to collect the errors separately when you could argue that the request simply failed if any single repository failed since one wanted to search through all of them.

The way the UI will use this request is to display a list of all available snapshots, so in this case the important part of the response is the available snapshots, even if the request is partially failed.

it might not make too much sense to be able to just keep going in the UI when you should be fixing your repo settings in the first place?

The UI has two separate lists, one for repositories and one for snapshots. Currently, we're only surfacing errors in the list of repositories, which uses a different endpoint than the one changed in this PR. The list of snapshots will use the endpoint changed in this PR and, ideally, it will show the user all of their available snapshots so they can delete them, inspect them, restore from them, etc, even if some repositories are misconfigured.

That being said, I don't think we currently are surfacing these errors in the UI, but I can see a future where we may want to. We can skip the errors requirement for now if that makes things easier.

@cjcenizal
Copy link
Contributor

Shoot, sorry, I just realized that we are consuming the errors in the UI. Sorry for the confusion. Here's how we are surfacing this information:

image

@andrershov
Copy link
Contributor Author

andrershov commented May 8, 2019

@original-brownbear this a bulk request and bulk requests could partially fail and it's perfectly normal to return errors for those parts of the request that have failed and results for the successful parts. For example, we're doing the same for multi-get request.
Probably I was unclear about describing the response format in case of errors. The example you give - we're requesting snap1 from repo1 and repo2 and repo1 returns SnapshotMissingException and repo2 returns some random error. In this case, the response will look like this:

{
  "snapshots" : [],
  "failures" : [
    {
      "repository" : "repo1",
      "error" : {snashot missing}
    },
    {
       "repository" : "repo2",
       "error" : {some random error}
     }
   ]
}

So we separately report errors for each repository. And we don't need to be BwC here, because our previous API only supported one repo and we need to be BwC for one repo case.
And this is exactly the problem because with the new response format I'd return:

{
  "snapshots" : [],
  "failures" : [
    {
      "repository" : "repo1",
      "error" : {snashot missing}
    }
   ]
}

And with the old response format we should return:

{
      "error" : {snapshot missing}
}

That's why I say that if there is a single repo in the request and it has failed, we return errors in the old format. But I'm not satisfied with it, because now new clients will contain special case handling logic for one repository case. The way people do BwC usually is to mix formats:

{
 "snapshots" : [],
  "failures" : [
    {
      "repository" : "repo1",
      "error" : {snashot missing}
    }
   ],
  "error" : {snapshot missing}
}

So that old clients can read "error" field as usual and new clients can read "failures" field.
So for REST, there is a solution (however, technically it might be tricky - do we add "error" field to GetSnapshotsResponse? And what HTTP code do we return - old client expects 404, new client expects 200 OK), but we also need to understand what to do with the transport client.
We can not simply return GetSnapshotsResponse with failures because old transport client knows nothing about failures and expects SnapshotMissingException instead.
I think there is no good solution to the problem and probably need to introduce new endpoint.

@original-brownbear
Copy link
Member

@andrershov

I think there is no good solution to the problem and probably need to introduce new endpoint.

+1 to the first part, it seems tricky to find a nice BwC compatible solution here (though I don't know enough about our compatibility story to be able to tell how BwC we need to be here, I guess we can "team-discuss" it later :)).

@andrershov
Copy link
Contributor Author

@cjcenizal we discussed it on the team meeting and we see 3 options:

  1. Add "fail_fast" flag to the request and set it to true by default. If set to true request will fail with one top error response. If explicitly set to false, new response format will be used.
  2. Add new endpoint for snapshots multi get.
  3. Make breaking changes in response format and do not backport it to 7.2.

We agreed that (3) is the simplest option and we will go with it. So this feature will be included in Elasticsearch release 8.0 and you should be able to use it starting with Kibana 8.0. For 7.x series, we propose you to continue sending requests to the repositories one-by-one.
WDYT?

@cjcenizal
Copy link
Contributor

@andrershov That sounds great, thank you! The options you considered sound comprehensive, and I think the decision to make a clean break at a major release is the right one. Our current solution on the Kibana side should be fine until we hit 8.0.

@andrershov
Copy link
Contributor Author

@cjcenizal @original-brownbear closing this in favour of #42090

@andrershov andrershov closed this May 10, 2019
pull bot pushed a commit to rogues-gallery/elasticsearch that referenced this pull request Jun 19, 2019
This commit adds multiple repositories support to get snapshots
request.
If some repository throws an exception this method does not fail fast
instead, it returns results for all repositories.
This PR is opened in favour of elastic#41799, because we decided to change
the response format in a non-BwC manner. It makes sense to read a
discussion of the aforementioned PR.
This is the continuation of work done here elastic#15151.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Snapshots endpoint for retrieving all snapshots
6 participants