
Get snapshots support for multiple repositories #42090

Merged
merged 40 commits on Jun 19, 2019

Conversation

andrershov
Contributor

Description

This commit adds support for multiple repositories to the get snapshots request.
If some repository throws an exception, this method does not fail fast; instead, it returns results for all repositories.
This PR is opened in favour of #41799, because we decided to change the response format in a non-BwC manner; it makes sense to read the discussion of that PR.
This is a continuation of the work done in #15151.

_snapshot API

Now the following requests are supported:

GET /_snapshot/[_all | * | repo* | repo1,repo2]/[_all | * | snap* | snap1,snap2]

This commit breaks BwC of the response format for this request, which is why this PR is not backported to 7.x. For the response format, see the examples section.

_cat/snapshots API

Cat snapshot API also supports multiple repositories now.

GET /_cat/snapshots/[_all | * | repo* | repo1,repo2]

A new column named "repository" is added to the returned table. If some repository fails with an exception, this method just reports that some repositories have failed. See the examples section.

GET /_cat/snapshots

is also supported and will return all snapshots for all repositories.
Interestingly, the previous implementation also registered the /_cat/snapshots route, but a request to this endpoint always resulted in

{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: repository is missing;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: repository is missing;"
    },
    "status": 400
}

Implementation

Technically, TransportGetSnapshotsAction calls TransportGetRepositoriesAction, which resolves the repository names and wildcards in the request to concrete repository names.
After that, we submit tasks in parallel for all returned repositories and use the already implemented algorithm to find the matching snapshots in each repository; a sketch of this fan-out appears below.
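For illustration, here is a self-contained sketch of this fan-out/collect pattern. It uses a plain ExecutorService instead of the SNAPSHOT thread pool, and all names (FanOutSketch, fetchSnapshots) are hypothetical stand-ins rather than the PR's actual code:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOutSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<String> repos = List.of("repo1", "repo2"); // already resolved repository names

        // Submit one task per resolved repository, in parallel.
        Map<String, Future<List<String>>> futures = new LinkedHashMap<>();
        for (String repo : repos) {
            futures.put(repo, executor.submit(() -> fetchSnapshots(repo)));
        }

        // Collect per-repository results; a failing repository does not fail the whole request.
        Map<String, List<String>> successful = new HashMap<>();
        Map<String, Throwable> failed = new HashMap<>();
        for (Map.Entry<String, Future<List<String>>> entry : futures.entrySet()) {
            try {
                successful.put(entry.getKey(), entry.getValue().get());
            } catch (ExecutionException e) {
                failed.put(entry.getKey(), e.getCause());
            }
        }
        System.out.println("successful=" + successful + " failed=" + failed.keySet());
        executor.shutdown();
    }

    // Stand-in for the per-repository snapshot listing.
    private static List<String> fetchSnapshots(String repo) {
        if ("repo2".equals(repo)) {
            throw new IllegalStateException("repository [" + repo + "] is broken");
        }
        return List.of(repo + ":snap1");
    }
}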
In terms of the transport-level protocol, two classes are changed:

  • GetSnapshotsRequest now accepts a list of repository names (including wildcards) instead of a single repository name.
  • GetSnapshotsResponse is changed in a non-BwC manner.

In terms of the REST protocol, three classes are changed:

  • SnapshotRequestsConverters.getSnapshots can now handle a list of repository names in GetSnapshotsRequest and adds it as a comma-separated list to the request path.
  • RestGetSnapshotsAction can now parse a comma-separated list of repository names.
  • GetSnapshotsResponse XContent serialization is completely changed.

RestSnapshotAction, which is responsible for handling the cat API for snapshots, is extended with a repository field as well (sketched below).
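Sketched (the column descriptors are illustrative; the PR's exact aliases may differ), the cat table extension looks roughly like this:

// In RestSnapshotAction.getTableWithHeader(...) -- illustrative, not the exact diff.
Table table = new Table();
table.startHeaders();
table.addCell("id", "alias:id,snapshot;desc:unique snapshot id");
table.addCell("repository", "alias:re,repo;desc:repository name"); // new column
table.addCell("status", "alias:s,status;desc:snapshot status");
table.endHeaders();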

Testing

  • SharedClusterSnapshotRestoreIT.testGetSnapshotsMultipleRepos tests that multiple repositories and wildcards work fine at the transport level.
  • SnapshotRequestConvertersTests.getSnapshots tests that the high-level REST client correctly sends the request when multiple repositories are used.
  • SnapshotIT.testGetSnapshots tests getting 2 snapshots from 2 different repositories using the high-level REST client.
  • Some BwC tests, because we need compatibility between 7.latest and 8.x (TBD).

Documentation

It makes sense to adjust the following documentation pages:

  • cat-snapshots
  • modules-snapshots

Both of them should describe the ability to specify multiple repositories when querying snapshots (TBD).

Since this is a breaking change, the breaking changes doc should also be extended (TBD).

Examples

Working with GetSnapshotsResponse

Suppose you have a response.
First of all, you can check whether there are any failures with response.isFailed().
If you used a wildcard/regex for the repository names, you can use response.getRepositories() to get the set of repositories in the response.
If you know a particular repository name, you can just call response.getSnapshots(repoName); it returns the list of SnapshotInfo for that repository, or throws ElasticsearchException if the repository failed. So all callers that previously used response.getSnapshots() can achieve the same behaviour by just passing the repoName.
Finally, you can get maps of the successful and failed responses using getSuccessfulResponses and getFailedResponses. A usage sketch follows.
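A hedged usage sketch of these accessors (the request and client plumbing here is assumed for illustration, including the GetSnapshotsRequest constructor shape):

// Assumes the high-level REST client; method names follow the description above.
GetSnapshotsRequest request = new GetSnapshotsRequest(
    new String[]{"repo*"}, new String[]{"_all"}); // hypothetical constructor shape
GetSnapshotsResponse response = client.snapshot().get(request, RequestOptions.DEFAULT);

if (response.isFailed()) {
    // Some repositories failed; the request as a whole still succeeded.
    response.getFailedResponses().forEach(
        (repo, e) -> System.err.println(repo + " failed: " + e.getMessage()));
}

// Per-repository access: throws ElasticsearchException if that repository failed.
List<SnapshotInfo> repo1Snapshots = response.getSnapshots("repo1");

// Or iterate over only the repositories that succeeded.
response.getSuccessfulResponses().forEach(
    (repo, snapshots) -> System.out.println(repo + ": " + snapshots.size() + " snapshot(s)"));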

_snapshot API

Consider you have 2 repositories:

PUT _snapshot/repo1
PUT _snapshot/repo2

And you have one snapshot in each repository:

PUT _snapshot/repo1/snap1
PUT _snapshot/repo2/snap2

Now the following commands

GET _snapshot/[* | _all | repo*]/[* | _all | snap*]

will give you the same output

{
    "responses": [
        {
            "repository": "repo2",
            "snapshots": [
                {
                    "snapshot": "snap2",
                    "uuid": "XgGZx_QjRc6J5YkbbDWsdQ",
                    "version_id": 8000099,
                    "version": "8.0.0",
                    "indices": [],
                    "include_global_state": true,
                    "state": "SUCCESS",
                    "start_time": "2019-05-10T17:02:02.695Z",
                    "start_time_in_millis": 1557507722695,
                    "end_time": "2019-05-10T17:02:02.729Z",
                    "end_time_in_millis": 1557507722729,
                    "duration_in_millis": 34,
                    "failures": [],
                    "shards": {
                        "total": 0,
                        "failed": 0,
                        "successful": 0
                    }
                }
            ]
        },
        {
            "repository": "repo1",
            "snapshots": [
                {
                    "snapshot": "snap1",
                    "uuid": "cEzdqUKxQ5G6MyrJAcYwmA",
                    "version_id": 8000099,
                    "version": "8.0.0",
                    "indices": [],
                    "include_global_state": true,
                    "state": "SUCCESS",
                    "start_time": "2019-05-10T17:01:57.868Z",
                    "start_time_in_millis": 1557507717868,
                    "end_time": "2019-05-10T17:01:57.909Z",
                    "end_time_in_millis": 1557507717909,
                    "duration_in_millis": 41,
                    "failures": [],
                    "shards": {
                        "total": 0,
                        "failed": 0,
                        "successful": 0
                    }
                }
            ]
        }
    ]
}

Responses are currently not sorted by repository name. However, the SnapshotInfos within each repository are still sorted by startDate and then by snapshotId.

The following request

GET _snapshot/[* | _all | repo*]/snap1

results in

{
    "responses": [
        {
            "repository": "repo1",
            "snapshots": [
                {
                    "snapshot": "snap1",
                    "uuid": "cEzdqUKxQ5G6MyrJAcYwmA",
                    "version_id": 8000099,
                    "version": "8.0.0",
                    "indices": [],
                    "include_global_state": true,
                    "state": "SUCCESS",
                    "start_time": "2019-05-10T17:01:57.868Z",
                    "start_time_in_millis": 1557507717868,
                    "end_time": "2019-05-10T17:01:57.909Z",
                    "end_time_in_millis": 1557507717909,
                    "duration_in_millis": 41,
                    "failures": [],
                    "shards": {
                        "total": 0,
                        "failed": 0,
                        "successful": 0
                    }
                }
            ]
        },
        {
            "repository": "repo2",
            "error": {
                "root_cause": [
                    {
                        "type": "snapshot_missing_exception",
                        "reason": "[repo2:snap1] is missing"
                    }
                ],
                "type": "snapshot_missing_exception",
                "reason": "[repo2:snap1] is missing"
            }
        }
    ]
}

because snap1 exists in repo1 but does not exist in repo2.
This is an example of a partial failure.
Note that the HTTP status code is 200 OK despite the partial failure.
Even with a full failure, the status code will be 200 OK.
Currently, successful repositories are always reported first.
If you specify the ignore_unavailable flag:

GET _snapshot/[* | _all | repo*]/snap1?ignore_unavailable=true

you get the following response:

{
    "responses": [
        {
            "repository": "repo2",
            "snapshots": []
        },
        {
            "repository": "repo1",
            "snapshots": [
                {
                    "snapshot": "snap1",
                    "uuid": "cEzdqUKxQ5G6MyrJAcYwmA",
                    "version_id": 8000099,
                    "version": "8.0.0",
                    "indices": [],
                    "include_global_state": true,
                    "state": "SUCCESS",
                    "start_time": "2019-05-10T17:01:57.868Z",
                    "start_time_in_millis": 1557507717868,
                    "end_time": "2019-05-10T17:01:57.909Z",
                    "end_time_in_millis": 1557507717909,
                    "duration_in_millis": 41,
                    "failures": [],
                    "shards": {
                        "total": 0,
                        "failed": 0,
                        "successful": 0
                    }
                }
            ]
        }
    ]
}

If you specify a missing repository in the request

GET _snapshot/repo1,repo2,repo3/snap1

you get a top-level error:

{
    "error": {
        "root_cause": [
            {
                "type": "repository_missing_exception",
                "reason": "[repo3] missing"
            }
        ],
        "type": "repository_missing_exception",
        "reason": "[repo3] missing"
    },
    "status": 404
}

_cat/snapshots API

Consider you have 2 repositories:

PUT _snapshot/repo1
PUT _snapshot/repo2

And you have one snapshot in each repository:

PUT _snapshot/repo1/snap1
PUT _snapshot/repo2/snap2

Now you can get all the snapshots using any of these API calls:

GET _cat/snapshots
GET _cat/snapshots/_all
GET _cat/snapshots/*
GET _cat/snapshots/repo*
GET _cat/snapshots/repo1,repo2

The response will contain the repository name:

snap2 repo2 SUCCESS 1557507722 17:02:02 1557507722 17:02:02 34ms 0 0 0 0
snap1 repo1 SUCCESS 1557507717 17:01:57 1557507717 17:01:57 41ms 0 0 0 0

Currently, snapshots are not sorted by repository name.
If there is an unexpected error for at least one of the repositories, you will get the following response:

{
    "error": {
        "root_cause": [
            {
                "type": "exception",
                "reason": "Repositories [repo1,repo2] failed to retrieve snapshots"
            }
        ],
        "type": "exception",
        "reason": "Repositories [repo1,repo2] failed to retrieve snapshots"
    },
    "status": 500
}

Closes #41210

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@original-brownbear left a comment

Some comments, mainly around the concurrent repo polling and thread pool usage.

allSnapshotIds.put(snapshotId.getName(), snapshotId);
// run concurrently for all repos on SNAPSHOT thread pool
for (final RepositoryMetaData repo : repos) {
futures.add(threadPool.executor(ThreadPool.Names.SNAPSHOT).submit(
Member

I wonder if the SNAPSHOT pool is the right choice here. If you have an ongoing snapshot, this code could potentially become completely blocked. I think using the GENERIC pool here (while always a little dirty) is the safer and more stable choice.
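Sketched against the snippet quoted above, the suggested change would look roughly like this (getSnapshotsForRepo is a hypothetical stand-in for the per-repository work):

// Dispatch on GENERIC so an ongoing snapshot running on the SNAPSHOT pool
// cannot starve the listing tasks.
for (final RepositoryMetaData repo : repos) {
    futures.add(threadPool.executor(ThreadPool.Names.GENERIC).submit(
        () -> getSnapshotsForRepo(repo.name())));
}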

Contributor Author

final Path repoPath = randomRepoPath();
logger.info("--> create repository with name " + repoName);
assertAcked(client.admin().cluster().preparePutRepository(repoName)
.setType("mock").setSettings(Settings.builder()
Member

No need to use the mock wrapper here. Let's just use a quick one-liner like in https://github.com/elastic/elasticsearch/pull/42090/files#diff-26934c4ac1260dd2d92d86b427567676R423 to get a new FS repo :) Not the worst idea to keep this class's length/complexity down imo :D

Contributor Author

}
}

Supplier<String[]> repoNames = () -> randomFrom(new String[]{"_all"},
Member

Probably easier to read if you just inline this into the single use of this supplier :)

Contributor Author

Ooops, 94f4d63

this.failedResponses = new HashMap<>();
for (Response response : responses) {
if (response.snapshots != null) {
this.successfulResponses.put(response.repository, response.snapshots);
Member

Maybe assert that there wasn't anything in this map for key response.repository? (The same goes for failedResponses two lines down.)
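A minimal sketch of the suggested assertion, relying on Map.put returning the previous mapping for the key:

// A non-null return value means the same repository appeared twice in the responses.
List<SnapshotInfo> previous = successfulResponses.put(response.repository, response.snapshots);
assert previous == null : "duplicate response for repository [" + response.repository + "]";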

Contributor Author

* Returns a map of repository name to the list of {@link SnapshotInfo} for each successful response.
*/
public Map<String, List<SnapshotInfo>> getSuccessfulResponses() {
return Map.copyOf(successfulResponses);
Member

NIT: instead of mutating the field in place in the setter, just assign a Collections.unmodifiableMap to successfulResponses and failedResponses to save yourself the copies in the getters here and stay a little more aligned with the style we use elsewhere. (+ Map.copyOf won't backport to 7.x :))
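Sketched, that suggestion amounts to wrapping once at construction time (Collections.unmodifiableMap is available on the JDKs 7.x supports, unlike Map.copyOf, which is Java 10+):

// In the constructor: wrap once instead of copying on every read.
this.successfulResponses = Collections.unmodifiableMap(successfulResponses);
this.failedResponses = Collections.unmodifiableMap(failedResponses);

public Map<String, List<SnapshotInfo>> getSuccessfulResponses() {
    return successfulResponses; // already unmodifiable, no defensive copy needed
}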

Contributor Author

if (out.getVersion().onOrAfter(MULTIPLE_REPOSITORIES_SUPPORT_ADDED)) {
out.writeStringArray(repositories);
} else {
out.writeString(repositories[0]);
Member

This could lead to some strange behaviour in a mixed cluster, couldn't it? If I fire a request for multiple snapshots at an older master, it'll just respond with the snapshots for a single repo. Also, what about special values like _all here?

Shouldn't we assert repositories.length == 1 here, and then add logic to ensure we err out on multiple-repository requests in a mixed-version cluster instead of quietly returning "something"?

Contributor Author

I think GetSnapshotsResponse.writeTo protects us from this problem: https://github.com/elastic/elasticsearch/pull/42090/files#diff-4660cd4d86c46cf0b715f6a338e80034R210
So when running a mixed 7.latest/8.x cluster, if a 7.latest node receives the request, it will generate a GetSnapshotsResponse, and its writeTo method will throw IllegalArgumentException if snapshots from multiple repos are requested.
Because of special values like "_all" and "*", it's hard to tell in advance whether snapshots from multiple repos are requested.

Contributor Author

However, I'm not sure that the writeTo/readFrom logic is executed if the local transport is used.

Member

> If 7.latest node receives the request it will generate GetSnapshotResponse and its writeTo method will throw IllegalArgumentException if snapshots from multiple repos are requested.

Not sure I understand. We are simply writing a truncated request to older nodes here, so we won't get failures; we'll simply send a truncated request and get unexpected results (concretely, if we request multiple repos and the request hits an older master, we will simply get a response containing the snapshots for the first repository and ignore the rest, won't we?).

@andrershov
Contributor Author

Thanks, @original-brownbear, for the review; the PR is ready for a second pass.

@original-brownbear
Member

@andrershov on it shortly :)

Also, checkstyle complains here:

00:14:51 > Task :server:checkstyleTest
00:14:51 [ant:checkstyle] [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request-1/server/src/test/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java:119:8: Unused import - java.util.function.Supplier. [UnusedImports]
00:14:51 
00:14:51 > Task :server:checkstyleTest FAILED

Member

@original-brownbear left a comment

This looks fine to me, apart from the BwC question I left (+ checkstyle) I think.


out.writeException(error.getValue());
}
} else {
if (successfulResponses.size() + failedResponses.size() != 1) {
Member

As I mentioned above, this is kind of weird to me. How would an older node request multiple repositories in the first place? It seems like we don't have to worry about this, but instead have to throw on newer nodes when the master is too old.

Contributor Author

@original-brownbear Oh yes, I agree.
Let's make it clear that we don't backport these changes to 7.x at all. And yes, if an 8.x node sends a GetSnapshotsRequest to a 7.x master, it needs to check that snapshots from only one repository are requested.
But I think we still need this check in GetSnapshotsResponse. A 7.x node never sends more than one repository in a request to an 8.x node; however, that one string could be not the name of a repo but "repo*" or "_all". Despite being one string, it resolves into multiple repos on the 8.x node, so the 8.x node might return more than one response and needs to throw an exception if that's actually the case. 71f6aa1
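A hedged sketch of the resulting guard (cf. 71f6aa1); the two write helpers are hypothetical stand-ins for the serialization details:

@Override
public void writeTo(StreamOutput out) throws IOException {
    if (out.getVersion().onOrAfter(MULTIPLE_REPOSITORIES_SUPPORT_ADDED)) {
        writeMultiRepoFormat(out); // hypothetical helper: one entry per repository
    } else {
        // A pre-multi-repo reader can only represent a single repository's result,
        // so a wildcard that resolved to several repositories must fail loudly here
        // instead of being silently truncated.
        if (successfulResponses.size() + failedResponses.size() != 1) {
            throw new IllegalArgumentException("Requesting snapshots from multiple " +
                "repositories is not supported in versions prior to " +
                MULTIPLE_REPOSITORIES_SUPPORT_ADDED);
        }
        writeSingleRepoFormat(out); // hypothetical helper: legacy single-repo format
    }
}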

@original-brownbear
Member

@andrershov just FYI, tests are broken here now (just in case no Slack bot informed you about it :)).

andrershov added a commit that referenced this pull request Jun 25, 2019
PR #42090 added support for requesting snapshots from multiple repositories, and it changed the response format in a non-BwC way.
The breaking changes docs mention the response format change; however, there is no example of what the new format looks like. Pointed out by @dakrone.
This commit adds the missing example.
bizybot pushed a commit to bizybot/elasticsearch that referenced this pull request Jul 7, 2019
Due to recent changes done for converting `repository-hdfs` to test
clusters (elastic#41252), the `integTestSecure*` tasks did not depend on
`secureHdfsFixture`, so running them would fail because the fixture
was not available. This commit adds the dependency on the fixture
to the tasks.

The `secureHdfsFixture` is an `AntFixture`, which is a spawned process.
Internally it waits 30 seconds for the resources to become available.
On my local machine it took almost 45 seconds, so I have added the wait
time as an input to the `AntFixture` (defaulting to 30 seconds) and set
it to 60 seconds for the secure HDFS fixture.

Another problem while running the `secureHdfsFixture` was that it would
fail when the port was not a privileged port (i.e. a system port, port < 1024).
By default the datanode address key tries to find a free port, which would
later fail when running in secure mode. To address this, in secure mode we
find a free port below 1024 and set it in the config. The config
`DFSConfigKeys.IGNORE_SECURE_PORTS_FOR_TESTING_KEY` is set to `true`, but it did not help.
https://fisheye.apache.org/browse/~br=branch-2.8.1/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/SecureDataNodeStarter.java?hb=true#to140

The integ tests for secure HDFS were disabled for a long time, so the
changes done in elastic#42090 to fix the tests are also done in this commit.

Successfully merging this pull request may close these issues.

Add Snapshots endpoint for retrieving all snapshots
5 participants