-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get snapshots support for multiple repositories #42090
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, mainly around the concurrent repo polling and thread pool usage.
allSnapshotIds.put(snapshotId.getName(), snapshotId); | ||
// run concurrently for all repos on SNAPSHOT thread pool | ||
for (final RepositoryMetaData repo : repos) { | ||
futures.add(threadPool.executor(ThreadPool.Names.SNAPSHOT).submit( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the SNAPSHOT
pool is the right choice here. If you have an ongoing snapshot, this code could become completely blocked potentially. I think using the GENERIC
pool here (while always a little dirty) is the safer and more stable choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...n/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java
Outdated
Show resolved
Hide resolved
final Path repoPath = randomRepoPath(); | ||
logger.info("--> create repository with name " + repoName); | ||
assertAcked(client.admin().cluster().preparePutRepository(repoName) | ||
.setType("mock").setSettings(Settings.builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to use the mock wrapper here. Let's just use quick one liner like in https://github.com/elastic/elasticsearch/pull/42090/files#diff-26934c4ac1260dd2d92d86b427567676R423 to get a new FS repo :) Not the worst idea to keep this class' length/complexity down imo :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
} | ||
|
||
Supplier<String[]> repoNames = () -> randomFrom(new String[]{"_all"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably easier to read if you just inline this into the single use of this supplier :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops, 94f4d63
this.failedResponses = new HashMap<>(); | ||
for (Response response : responses) { | ||
if (response.snapshots != null) { | ||
this.successfulResponses.put(response.repository, response.snapshots); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe assert
that there wasn't anything in this map for key response.repository
? (same goes for 2 lines down failedResponses
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Returns a map of repository name to the list of {@link SnapshotInfo} for each successful response. | ||
*/ | ||
public Map<String, List<SnapshotInfo>> getSuccessfulResponses() { | ||
return Map.copyOf(successfulResponses); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: instead of mutating the field in place in the setter, just assign a Collections.unmodifyableMap
to successfulResponses
and failedResponses
to save yourself the copies in the getters here and stay a little more aligned with the style we use elsewhere. (+ Map.copyOf
won't back port to 7.x :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...src/main/java/org/elasticsearch/action/admin/cluster/snapshots/get/GetSnapshotsResponse.java
Outdated
Show resolved
Hide resolved
if (out.getVersion().onOrAfter(MULTIPLE_REPOSITORIES_SUPPORT_ADDED)) { | ||
out.writeStringArray(repositories); | ||
} else { | ||
out.writeString(repositories[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could lead to some strange behaviour in a mixed cluster couldn't it? If I fire a request for multiple snapshots at an older master
it'll just respond with the snapshots for a single repo. Also, what about special values like _all
here?
Shouldn't we assert repositories.length == 1
here and then add logic to ensure we err out on multiple-repository requests in a mixed version cluster instead of quietly returning "something"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, GetSnapshotsResponse.writeTo
method protects us from this problem. https://github.com/elastic/elasticsearch/pull/42090/files#diff-4660cd4d86c46cf0b715f6a338e80034R210
So when running 7.latest and 8.x cluster mixed cluster. If 7.latest node receives the request it will generate GetSnapshotResponse
and its writeTo
method will throw IllegalArgumentException
if snapshots from multiple repos are requested.
Because of this special values, like "_all", "*" it's hard to tell in advance if snapshots for multiple repos are requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I'm not sure that writeTo/readFrom logic is executed if local transport is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 7.latest node receives the request it will generate GetSnapshotResponse and its writeTo method will throw IllegalArgumentException if snapshots from multiple repos are requested.
Not sure I understand. We are simply writing a truncated request to 7.x nodes here then we won't get failures but simply send a truncated request and get unexpected results (concretely, if we request multiple repos and hit an 8.0 node we will simply get a response containing the snapshots for the first repository and simply ignore the rest won't we?)?
Thanks, @original-brownbear for the review, PR is ready for the second pass. |
@andrershov on it shortly :) Also, checkstyle complains here:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me, apart from the BwC question I left (+ checkstyle) I think.
if (out.getVersion().onOrAfter(MULTIPLE_REPOSITORIES_SUPPORT_ADDED)) { | ||
out.writeStringArray(repositories); | ||
} else { | ||
out.writeString(repositories[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 7.latest node receives the request it will generate GetSnapshotResponse and its writeTo method will throw IllegalArgumentException if snapshots from multiple repos are requested.
Not sure I understand. We are simply writing a truncated request to 7.x nodes here then we won't get failures but simply send a truncated request and get unexpected results (concretely, if we request multiple repos and hit an 8.0 node we will simply get a response containing the snapshots for the first repository and simply ignore the rest won't we?)?
out.writeException(error.getValue()); | ||
} | ||
} else { | ||
if (successfulResponses.size() + failedResponses.size() != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned above,this is kind of weird to me. How would an older node request multiple repositories in the first place? It seems like we don't have to worry about this, but instead have to throw on newer nodes when master is too old.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@original-brownbear Oh, yes, I agree.
Let's make it clear we don't backport these changes to 7.x at all. And yes, if 8.x node sends GetSnapshotsRequest
to 7.x master, then it needs to check that only one repository snapshots are requested.
But I think, we still need this check in GetSnapshotsResponse
. 7.x node never sends more than one repository in the request to 8.x node, however, it could be not the name of the repo, but "repo*" or "_all". Despite being one string, it resolves into multiple repos on 8.x node, so 8.x node might return more than one response and needs to throw an exception if it's actually the case. 71f6aa1
@andrershov just FYI tests are broken here now (just in case no Slack bot informed you about it :)). |
#42090 PR added support for requesting snapshots from multiple repositories. And it has changed the response format in a non-BwC way. There is a mentioning of a response format change in the breaking changes docs, however, there is no example of how new format looks like. Pointed out by @dakrone. This commit adds the missing example.
Due to recent changes done for converting `repository-hdfs` to test clusters (elastic#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of fixture to the task. The `secureHdfsFixture` is an AntFixture which is spawned process. Internally it waits for 30 seconds for the resources to be made available. For my local machine it took almost 45 seconds to be available so I have added the wait time as an input to the AntFixture defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. Another problem while running the `secureHdfsFixture` where it would fail due to port not being privileged port (i.e system port, port < 1024). By default datanode address key tries to find a free port and this would later on fail in case it is running in a secure mode. To address this in case of secure mode we find free port below 1024 and set it in the config. The config `DFSConfigKeys.IGNORE_SECURE_PORTS_FOR_TESTING_KEY` is set to `true` but it did not help. https://fisheye.apache.org/browse/~br=branch-2.8.1/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/SecureDataNodeStarter.java?hb=true#to140 The integ test for secure hdfs were disabled for long time and so the changes done in elastic#42090 to fix the tests are also done in this commit.
Due to recent changes are done for converting `repository-hdfs` to test clusters (#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in #42090 to fix the tests are also done in this commit.
Due to recent changes are done for converting `repository-hdfs` to test clusters (elastic#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in elastic#42090 to fix the tests are also done in this commit.
Due to recent changes are done for converting `repository-hdfs` to test clusters (elastic#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in elastic#42090 to fix the tests are also done in this commit.
Due to recent changes are done for converting `repository-hdfs` to test clusters (elastic#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in elastic#42090 to fix the tests are also done in this commit.
Due to recent changes are done for converting `repository-hdfs` to test clusters (#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in #42090 to fix the tests are also done in this commit.
Due to recent changes are done for converting `repository-hdfs` to test clusters (#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in #42090 to fix the tests are also done in this commit.
Due to recent changes are done for converting `repository-hdfs` to test clusters (#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in #42090 to fix the tests are also done in this commit.
Description
This commit adds multiple repositories support to get snapshots request.
If some repository throws an exception this method does not fail fast instead, it returns results for all repositories.
This PR is opened in favour of #41799, because we decided to change the response format in a non-BwC manner. It makes sense to read discussion of the aforementioned PR.
This is the continuation of work done here #15151.
_snapshot API
Now the following requests are supported:
This commit breaks BwC of response format for this request, that's why this PR is not backported to 7.x. For the response format see examples sections.
_cat/snapshots API
Cat snapshot API also supports multiple repositories now.
In the table returned new column named "repository" is added. If some repo fails with an exception, this method just reports that some repos have failed. See examples section.
is also supported and will return all snapshots for all repositories.
Interestingly enough, the previous implementation also was registering
/_cat/snapshots
filter but a request to this endpoint always resulted inImplementation
Technically,
TransportGetSnapshotsAction
callsTransportGetRepositoriesAction
which resolves repo names and wildcards in request to concrete repository names.After that, we submit tasks in parallel for all returned repositories and are using the already implemented algorithm to get matching snapshots in the repositories.
In terms of the transport level protocol, 2 classes are changed:
GetSnapshotsRequest
now accepts the list of repository names (including wildcards) instead of single repo name.GetSnapshotsResponse
is changed in non-BwC manner.In terms of REST protocol, 2 classes are changed:
SnapshotRequestsConverters.getSnapshots
now can understand the list of repository names inGetSnapshotRequest
and adds it as comma separated list to the request path.RestGetSnapshotsAction
can now parse the comma-separated list of the repo names.GetSnapshotsResponse
XContent serialization is completely changed.RestSnapshotAction
which is responsible for handling cat API for the snapshots is extended with repository field as well.Testing
SharedClusterSnapshotRestoreIT.testGetSnapshotsMultipleRepos
tests that multiple repositories and wildcard work fine on transport level.SnapshotRequestConvertersTests.getSnapshots
tests that high-level REST client correctly sends the request when multiple repositories are used in the request.SnapshotIT.testGetSnapshots
tests getting 2 snapshots from 2 different repositories using high-level REST client.Documentation
Seems that it makes sense to make adjustments to the following documentation pages:
Both of them should talk about the ability to specify multiple repositories when querying the snapshots (TBD).
Since it's breaking change also breaking changes doc should be extended (TBD).
Examples
Working with
GetSnapshotsResponse
Consider you have response.
First of all, you can check if there are any failures
response.isFailed()
.If you were using wildcard/regex for repository names, you can use
response.getRepositories
to get the set of repositories.If you know particular repository name, you can just do
response.getSnapshots(repoName)
and it will return the list ofSnapshotInfo
for the repository or it throwsElasticsearchException
. So all methods that previously calledresponse.getSnapshots()
can now achieve the same behaviour by just passing therepoName
.Finally, you can get a map of successful and failed responses using
getSuccessfulResponses
andgetFailedResponses
._snapshot API
Consider you have 2 repositories:
And you have one snapshot in each repository:
Now the following commands
will give you the same output
Responses are currently not sorted by repository name. However, SnapshotInfos are still sorted by startDate and then by snapshotId.
The following request
results in
because snap1 exists in repo1, but does not exist in repo2.
This is an example of partial failure.
Note that HTTP status code is 200 OK, despite partial failure.
Even with full failure status code will be 200 OK.
Currently, successful repositories are always reported on top.
If you specify
ignore_unavailable
flag:And get the following response:
If you specify missing repository in the request
You get top-level error:
_cat/snapshot API
Consider you have 2 repositories:
And you have one snapshot in each repository:
Now you can get all the snapshots using any of this API calls:
The response will contain the repository name:
Currently snapshots are not sorted by repo name.
If there is an unexpected error for at least one of the repositories you will get the following
response:
Closes #41210