
Cleanup Concurrent RepositoryData Loading #48329

Merged: 5 commits, Oct 23, 2019

Conversation

original-brownbear
Member

The loading of RepositoryData is not an atomic operation:
it uses a list + get combination of calls.
This led to accidentally returning empty repository data
for generations >= 0, which must always exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent-modification scenario, with the repository
actually moving two index generations between listing out the
index-N blobs and loading the latest one. Since we only keep
two index-N blobs around at a time, this led to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely but possible event of running into this
issue during a snapshot create or delete operation on master failover, on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.
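
To make the race concrete, here is a minimal standalone sketch of the list + get pattern against a store that keeps only the two most recent index-N blobs. The names (IndexBlobStore, latestGeneration, readIndexBlob) are illustrative stand-ins, not the actual repository API:

import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative stand-in for a blob container holding index-N blobs.
final class IndexBlobStore {
    // generation -> serialized repository data
    private final ConcurrentSkipListMap<Long, String> blobs = new ConcurrentSkipListMap<>();

    long latestGeneration() {                   // the "list" step
        return blobs.isEmpty() ? -1L : blobs.lastKey();
    }

    Optional<String> readIndexBlob(long gen) {  // the "get" step
        return Optional.ofNullable(blobs.get(gen));
    }

    // Writers keep only the two most recent index-N blobs around.
    void writeNewGeneration(String data) {
        long next = latestGeneration() + 1;
        blobs.put(next, data);
        blobs.headMap(next - 1).clear();        // drop generations older than next - 1
    }
}

public class RaceSketch {
    public static void main(String[] args) {
        IndexBlobStore store = new IndexBlobStore();
        store.writeNewGeneration("repo-data-0");

        long gen = store.latestGeneration();    // reader lists and sees generation 0
        // Concurrently, the repository moves ahead two generations, so the
        // blob for generation 0 is deleted before the reader's "get" runs:
        store.writeNewGeneration("repo-data-1");
        store.writeNewGeneration("repo-data-2");

        // Old behavior: silently fall back to empty repository data.
        // Behavior after this change: a missing blob for generation >= 0 throws.
        String data = store.readIndexBlob(gen).orElseThrow(
            () -> new IllegalStateException("index-" + gen + " is missing: "
                + "concurrent modification or repository corruption"));
        System.out.println(data);
    }
}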

I would suggest backporting this all the way, since it can theoretically have
repository-corrupting effects on S3.

Fixes #48122

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear
Member Author

I didn't yet get around to a full test run here, so if this doesn't go green yet, please hold off reviewing until I can fix the tests I may have missed :)

@@ -812,9 +812,6 @@ public void endVerification(String seed) {
public RepositoryData getRepositoryData() {
try {
return getRepositoryData(latestIndexBlobId());
} catch (NoSuchFileException ex) {
original-brownbear (Member Author)

There is no good reason to catch here or below. We break out on generation -1 and return empty data, so whenever we fail to find data for generation >= 0 it's a problem and we mustn't ignore it.
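
In sketch form, the intended control flow after removing the catch looks like this. The stubs below are illustrative, not the real BlobStoreRepository internals; EMPTY_REPO_GEN mirrors the -1 sentinel mentioned above:

import java.io.IOException;
import java.nio.file.NoSuchFileException;

// Minimal sketch of the changed control flow, with stubbed-out blob access.
class RepositoryDataLoadingSketch {
    static final long EMPTY_REPO_GEN = -1L;

    long latestIndexBlobId() throws IOException {        // the "list" step (stubbed)
        return 16L;
    }

    String loadIndexBlob(long gen) throws IOException {  // the "get" step (stubbed)
        throw new NoSuchFileException("index-" + gen);   // simulate the race
    }

    String getRepositoryData() throws IOException {
        final long generation = latestIndexBlobId();
        if (generation == EMPTY_REPO_GEN) {
            return "empty repository data";  // only here is "nothing found" fine
        }
        // No catch for NoSuchFileException anymore: a missing blob for a
        // generation >= 0 signals concurrent modification or corruption and
        // must surface as an error instead of empty repository data.
        return loadIndexBlob(generation);
    }
}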

@original-brownbear
Member Author

original-brownbear commented Oct 22, 2019

Yup, this is, not entirely unexpectedly, failing some REST tests:

Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=no_such_file_exception, reason=/dev/shm/elastic+elasticsearch+pull-request-1/client/rest-high-level/build/testclusters/integTest-0/repo/index-16]
	at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
	at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
	at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
	at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:603)
	at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:169)
	... 45 more

-> WIP

EDIT: #46250 fixed the tests here, sort of by accident, by changing the order of shard- and root-level metadata updates :) -> should be good to review now

@tlrx (Member) left a comment

LGTM, thanks Armin

@@ -925,9 +922,6 @@ private RepositoryData getRepositoryData(long indexGen) {
repositoryData = RepositoryData.snapshotsFromXContent(parser, indexGen);
}
return repositoryData;
} catch (NoSuchFileException ex) {
Member

nit: while we are at it, there are unneeded Long.toString() calls and a redundant local variable repositoryData

original-brownbear (Member Author)

I can't find it right now, but others disagreed with removing the Long.toString in another PR :( But yeah, let's clean up the redundant locals :)
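
For illustration, the redundant-local cleanup in generic form, using plain java.io types rather than the real XContentParser:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class RedundantLocalSketch {
    // Before: the parsed value is assigned to a local that is only returned.
    static String parseBefore(Reader source) throws IOException {
        String result;
        try (BufferedReader reader = new BufferedReader(source)) {
            result = reader.readLine();
        }
        return result;
    }

    // After: return directly from the try-with-resources block; the reader
    // is still closed before the method completes.
    static String parseAfter(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(parseAfter(new StringReader("index-16")));
    }
}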

@original-brownbear
Copy link
Member Author

Thanks Tanguy!

@original-brownbear original-brownbear merged commit a0f80bd into elastic:master Oct 23, 2019
@original-brownbear original-brownbear deleted the 48122 branch October 23, 2019 10:33
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Oct 23, 2019
Just like elastic#48329 (and using the changes in that PR),
we can run into a concurrent repo modification that we
will throw on, and must retry until consistent handling of
this situation is implemented.

Closes elastic#47384
original-brownbear added a commit that referenced this pull request Oct 24, 2019
Just like #48329 (and using the changes in that PR),
we can run into a concurrent repo modification that we
will throw on, and must retry until consistent handling of
this situation is implemented.

Closes #47834
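
Until that consistent handling exists, callers (tests in particular) have to retry on the new exception. A minimal sketch of such a retry loop, with hypothetical names and timings:

import java.util.concurrent.TimeUnit;

// Hypothetical retry wrapper for an operation that may throw on a concurrent
// repository modification; names and backoff are illustrative.
final class RetrySketch {
    interface Loader<T> {
        T load() throws Exception;
    }

    static <T> T retryUntilConsistent(Loader<T> loader, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return loader.load();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e;  // give up: still inconsistent after maxAttempts tries
                }
                TimeUnit.MILLISECONDS.sleep(100L * attempt);  // simple linear backoff
            }
        }
    }
}
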
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 2, 2019
original-brownbear added a commit that referenced this pull request Nov 2, 2019
original-brownbear added a commit that referenced this pull request Nov 2, 2019
@original-brownbear original-brownbear restored the 48122 branch January 6, 2021 14:07

Successfully merging this pull request may close these issues.

[CI] SLMSnapshotBlockingIntegTests.testBasicPartialRetention failed