Fix Issue with Concurrent Snapshot Init + Delete #38518

Merged
original-brownbear merged 5 commits into elastic:master from the 38489 branch on Feb 8, 2019

Conversation

original-brownbear
Member

@original-brownbear original-brownbear commented Feb 6, 2019

@original-brownbear original-brownbear added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 labels Feb 6, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@@ -701,8 +714,8 @@ public void applyClusterState(ClusterChangedEvent event) {
     // 3. Snapshots in any other state that have all their shard tasks completed
     snapshotsInProgress.entries().stream().filter(
         entry -> entry.state().completed()
-            || entry.state() == State.INIT && initializingSnapshots.contains(entry.snapshot()) == false
-            || entry.state() != State.INIT && completed(entry.shards().values())
+            || initializingSnapshots.contains(entry.snapshot()) == false
Member Author

@original-brownbear original-brownbear Feb 6, 2019

Previously, this filter finalised every snapshot whose shards had all completed.
That also matched a snapshot that had just gone from INIT to ABORTED, so it was finalised in the repository while we were still initialising it in https://github.com/elastic/elasticsearch/pull/38518/files#diff-a0853be4492c052f24917b5c1464003dR413 (this is the race that the failing test in #38489 forces by setting the on-init block in the MockRepository).

With this change we will never finalise a snapshot here that is still being initialised.
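
To make the difference concrete, here is a minimal, self-contained sketch of the old and new filter conditions. It is not the code from `SnapshotsService`: the `Entry` class, the `allShardsCompleted` flag, and the string snapshot names are simplified stand-ins, and because the hunk above is cut off after the `initializingSnapshots.contains(...)` line, the grouping of the remaining new condition is an assumption.

```java
import java.util.Set;

// Simplified stand-ins for the real snapshot types; every name here is illustrative.
final class FinalizeFilterSketch {

    enum State {
        INIT(false), STARTED(false), SUCCESS(true), FAILED(true), ABORTED(false);

        private final boolean completed;

        State(boolean completed) {
            this.completed = completed;
        }

        boolean completed() {
            return completed;
        }
    }

    static final class Entry {
        final String snapshot;
        final State state;
        final boolean allShardsCompleted; // stands in for completed(entry.shards().values())

        Entry(String snapshot, State state, boolean allShardsCompleted) {
            this.snapshot = snapshot;
            this.state = state;
            this.allShardsCompleted = allShardsCompleted;
        }
    }

    // Old condition: any non-INIT entry with all shards completed is finalised,
    // even if the snapshot is still being initialised.
    static boolean shouldEndOld(Entry e, Set<String> initializing) {
        return e.state.completed()
            || e.state == State.INIT && initializing.contains(e.snapshot) == false
            || e.state != State.INIT && e.allShardsCompleted;
    }

    // New condition (assumed grouping): an entry that is still being
    // initialised is never finalised here unless its state is already completed.
    static boolean shouldEndNew(Entry e, Set<String> initializing) {
        return e.state.completed()
            || initializing.contains(e.snapshot) == false
               && (e.state == State.INIT || e.allShardsCompleted);
    }

    public static void main(String[] args) {
        Set<String> initializing = Set.of("snap-1");
        // Aborted during init: no shards assigned yet, so "all shards completed" holds vacuously.
        Entry abortedDuringInit = new Entry("snap-1", State.ABORTED, true);
        System.out.println(shouldEndOld(abortedDuringInit, initializing)); // prints true
        System.out.println(shouldEndNew(abortedDuringInit, initializing)); // prints false
    }
}
```

In this sketch the entry aborted during init passes the old filter and is rejected by the new one, which is the case the comment above describes.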

@@ -433,6 +434,8 @@ public ClusterState execute(ClusterState currentState) {

     if (entry.state() == State.ABORTED) {
         entries.add(entry);
+        assert entry.shards().isEmpty();
+        hadAbortedInitializations = true;
Member Author

Since we no longer finalise this case in applyClusterState, we need to end the snapshot once the cluster state update for the init has been processed, so we set the flag here to mark that case.
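
The flag-then-finish pattern described here can be sketched roughly as follows. This is a toy model rather than the actual update task: `BeginSnapshotTask`, its `execute`/`clusterStateProcessed` methods, and the `endedSnapshots` list are hypothetical names that only mimic the two phases of a cluster state update (compute the new state, then react once it has been applied).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a two-phase cluster state update; all names are illustrative.
final class AbortedInitSketch {

    enum State { INIT, STARTED, ABORTED }

    static final class Entry {
        final String snapshot;
        final State state;

        Entry(String snapshot, State state) {
            this.snapshot = snapshot;
            this.state = state;
        }
    }

    static final class BeginSnapshotTask {
        private final String snapshot;
        private boolean hadAbortedInitializations = false;
        final List<String> endedSnapshots = new ArrayList<>();

        BeginSnapshotTask(String snapshot) {
            this.snapshot = snapshot;
        }

        // Phase 1: compute the new entries. An entry that was aborted while it
        // was initialising is kept as-is and only remembered via the flag.
        List<Entry> execute(List<Entry> current) {
            List<Entry> entries = new ArrayList<>();
            for (Entry entry : current) {
                if (entry.snapshot.equals(snapshot) == false) {
                    entries.add(entry);
                } else if (entry.state == State.ABORTED) {
                    entries.add(entry);
                    hadAbortedInitializations = true;
                } else {
                    entries.add(new Entry(entry.snapshot, State.STARTED));
                }
            }
            return entries;
        }

        // Phase 2: runs after the new state has been applied. Only now is the
        // aborted snapshot ended, so it cannot race with initialisation.
        void clusterStateProcessed(List<Entry> newState) {
            if (hadAbortedInitializations) {
                for (Entry entry : newState) {
                    if (entry.snapshot.equals(snapshot)) {
                        endedSnapshots.add(entry.snapshot);
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        BeginSnapshotTask task = new BeginSnapshotTask("snap-1");
        List<Entry> newState = task.execute(List.of(new Entry("snap-1", State.ABORTED)));
        task.clusterStateProcessed(newState);
        System.out.println(task.endedSnapshots); // prints [snap-1]
    }
}
```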

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/1

@original-brownbear
Member Author

Jenkins test this

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/1
Jenkins run elasticsearch-ci/2
Jenkins run elasticsearch-ci/default-distro


if (hadAbortedInitializations) {
final SnapshotsInProgress snapshotsInProgress = newState.custom(SnapshotsInProgress.TYPE);
if (snapshotsInProgress != null) {
Contributor

we should be able to assert != null here?

final SnapshotsInProgress snapshotsInProgress = newState.custom(SnapshotsInProgress.TYPE);
if (snapshotsInProgress != null) {
final SnapshotsInProgress.Entry entry = snapshotsInProgress.snapshot(snapshot.snapshot());
if (entry != null) {
Contributor

we should be able to assert != null here?

Member Author

I guess I was paranoid about potentially throwing an NPE in the cluster state thread :D but yea, assert should in theory be just fine => moving to assert :)
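
For readers weighing the two options discussed in this thread, here is a small sketch of a null guard versus an assertion. The `Entry` class and both method names are made up for illustration and do not come from the PR.

```java
// Illustrative comparison of a defensive null check and an assertion.
final class AssertVsGuardSketch {

    static final class Entry {
        final String name;

        Entry(String name) {
            this.name = name;
        }
    }

    // Defensive variant: a missing entry is silently skipped, which can hide
    // a broken invariant.
    static String endWithGuard(Entry entry) {
        if (entry != null) {
            return "ended " + entry.name;
        }
        return "skipped";
    }

    // Asserting variant: documents the invariant. With assertions enabled (-ea),
    // a null entry fails with an AssertionError; with assertions disabled it
    // would instead surface as a NullPointerException on entry.name.
    static String endWithAssert(Entry entry) {
        assert entry != null : "snapshot entry must still be in the cluster state";
        return "ended " + entry.name;
    }

    public static void main(String[] args) {
        Entry entry = new Entry("snap-1");
        System.out.println(endWithGuard(entry));  // prints ended snap-1
        System.out.println(endWithAssert(entry)); // prints ended snap-1
        System.out.println(endWithGuard(null));   // prints skipped
    }
}
```

The assertion keeps production code free of a branch that should be unreachable, while test runs with `-ea` fail loudly if the invariant is ever violated.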

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/1
Jenkins run elasticsearch-ci/2
Jenkins run elasticsearch-ci/default-distro

@cbuescher
Member

@original-brownbear I guess once this is fixed on master it is going to be backported to at least 7.0? Just asking because we got failures on that branch there today.

@original-brownbear
Member Author

@cbuescher yea should be possible to backport this to any affected branch and I'll do that. Added the labels :)

@original-brownbear original-brownbear merged commit b35d3f0 into elastic:master Feb 8, 2019
@original-brownbear original-brownbear deleted the 38489 branch February 8, 2019 14:59
talevy pushed a commit to talevy/elasticsearch that referenced this pull request Feb 15, 2019
* Fix Issue with Concurrent Snapshot Init + Delete by ensuring that we're not finalizing a snapshot in the repository while it is initializing on another thread

* Closes elastic#38489
talevy added a commit that referenced this pull request Feb 16, 2019
* Fix Issue with Concurrent Snapshot Init + Delete by ensuring that we're not finalizing a snapshot in the repository while it is initializing on another thread

* Closes #38489
talevy pushed a commit that referenced this pull request Feb 16, 2019
* Fix Issue with Concurrent Snapshot Init + Delete by ensuring that we're not finalizing a snapshot in the repository while it is initializing on another thread

* Closes #38489
@talevy
Contributor

talevy commented Feb 16, 2019

I went ahead and backported this since it was still causing problems in CI

Successfully merging this pull request may close these issues.

[CI] SharedClusterSnapshotRestoreIT.testAbortedSnapshotDuringInitDoesNotStart Fails
8 participants