Fix Issue with Concurrent Snapshot Init + Delete #38518
Conversation
original-brownbear commented Feb 6, 2019 (edited):
- Ensures we're not finalizing a snapshot in the repository while it is initializing on another thread (see the sketch below)
- Closes #38489 ([CI] SharedClusterSnapshotRestoreIT.testAbortedSnapshotDuringInitDoesNotStart fails)
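A minimal sketch of the idea (hypothetical class and method names, not the actual SnapshotsService code): track which snapshots are still being initialized in the repository, and refuse to finalize any snapshot that is still in that set.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch only: the initializingSnapshots set mirrors the actual change;
// everything else here is invented for illustration.
class SnapshotInitGuard {

    // Snapshots whose repository-side initialization is still running.
    private final Set<String> initializingSnapshots =
        Collections.synchronizedSet(new HashSet<>());

    void beginSnapshot(String snapshotName, Runnable repositoryInit) {
        initializingSnapshots.add(snapshotName);
        try {
            // Writes the snapshot's metadata to the repository; while this
            // runs, no other thread may finalize (or delete) the snapshot.
            repositoryInit.run();
        } finally {
            initializingSnapshots.remove(snapshotName);
        }
    }

    // Consulted by the code that finalizes completed or aborted snapshots.
    boolean mayFinalize(String snapshotName) {
        return initializingSnapshots.contains(snapshotName) == false;
    }
}
```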
Pinging @elastic/es-distributed
```diff
@@ -701,8 +714,8 @@ public void applyClusterState(ClusterChangedEvent event) {
             // 3. Snapshots in any other state that have all their shard tasks completed
             snapshotsInProgress.entries().stream().filter(
                 entry -> entry.state().completed()
-                    || entry.state() == State.INIT && initializingSnapshots.contains(entry.snapshot()) == false
-                    || entry.state() != State.INIT && completed(entry.shards().values())
+                    || initializingSnapshots.contains(entry.snapshot()) == false
```
Before this change we were finalising all snapshots that had all their shard tasks completed here. This caught the case where a snapshot had just gone from INIT to ABORTED and was finalised in the repository while, at the same time, we were still initialising it in https://github.com/elastic/elasticsearch/pull/38518/files#diff-a0853be4492c052f24917b5c1464003dR413 (this is what the failing test in #38489 forced by setting the on-init block in the MockRepository). With this change we will never finalise a snapshot here that is still being initialised.
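Rendered as a self-contained sketch (the Entry and State types below are simplified stand-ins for Elasticsearch's classes, and the exact shape of the new condition's second clause is an assumption based on the description above), the updated filter reads roughly like this:

```java
import java.util.List;
import java.util.Set;

class FinalizeFilterSketch {

    enum State {
        INIT, STARTED, ABORTED, SUCCESS;

        boolean completed() { return this == SUCCESS; }
    }

    // Simplified stand-in for SnapshotsInProgress.Entry.
    record Entry(String snapshot, State state, List<Boolean> shardsDone) {}

    static boolean completed(List<Boolean> shardStatuses) {
        return shardStatuses.stream().allMatch(Boolean::booleanValue);
    }

    // Never end a snapshot that another thread is still initializing in the
    // repository, no matter what state it has moved to in the meantime.
    static boolean mayEnd(Entry entry, Set<String> initializingSnapshots) {
        return entry.state().completed()
            || initializingSnapshots.contains(entry.snapshot()) == false
               && (entry.state() == State.INIT || completed(entry.shardsDone()));
    }
}
```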
```
@@ -433,6 +434,8 @@ public ClusterState execute(ClusterState currentState) {
                    if (entry.state() == State.ABORTED) {
                        entries.add(entry);
                        assert entry.shards().isEmpty();
                        hadAbortedInitializations = true;
```
Since we now don't finalise this case in applyClusterState, we need to end the snapshot once the cluster state for the init has been processed, so we set the flag here to mark that case. A sketch of that flow follows.
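A minimal, self-contained model of that flow (the types are stand-ins; only hadAbortedInitializations and the shape of execute/clusterStateProcessed mirror the diffs):

```java
import java.util.List;

class AbortedInitFlowSketch {

    enum State { INIT, ABORTED, SUCCESS }

    record Entry(String snapshot, State state, List<Boolean> shards) {}

    private boolean hadAbortedInitializations = false;

    // Mirrors the execute(ClusterState) step above: a snapshot deleted while
    // still initializing shows up as ABORTED with no shard entries yet.
    void execute(List<Entry> entries) {
        for (Entry entry : entries) {
            if (entry.state() == State.ABORTED) {
                assert entry.shards().isEmpty();
                hadAbortedInitializations = true;
            }
        }
    }

    // Mirrors clusterStateProcessed(...): applyClusterState no longer ends
    // snapshots that are still initializing, so aborted initializations are
    // ended here instead, once the init's cluster state has been processed.
    void clusterStateProcessed(List<Entry> newState) {
        if (hadAbortedInitializations) {
            newState.stream()
                .filter(entry -> entry.state() == State.ABORTED)
                .forEach(this::endSnapshot);
        }
    }

    private void endSnapshot(Entry entry) {
        // Placeholder for the repository finalization done by the real code.
        System.out.println("ending aborted snapshot " + entry.snapshot());
    }
}
```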
Jenkins run elasticsearch-ci/1
Jenkins test this
Jenkins test this
Jenkins run elasticsearch-ci/1
```
                if (hadAbortedInitializations) {
                    final SnapshotsInProgress snapshotsInProgress = newState.custom(SnapshotsInProgress.TYPE);
                    if (snapshotsInProgress != null) {
```
we should be able to assert != null here?
```
                final SnapshotsInProgress snapshotsInProgress = newState.custom(SnapshotsInProgress.TYPE);
                if (snapshotsInProgress != null) {
                    final SnapshotsInProgress.Entry entry = snapshotsInProgress.snapshot(snapshot.snapshot());
                    if (entry != null) {
```
we should be able to assert != null here?
I guess I was paranoid about potentially throwing an NPE on the cluster state thread :D but yeah, an assert should in theory be just fine => moving to assert :)
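For illustration (plain Java, not the actual Elasticsearch code), the agreed-on pattern trades the defensive null check for an assertion that documents the invariant and fails fast in test runs with -ea, as in CI:

```java
import java.util.Map;

class AssertVsNullCheckSketch {

    // Invariant from the discussion above: once the init has been processed,
    // an in-progress entry for the snapshot must exist in the new state.
    static String findEntry(Map<String, String> snapshotsInProgress, String snapshot) {
        String entry = snapshotsInProgress.get(snapshot);
        // Fails fast under -ea; with assertions disabled, a broken invariant
        // would surface as an NPE at the first use of the return value.
        assert entry != null : "no in-progress entry for " + snapshot;
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(findEntry(Map.of("snap-1", "ABORTED"), "snap-1"));
    }
}
```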
Jenkins run elasticsearch-ci/1
Jenkins run elasticsearch-ci/1
@original-brownbear I guess once this is fixed on master it is going to be backported to at least 7.0? Just asking because we got failures on that branch today.
@cbuescher yeah, it should be possible to backport this to any affected branch, and I'll do that. Added the labels :)
* Fix Issue with Concurrent Snapshot Init + Delete by ensuring that we're not finalizing a snapshot in the repository while it is initializing on another thread
* Closes #38489
I went ahead and backported this since it was still causing problems in CI