Retry ILM steps that fail due to SnapshotInProgressException #37624

dakrone · 2019-01-18T23:31:59Z

Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.

This change adds an abstract step (AsyncRetryDuringSnapshotActionStep) that steps like the ones mentioned above can extend to automatically handle the case where a snapshot is in progress. When the listener wrapper receives a SnapshotInProgressException, it registers a ClusterStateObserver listener that waits until the snapshot has completed, then re-runs the ILM action once no snapshot is running.
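The retry flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `StateObserver`, `Listener`, `RetryDuringSnapshotStep`, `DeleteStep`, and the local `SnapshotInProgressException` are hypothetical stand-ins for the Elasticsearch classes, and the "cluster state" is reduced to a single boolean.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotRetrySketch {

    /** Hypothetical local stand-in for Elasticsearch's SnapshotInProgressException. */
    static final class SnapshotInProgressException extends RuntimeException {
        SnapshotInProgressException(String message) { super(message); }
    }

    /** Hypothetical, simplified version of the listener an async step reports to. */
    interface Listener {
        void onResponse(boolean complete);
        void onFailure(Exception e);
    }

    /** Hypothetical stand-in for ClusterStateObserver: runs callbacks once no snapshot is running. */
    static final class StateObserver {
        private final List<Runnable> waiting = new ArrayList<>();
        private boolean snapshotRunning;

        boolean isSnapshotRunning() { return snapshotRunning; }

        /** Simulates a new cluster state being published. */
        void setSnapshotRunning(boolean running) {
            snapshotRunning = running;
            if (!running) {
                List<Runnable> toRun = new ArrayList<>(waiting);
                waiting.clear();
                toRun.forEach(Runnable::run);
            }
        }

        void waitForNoSnapshot(Runnable onReady) {
            if (snapshotRunning) { waiting.add(onReady); } else { onReady.run(); }
        }
    }

    /** Abstract step: wraps the caller's listener so snapshot failures trigger a retry. */
    abstract static class RetryDuringSnapshotStep {
        abstract void performDuringNoSnapshot(StateObserver observer, Listener listener);

        final void performAction(StateObserver observer, Listener original) {
            performDuringNoSnapshot(observer, new Listener() {
                @Override public void onResponse(boolean complete) { original.onResponse(complete); }
                @Override public void onFailure(Exception e) {
                    if (e instanceof SnapshotInProgressException) {
                        // Do not fail the step; re-run it once the snapshot is gone.
                        observer.waitForNoSnapshot(() -> performAction(observer, original));
                    } else {
                        // Any other failure is passed through unchanged.
                        original.onFailure(e);
                    }
                }
            });
        }
    }

    /** Example concrete step that cannot run while a snapshot is in progress. */
    static final class DeleteStep extends RetryDuringSnapshotStep {
        int attempts = 0;

        @Override
        void performDuringNoSnapshot(StateObserver observer, Listener listener) {
            attempts++;
            if (observer.isSnapshotRunning()) {
                listener.onFailure(new SnapshotInProgressException("cannot delete index during snapshot"));
            } else {
                listener.onResponse(true);
            }
        }
    }

    public static void main(String[] args) {
        StateObserver observer = new StateObserver();
        observer.setSnapshotRunning(true);
        DeleteStep step = new DeleteStep();
        step.performAction(observer, new Listener() {
            @Override public void onResponse(boolean complete) { System.out.println("step completed: " + complete); }
            @Override public void onFailure(Exception e) { System.out.println("step failed: " + e); }
        });
        observer.setSnapshotRunning(false); // snapshot finishes, the step is retried
        System.out.println("attempts: " + step.attempts);
    }
}
```

In this sketch the caller's listener never sees the snapshot failure. It is only notified once the retried action succeeds or fails for some other reason, which is the behavior the description asks for instead of moving to the ERROR step.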

This also adds integration tests for these scenarios (thanks to @talevy in #37552).

Resolves #37541


elasticmachine · 2019-01-18T23:32:01Z

Pinging @elastic/es-core-features

talevy

LGTM

I discussed this outside of GitHub with @dakrone, and we agreed
that unit tests for AsyncRetryDuringSnapshotActionStep's
SnapshotExceptionListener and NoSnapshotRunningListener
would only cover some non-integral branches of the logic for
retrying actions via the cluster state observer. Since we are
confident that the existing integration tests in this PR cover
the successful retry, which is the critical path, they are sufficient
for verifying that these changes do what they intend.

talevy · 2019-01-22T18:30:20Z

...e/src/test/java/org/elasticsearch/xpack/core/indexlifecycle/CloseFollowerIndexStepTests.java

@@ -114,4 +109,30 @@ public void onFailure(Exception e) {
        Mockito.verify(indicesClient).close(Mockito.any(), Mockito.any());
        Mockito.verifyNoMoreInteractions(indicesClient);
    }
+
+    @Override
+    protected CloseFollowerIndexStep createRandomInstance() {


talevy · 2019-01-22T18:43:21Z

...ain/java/org/elasticsearch/xpack/core/indexlifecycle/AsyncRetryDuringSnapshotActionStep.java

+                            performAction(idxMeta, state, observer, originalListener);
+                        }, originalListener::onFailure),
+                        // TODO: what is a good timeout value for no new state received during this time?
+                        TimeValue.timeValueHours(12));


I think waiting 12 hours for a snapshot to finish is reasonable. If there is no progress on this action in that time interval, a user may want to know, so 👍

talevy · 2019-01-22T19:42:12Z

backport to 6.x is blocked on #37723 (SnapshotInProgressException)

talevy · 2019-01-23T07:02:30Z

update: above blocker PR for 6.x was merged

gwbrown

LGTM



talevy and others added 22 commits January 16, 2019 22:21

add test for running certain ILM actions during snapshotting

7aeec21

swap the tests to awaitsfix and test that things succeed

a97d0eb

fix getSnapshotState

e8c43de

fix checkstyle

273b932

Merge remote-tracking branch 'upstream/master' into ilm-snapshot-test

d57f589

WIP

d15507a

Add RetryDuringSnapshotStep

545a40d

Move DeleteStep to use RetryDuringSnapshotStep

af6ad54

Move to real SnapshotInProgressException

8cf8852

Add license header

b5ff014

Call original listener onFailure if it was not a snapshot exception

ce1f998

Checkstyle line length fixes

cc2b329

Use RetryDuringSnapshotStep for FreezeStep as well

b36c48a

Merge remote-tracking branch 'talevy/ilm-snapshot-test' into ilm-retry-after-snapshot-fail

cd2bad7

Unawaitsfix the tests, fix RetryDuringSnapshotStep

2d5dd8d

Merge remote-tracking branch 'origin/master' into ilm-retry-after-snapshot-fail

2b1746c

Fix for unfollow steps after master merge

f1fb55f

Move CloseFollowerIndexStep to extend RetryDuringSnapshotStep

9818012

Add a test for unfollow while a snapshot is ongoing

76099b9

Add some debug logging for the snapshot retry

34f9444

Rename RetryDuringSnapshotStep -> AsyncRetryDuringSnapshotActionStep

f95b290

Also adds javadocs

Be paranoid about exceptions being thrown and swallowed on accident

ffdc5fd

dakrone added >bug blocker v7.0.0 :Data Management/ILM+SLM Index and Snapshot lifecycle management v6.7.0 labels Jan 18, 2019

dakrone requested review from talevy and gwbrown January 18, 2019 23:32

talevy approved these changes Jan 22, 2019


gwbrown approved these changes Jan 23, 2019


dakrone merged commit 647e225 into elastic:master Jan 23, 2019

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
