Retry ILM steps that fail due to SnapshotInProgressException #37624
…y-after-snapshot-fail
Also adds javadocs
Pinging @elastic/es-core-features
LGTM
I discussed this outside of GitHub with @dakrone, and we agreed that unit tests for `AsyncRetryDuringSnapshotActionStep`'s `SnapshotExceptionListener` and `NoSnapshotRunningListener` would only cover some non-integral branches of the logic for retrying actions via the cluster-state observer. The existing integration tests in this PR cover the successful retry, which is the critical path, so they are sufficient for verifying that these changes do what they intend.
@@ -114,4 +109,30 @@ public void onFailure(Exception e) {
        Mockito.verify(indicesClient).close(Mockito.any(), Mockito.any());
        Mockito.verifyNoMoreInteractions(indicesClient);
    }

    @Override
    protected CloseFollowerIndexStep createRandomInstance() {
👍
                performAction(idxMeta, state, observer, originalListener);
            }, originalListener::onFailure),
            // TODO: what is a good timeout value for no new state received during this time?
            TimeValue.timeValueHours(12));
I think waiting 12 hours for a snapshot to finish is reasonable. If there is no progress on this action in that time interval, a user may want to know, so 👍
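To illustrate the trade-off discussed here, the following is a minimal sketch of a bounded wait, not the code in this PR. It blocks on a `CompletableFuture` for simplicity, whereas the PR waits on cluster state changes through an observer. The class `SnapshotWaitTimeoutSketch`, the `Outcome` enum, and `awaitSnapshotEnd` are hypothetical names introduced only for this example.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: all names here are assumptions, not Elasticsearch APIs.
final class SnapshotWaitTimeoutSketch {

    enum Outcome { RETRY, TIMED_OUT, FAILED }

    /**
     * Waits for a "snapshot finished" signal for at most the given timeout.
     * RETRY means the step can be attempted again; TIMED_OUT means nothing
     * happened within the window, so the failure should be surfaced to the user.
     */
    static Outcome awaitSnapshotEnd(CompletableFuture<Void> snapshotDone, Duration timeout) {
        try {
            snapshotDone.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return Outcome.RETRY;
        } catch (TimeoutException e) {
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            // The signal itself completed exceptionally.
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the caller.
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }
}
```

With a 12-hour window, a call would look like `awaitSnapshotEnd(signal, Duration.ofHours(12))`; a shorter window reports stalls sooner but risks giving up on long-running snapshots.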
Backport to 6.x is blocked on #37723 (SnapshotInProgressException).
Update: the blocker PR for 6.x above was merged.
LGTM
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.

This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that such steps can extend to automatically handle the case where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.

This also adds integration tests for these scenarios (thanks to @talevy in #37552).

Resolves #37541
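The retry flow described above can be sketched in plain Java roughly as follows. This is an illustration under stated assumptions, not the implementation in this PR: `ToyCluster`, `SnapshotRunningException`, `Listener`, `RetryDuringSnapshotStep`, `attempt`, and `CloseIndexStep` are hypothetical stand-ins and do not use any Elasticsearch classes. Only the name `performAction` and the idea of wrapping the original listener come from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: every type here is a stand-in, not an Elasticsearch class.
final class RetryDuringSnapshotSketch {

    /** Stand-in for the failure raised while an index is being snapshotted. */
    static final class SnapshotRunningException extends RuntimeException {
        SnapshotRunningException(String index) {
            super("snapshot in progress for [" + index + "]");
        }
    }

    /** Minimal async callback. */
    interface Listener {
        void onResponse(boolean complete);
        void onFailure(Exception e);
    }

    /** Toy cluster: tracks a running snapshot and callbacks waiting for it to end. */
    static final class ToyCluster {
        private boolean snapshotRunning;
        private final List<Runnable> waiters = new ArrayList<>();

        ToyCluster(boolean snapshotRunning) { this.snapshotRunning = snapshotRunning; }

        boolean isSnapshotRunning() { return snapshotRunning; }

        /** Runs the callback now if no snapshot is running, otherwise once it ends. */
        void whenNoSnapshotRunning(Runnable callback) {
            if (snapshotRunning) waiters.add(callback); else callback.run();
        }

        void finishSnapshot() {
            snapshotRunning = false;
            List<Runnable> ready = new ArrayList<>(waiters);
            waiters.clear();
            ready.forEach(Runnable::run);
        }
    }

    /** Base step: subclasses implement attempt(); the snapshot retry lives here. */
    abstract static class RetryDuringSnapshotStep {
        abstract void attempt(String index, ToyCluster cluster, Listener listener);

        final void performAction(String index, ToyCluster cluster, Listener original) {
            // Wrap the caller's listener so snapshot failures trigger a retry
            // instead of being reported as an error.
            attempt(index, cluster, new Listener() {
                public void onResponse(boolean complete) { original.onResponse(complete); }
                public void onFailure(Exception e) {
                    if (e instanceof SnapshotRunningException) {
                        cluster.whenNoSnapshotRunning(() -> performAction(index, cluster, original));
                    } else {
                        original.onFailure(e);
                    }
                }
            });
        }
    }

    /** Example step that fails while a snapshot is running. */
    static final class CloseIndexStep extends RetryDuringSnapshotStep {
        int attempts = 0;

        void attempt(String index, ToyCluster cluster, Listener listener) {
            attempts++;
            if (cluster.isSnapshotRunning()) {
                listener.onFailure(new SnapshotRunningException(index));
            } else {
                listener.onResponse(true);
            }
        }
    }

    /** Runs one step against a cluster with a snapshot in flight and records events. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        ToyCluster cluster = new ToyCluster(true);
        CloseIndexStep step = new CloseIndexStep();
        step.performAction("logs-000001", cluster, new Listener() {
            public void onResponse(boolean complete) { events.add("completed=" + complete); }
            public void onFailure(Exception e) { events.add("error: " + e.getMessage()); }
        });
        events.add("attempts before snapshot ends: " + step.attempts);
        cluster.finishSnapshot();
        events.add("attempts after snapshot ends: " + step.attempts);
        return events;
    }
}
```

In this sketch the caller's listener never sees the snapshot failure: the first attempt fails, the step waits, and the second attempt succeeds once `finishSnapshot()` is called. Any other exception is passed straight through to the original listener.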