Retry ILM steps that fail due to SnapshotInProgressException #37624

Merged (22 commits) on Jan 23, 2019

Conversation

@dakrone dakrone commented Jan 18, 2019

Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.

This change adds an abstract step, AsyncRetryDuringSnapshotActionStep, that such steps can extend to handle a running snapshot automatically. When the listener wrapper receives a SnapshotInProgressException, it registers a ClusterStateObserver listener that waits until the snapshot has completed and re-runs the ILM action once no snapshot is in progress.
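
As a rough illustration of the retry pattern described above, here is a self-contained sketch. It does not use the real Elasticsearch classes: `ToyCluster`, `StepListener`, `SnapshotRunningException`, `RetryDuringSnapshotStep`, and `ToyDeleteStep` are hypothetical stand-ins for the cluster state observer, the action listener, `SnapshotInProgressException`, the abstract step, and a concrete step. The actual implementation in this PR may differ in structure and detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SnapshotInProgressException.
class SnapshotRunningException extends RuntimeException {
    SnapshotRunningException(String message) { super(message); }
}

// Hypothetical, heavily simplified cluster: tracks whether a snapshot is
// running and notifies one-shot waiters when it finishes.
class ToyCluster {
    private boolean snapshotRunning;
    private final List<Runnable> waiters = new ArrayList<>();

    ToyCluster(boolean snapshotRunning) { this.snapshotRunning = snapshotRunning; }

    boolean isSnapshotRunning() { return snapshotRunning; }

    // Run the callback once no snapshot is running (immediately if none is).
    void whenNoSnapshot(Runnable callback) {
        if (snapshotRunning) {
            waiters.add(callback);
        } else {
            callback.run();
        }
    }

    void finishSnapshot() {
        snapshotRunning = false;
        List<Runnable> toRun = new ArrayList<>(waiters);
        waiters.clear();
        toRun.forEach(Runnable::run);
    }
}

// Hypothetical listener, loosely modelled on an async step listener.
interface StepListener {
    void onResponse(boolean complete);
    void onFailure(Exception e);
}

abstract class RetryDuringSnapshotStep {
    // Subclasses implement the real action (delete, close, freeze, ...).
    abstract void performDuringNoSnapshot(ToyCluster cluster, StepListener listener);

    final void performAction(ToyCluster cluster, StepListener original) {
        performDuringNoSnapshot(cluster, new StepListener() {
            @Override
            public void onResponse(boolean complete) { original.onResponse(complete); }

            @Override
            public void onFailure(Exception e) {
                if (e instanceof SnapshotRunningException) {
                    // Do not report an error; wait for the snapshot, then retry.
                    cluster.whenNoSnapshot(() -> performAction(cluster, original));
                } else {
                    original.onFailure(e);
                }
            }
        });
    }
}

// Toy concrete step that fails while a snapshot is running.
class ToyDeleteStep extends RetryDuringSnapshotStep {
    int attempts = 0;

    @Override
    void performDuringNoSnapshot(ToyCluster cluster, StepListener listener) {
        attempts++;
        if (cluster.isSnapshotRunning()) {
            listener.onFailure(new SnapshotRunningException("snapshot in progress"));
        } else {
            listener.onResponse(true);
        }
    }
}
```

In this sketch, calling `performAction` while a snapshot is running makes the first attempt fail without notifying the original listener; calling `finishSnapshot()` then triggers a second attempt, which succeeds and reaches the original listener.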

This also adds integration tests for these scenarios (thanks to @talevy in #37552).

Resolves #37541

@dakrone dakrone added >bug blocker v7.0.0 :Data Management/ILM+SLM Index and Snapshot lifecycle management v6.7.0 labels Jan 18, 2019
@dakrone dakrone requested review from talevy and gwbrown January 18, 2019 23:32
@elasticmachine

Pinging @elastic/es-core-features


@talevy talevy left a comment

LGTM

I discussed this outside of GitHub with @dakrone, and we agreed
that unit tests for AsyncRetryDuringSnapshotActionStep's
SnapshotExceptionListener and NoSnapshotRunningListener
would only cover some non-integral branches of the logic for
retrying actions via the cluster state observer. We are confident
that the existing integration tests in this PR cover the successful
retry, which is the critical path, so they are sufficient
for verifying that these changes do what they intend.

@@ -114,4 +109,30 @@ public void onFailure(Exception e) {
Mockito.verify(indicesClient).close(Mockito.any(), Mockito.any());
Mockito.verifyNoMoreInteractions(indicesClient);
}

@Override
protected CloseFollowerIndexStep createRandomInstance() {

👍

performAction(idxMeta, state, observer, originalListener);
}, originalListener::onFailure),
// TODO: what is a good timeout value for no new state received during this time?
TimeValue.timeValueHours(12));

I think waiting 12 hours for a snapshot to finish is reasonable. If there is no progress on this action in that time, a user may want to know, so 👍
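
To make the timeout behaviour concrete, here is a minimal sketch of "wait for the snapshot, but give up after a bound". It is an illustration only: `SnapshotWait` and `awaitThenRetry` are hypothetical names, the sketch blocks the calling thread with the JDK's `CompletableFuture.orTimeout` (Java 9+) instead of registering a `ClusterStateObserver` listener, and what the real step does on timeout is an assumption here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Elasticsearch.
class SnapshotWait {
    // Returns "retried" if snapshotFinished completes within the timeout,
    // or "timed out" if it does not (assumed: the failure is then surfaced).
    static String awaitThenRetry(CompletableFuture<Void> snapshotFinished,
                                 long timeout, TimeUnit unit) {
        try {
            // orTimeout completes the future exceptionally with a
            // TimeoutException if it is still pending after the timeout.
            snapshotFinished.orTimeout(timeout, unit).join();
            return "retried";
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return "timed out";
            }
            throw e;
        }
    }
}
```

With a 12-hour bound, the call would look like `SnapshotWait.awaitThenRetry(future, 12, TimeUnit.HOURS)`.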

@talevy

talevy commented Jan 22, 2019

backport to 6.x is blocked on #37723 (SnapshotInProgressException)

@talevy

talevy commented Jan 23, 2019

update: above blocker PR for 6.x was merged


@gwbrown gwbrown left a comment

LGTM

@dakrone dakrone merged commit 647e225 into elastic:master Jan 23, 2019
dakrone added a commit that referenced this pull request Jan 23, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 23, 2019
* master:
  Liberalize StreamOutput#writeStringList (elastic#37768)
  Add PersistentTasksClusterService::unassignPersistentTask method (elastic#37576)
  Tests: disable testRandomGeoCollectionQuery on tiny polygons (elastic#37579)
  Use ILM for Watcher history deletion (elastic#37443)
  Make sure PutMappingRequest accepts content types other than JSON. (elastic#37720)
  Retry ILM steps that fail due to SnapshotInProgressException (elastic#37624)
  Use disassociate in preference to deassociate (elastic#37704)
  Delete Redundant RoutingServiceTests (elastic#37750)
  Always return metadata version if metadata is requested (elastic#37674)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 24, 2019
* elastic/master: (85 commits)
  Use explicit version for build-tools in example plugin integ tests (elastic#37792)
  Change `rational` to `saturation` in script_score (elastic#37766)
  Deprecate types in get field mapping API (elastic#37667)
  Add ability to listen to group of affix settings (elastic#37679)
  Ensure changes requests return the latest mapping version (elastic#37633)
  Make Minio Setup more Reliable (elastic#37747)
  Liberalize StreamOutput#writeStringList (elastic#37768)
  Add PersistentTasksClusterService::unassignPersistentTask method (elastic#37576)
  Tests: disable testRandomGeoCollectionQuery on tiny polygons (elastic#37579)
  Use ILM for Watcher history deletion (elastic#37443)
  Make sure PutMappingRequest accepts content types other than JSON. (elastic#37720)
  Retry ILM steps that fail due to SnapshotInProgressException (elastic#37624)
  Use disassociate in preference to deassociate (elastic#37704)
  Delete Redundant RoutingServiceTests (elastic#37750)
  Always return metadata version if metadata is requested (elastic#37674)
  [TEST] Mute MlMappingsUpgradeIT testMappingsUpgrade
  Streamline skip_unavailable handling (elastic#37672)
  Only bootstrap and elect node in current voting configuration (elastic#37712)
  Ensure either success or failure path for SearchOperationListener is called (elastic#37467)
  Target only specific index in update settings test
  ...
Successfully merging this pull request may close these issues.

ILM Actions can fail due to in-progress snapshots of indices
5 participants