Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an explicit CLEANUP step to migrations v2 #100650

Closed
mshustov opened this issue May 26, 2021 · 6 comments
Closed

Add an explicit CLEANUP step to migrations v2 #100650

mshustov opened this issue May 26, 2021 · 6 comments
Labels
project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@mshustov
Copy link
Contributor

Migrations v2 haven't got an explicit CLEANUP stage. It calls cleanup function when a migration fails due to an exception or a transition to FATAL state.

await cleanup(client, executionLog, finalState);
dumpExecutionLog(logger, logMessagePrefix, executionLog);
return Promise.reject(
new Error(
`Unable to complete saved object migrations for the [${initialState.indexPrefix}] index: ` +
finalState.reason
)
);
} else {
throw new Error('Invalid terminating control state');
}
} catch (e) {
await cleanup(client, executionLog, lastState);

We are going to extend cleanup logic with additional async sub-steps (#100631). Therefore, we should be explicit about CLEANUP stage and make a transition to it explicit in our migration flow.
See #97222 (comment) for some implementation ideas.

@mshustov mshustov added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels May 26, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@TinaHeiligers
Copy link
Contributor

TinaHeiligers commented May 26, 2021

TBH, I’m not completely following the implementation suggestions in #97222 (comment) .
AFAICT, there are three options suggested there (I've rephrased them slightly to how I understand the suggestions):

  1. Add a CLEANUP step before FATAL (i.e replace all controlState transitions from FATAL to CLEANUP in the model). This takes care of having to manually call cleanup in migrations_state_action_machine's try block when the controlState is FATAL. We still haven't handled the need to manually execute cleanup and add the cleanup type to the execution log in the migrations_state_action_machine’s catch block for unhandled exceptions.
  2. Change the behavior of stateActionMachine itself to catch errors and force a transition to FATAL when errors are thrown (not ideal because we’re splitting up the control flow handling here into more places and we'd introduce more complexity with error handling in stateActionMachine)
  3. Implement an action that throws when the controlState is FATAL rather than stopping the state machine in next. However, we'd still need to call cleanup in migrations_state_action_machine.

None of these are a silver bullet and we’d still be stuck with having the cleanup logic in migrations_state_action_machine somewhere (either within the try block when we encounter a FATAL controlState or in the catch block on handling previously unhandled exceptions).

If we’re trying to handle cleanup within the model itself and remove the explicit call to cleanup in migrations_state_action_machine then how does this sound:

  1. Implement a CLEANUP control state that ultimately returns an instance of Either.right with an appropriate response constant (TaskEither.right('cleanup_complete' as const)) that we can watch for and act on to change the controlState to FATAL (for previous conditions where we handle cleanup within the try block in migrations_state_action_machine. That then throws an unknown control state as before and returns the rejected Promise as usual. The action that returns that response might implement some sort of flow with task chaining if needed.
  2. Somehow also weave in a CLEANUP control state before throwing a bad response from within the model to replace the call to cleanup in migrations_state_action_machine's catch block.
    I'm not sure of the approach here yet, maybe it's just a different controlState (BADRESPONSE?) that executes the same action(s) as CLEANUP and eventually throws the original badResponse error?

The trouble here is we need to build the CLEANUP phase into the model in both the FATAL and badResponse cases in order to remove it from the migrations_state_action_machine. ATM, all the cleanup phase does is close any open pit but I'm not sure how flexible we need to make it.

@mshustov
Copy link
Contributor Author

  1. Change the behavior of stateActionMachine itself to catch errors and force a transition to FATAL when errors are thrown (not ideal because we’re splitting up the control flow handling here into more places and we'd introduce more complexity with error handling in stateActionMachine)

not ideal, indeed, but it catches exceptions thrown during action execution as well https://github.com/elastic/kibana/blob/master/src/core/server/saved_objects/migrationsv2/state_action_machine.ts#L70

  1. Implement an action that throws when the controlState is FATAL rather than stopping the state machine in next. However, we'd still need to call cleanup in migrations_state_action_machine

I'm not a fan of using exceptions for control flow purposes. It's only a matter of time before we run into their limitations. And as you said, we still have to keep cleanup in migrations_state_action_machine.

to replace the call to cleanup in migrations_state_action_machine's catch block.

We will need to call cleanup for errors happening during action executions. Won't we?

I start thinking that the current implementation is okay-ish and it's not worth refactoring it until we have to. @pgayvallet

@pgayvallet
Copy link
Contributor

What I would have tried to do is to move the error handling logic one level lower, in the stateActionMachine implementation itself, to natively try/catch the execution flow and to re-dispatch to fatal/cleanup execution when an error occurs during an action or during a model changes.

The main goal would be to remove the duplication we got in migrationStateActionMachine between the finalState.controlState === 'FATAL' block and the catch block.

However, after thinking about the implications, I agree that this is not bullet-proof, as if any error leads to an action/state redispatch, we could just fall into a deadloop if for some reason FATAL and/or CLEANUP throws an error (that would loop FATAL->FATAL indefinitely) and would need additional state logic to ensure we're not going that (this also means we would still need a global catch-all block higher, which just breaks the whole goal of the refactoring). Also the implementation itself would probably more complex than what we got atm.

I start thinking that the current implementation is okay-ish

Yea, seems like trying to improve that part is not worth the cost. The duplication is only a few lines, and is contained inside the same file, so it's not that significant.

@TinaHeiligers
Copy link
Contributor

TinaHeiligers commented May 28, 2021

I start thinking that the current implementation is okay-ish and it's not worth refactoring it until we have to.

Um, ok, This is worth discussing further as a team then.

I've gotten fairly far with the approach I outline above, intercepting all transitions to FATAL with a CLEANUP_FATAL transition and all calls to throwBadResponse with CLEANUP_BAD_RESPONSE for state transitions where we have a pit (or at least expect one to exist). For the latter refactor, it's not a blanket change of all the calls to throwBadResponse because the only cleanup action we do right now is close the pit.

It's not that much work and I could finish it up, understanding that it might be thrown away.

FWIW, my approach should have the same effect as moving the error handling to stateActionMachine that returns the finalState

Please let me know what you think, otherwise we're wasting time here.
For now, I'll bench this and we can revisit later if we need to handle more cleanup steps.
cc @joshdover

@TinaHeiligers
Copy link
Contributor

The additional cleanup step that was going to be added cannot be done safely and for the reasons mentioned above, there isn't a need to add an explicit CLEANUP step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc
Projects
None yet
4 participants