Reliability - Critical - Configurable timeout for TransferProcess #416

Closed
marcgs opened this issue Dec 14, 2021 · 4 comments

@marcgs
Contributor

marcgs commented Dec 14, 2021

TransferProcesses might get stuck in a state for a long time, or even indefinitely, when errors occur. The TransferProcessManager should monitor this situation and, after a configurable threshold, react by moving the process to an ERROR state, effectively taking the TransferProcess out of the state machine processing loop.

The threshold should be configurable per TransferProcess, as appropriate thresholds may differ vastly depending on the nature of the transfer itself (moving a few KB versus several GB). A sensible default value should be used when no configuration is available for a given TransferProcess.
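
To make the proposal concrete, here is a minimal sketch of a per-process threshold with a fallback default. Everything in it is an assumption for illustration only: `TransferProcessSketch`, `TimeoutCheck`, the `timeout` field, and the 30-minute default are hypothetical names and values, not the project's actual classes or configuration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical, simplified stand-in for a TransferProcess (not the real class).
final class TransferProcessSketch {
    enum State { IN_PROGRESS, COMPLETED, ERROR }

    final String id;
    final Instant stateTimestamp;   // when the current state was entered
    final Duration timeout;         // per-process threshold; null means "not configured"
    State state;

    TransferProcessSketch(String id, State state, Instant stateTimestamp, Duration timeout) {
        this.id = id;
        this.state = state;
        this.stateTimestamp = stateTimestamp;
        this.timeout = timeout;
    }
}

// Hypothetical check the manager could run for each process it visits.
final class TimeoutCheck {
    // Assumed default; the issue does not name a value.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(30);

    // Per-process threshold if configured, otherwise the default.
    static Duration effectiveTimeout(TransferProcessSketch process) {
        return process.timeout != null ? process.timeout : DEFAULT_TIMEOUT;
    }

    // Moves a non-final process to ERROR once it has exceeded its threshold.
    // Returns true if the process was moved.
    static boolean expireIfStuck(TransferProcessSketch process, Instant now) {
        boolean isFinal = process.state == TransferProcessSketch.State.COMPLETED
                || process.state == TransferProcessSketch.State.ERROR;
        if (isFinal) {
            return false;
        }
        Duration elapsed = Duration.between(process.stateTimestamp, now);
        if (elapsed.compareTo(effectiveTimeout(process)) > 0) {
            process.state = TransferProcessSketch.State.ERROR;
            return true;
        }
        return false;
    }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the check, keeps the sketch easy to test; a real implementation would likely inject a clock instead.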

@marcgs
Contributor Author

marcgs commented Dec 14, 2021

For reference, here we implemented a watchdog process that cancels long-running TransferProcesses. It is probably easier, though, to solve this issue directly in the main state machine loop.

@ndr-brt
Member

ndr-brt commented Jan 20, 2022

@juliapampus this could be the issue that makes use of the stateCount field.

After an error on a state transition, we should check how many times that has happened and, above a certain threshold (5? 10? configurable?), cancel the TransferProcess.
The other option is to add another step to the main loop that looks for TransferProcesses whose stateCount exceeds the threshold and cancels them.
I'm not sure about the latter approach: I feel we're overloading that loop, and this is degrading performance.

This behavior should also apply to the other state machines (provider/consumer contract negotiation).
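
As a rough illustration of the first option above (checking the count right after a failed transition), a sketch could look like the following. The names `RetryLimitSketch`, `Process`, `onTransitionFailure`, and the default of 5 are hypothetical, and the sketch assumes that stateCount is incremented on each failed attempt in the same state; the real field may be maintained differently. ERROR is used as the terminal state, following the issue description.

```java
// Hypothetical sketch; none of these types are the project's real classes.
final class RetryLimitSketch {
    enum State { PROVISIONING, IN_PROGRESS, ERROR }

    static final class Process {
        final String id;
        State state;
        int stateCount; // assumed: number of failed attempts in the current state

        Process(String id, State state) {
            this.id = id;
            this.state = state;
            this.stateCount = 0;
        }
    }

    // Assumed default; the thread only floats "5? 10? configurable?".
    static final int DEFAULT_MAX_STATE_COUNT = 5;

    // Called after a transition attempt fails: bump the counter and, once the
    // threshold is reached, take the process out of the loop via ERROR.
    static void onTransitionFailure(Process process, int maxStateCount) {
        process.stateCount++;
        if (process.stateCount >= maxStateCount) {
            process.state = State.ERROR;
        }
    }
}
```

Because the check runs only where a failure has just occurred, it adds no extra scan to the main loop, which is the performance concern raised about the second option.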

@juliapampus juliapampus modified the milestones: Milestone 2, Milestone Scoping Feb 23, 2022
@juliapampus juliapampus modified the milestones: Milestone Scoping, Milestone 2 Mar 16, 2022
@juliapampus
Contributor

Seems to be closed by #710.

@ndr-brt
Member

ndr-brt commented Mar 16, 2022

@juliapampus probably not, but this issue will become obsolete once the 2-state transitions are applied to all the state machines (#831, #870): there will be no more "stale states", as every state (apart from "final" states) will have its own "processor" in the state machine.
So I'm OK with closing this anyway.
