Pause JobProcess when transport task falls through exponential backoff #1903

@sphuber sphuber commented Aug 23, 2018

Fixes #1835

All transport tasks for the JobProcess are wrapped in the exponential
backoff retry coroutine utility, which reschedules the task with an
exponentially increasing delay whenever an exception occurs during the
transport task. However, the backoff has a maximum number of retries. When
that limit was hit, the exception would bubble up and cause the process to
except. With the new pausing functionality in place, we can instead catch
the TransportTaskException and pause the process. The user then has the
chance to investigate the logs to determine the problem. If the problem was
only temporary, the user can resume the process. If instead the failure is
unrecoverable, the user can always decide to kill the process.
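
For illustration only, here is a minimal sketch of the behaviour described above. It is not the code in this PR: it uses asyncio instead of the coroutine utility in the code base, and `ToyProcess`, `run_transport_task`, the `exponential_backoff_retry` signature and the default intervals are all hypothetical stand-ins. Only the name TransportTaskException is taken from the description.

```python
import asyncio


class TransportTaskException(Exception):
    """Stand-in for the exception named in the PR description."""


async def exponential_backoff_retry(task, initial_interval=0.01, max_attempts=3):
    """Await `task()` up to `max_attempts` times, doubling the wait after each failure.

    Hypothetical signature; the real utility may differ.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await task()
        except Exception as exc:
            if attempt == max_attempts:
                # Retries exhausted: wrap the last error and hand it to the caller.
                raise TransportTaskException(str(exc)) from exc
            await asyncio.sleep(interval)
            interval *= 2


class ToyProcess:
    """Hypothetical process holding only the state needed for this sketch."""

    def __init__(self):
        self.state = 'running'
        self.status = None

    def pause(self, msg):
        self.state = 'paused'
        self.status = msg

    async def run_transport_task(self, task):
        try:
            return await exponential_backoff_retry(task)
        except TransportTaskException as exc:
            # Pause instead of letting the exception end the process,
            # so the user can inspect the logs and then resume or kill.
            self.pause('transport task failed: {}'.format(exc))
            return None
```

With a task that always raises, `asyncio.run(ToyProcess().run_transport_task(task))` returns `None` and leaves the process in the `'paused'` state rather than propagating the exception.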

@sphuber sphuber requested a review from muhrin August 23, 2018 14:18
@sphuber sphuber force-pushed the fix_1835_pause_job_process_excepted_transport_task branch from 39d6703 to 03c56fb Compare August 23, 2018 14:19

@muhrin muhrin left a comment

Sound as a (pre-Brexit) pound

@sphuber sphuber force-pushed the fix_1835_pause_job_process_excepted_transport_task branch from 03c56fb to 10ffdd6 Compare August 23, 2018 14:33
@sphuber sphuber merged commit 57c668b into aiidateam:develop Aug 23, 2018
@sphuber sphuber deleted the fix_1835_pause_job_process_excepted_transport_task branch August 23, 2018 14:55