Implement exponential backoff retry mechanism for transport tasks #1837

sphuber · 2018-08-01T17:40:56Z

JobProcesses have various tasks they need to execute that require
a transport, and these can fail when the command executed over the
transport raises an exception. Examples are the submission of a job
calculation and the updating of its scheduler state. Such failures
do not necessarily mean that the job is irrecoverably lost: the
internet connection may be temporarily unavailable, or the scheduler
may simply not be responding. Instead of putting the process in an
excepted state, the engine should automatically retry at a later
stage.

Here we implement the exponential_backoff_retry utility, a coroutine
that wraps another function or coroutine, runs it, and reruns it
whenever an exception is caught. Once an exception has been caught
as many times as the maximum number of allowed attempts, it is
re-raised.
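
For illustration, the behaviour described above could be sketched roughly as follows. This is not the code from this PR: it uses asyncio, and the parameter names (`initial_interval`, `max_attempts`), their default values, and the doubling factor are all assumptions made for the example.

```python
import asyncio
import logging

LOGGER = logging.getLogger(__name__)


async def exponential_backoff_retry(fct, initial_interval=10.0, max_attempts=5):
    """Run ``fct``; on an exception, wait and retry with a doubling delay.

    Hypothetical signature: the parameters and defaults are assumptions.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            result = fct()
            # Accept plain functions as well as coroutine functions.
            if asyncio.iscoroutine(result):
                result = await result
            return result
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the last exception propagate.
                LOGGER.warning('maximum attempts %d exceeded', max_attempts)
                raise
            LOGGER.info('attempt %d failed, retrying in %s s', attempt, interval)
            await asyncio.sleep(interval)
            interval *= 2  # assumed backoff factor of two


async def demo():
    """Wrap a hypothetical task that fails twice before succeeding."""
    calls = {'count': 0}

    def flaky_task():
        calls['count'] += 1
        if calls['count'] < 3:
            raise IOError('transport temporarily unavailable')
        return 'submitted'

    result = await exponential_backoff_retry(
        flaky_task, initial_interval=0.01, max_attempts=5)
    return result, calls['count']


print(asyncio.run(demo()))  # ('submitted', 3)
```

In this sketch a task that keeps failing surfaces its last exception only after `max_attempts` tries, which matches the re-raise behaviour described above.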

This is implemented in the various transport tasks that are called
by the Waiting state of the JobProcess class:

* task_submit_job: submit the calculation
* task_update_job: update the scheduler state
* task_retrieve_job: retrieve the files of the completed calc
* task_kill_job: kill the job through the scheduler

These are now wrapped in the exponential_backoff_retry coroutine,
which gives the process some leeway when they fail for reasons
that often resolve themselves given time.
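
To get a feel for how much leeway this gives, the waits between attempts can be tabulated. The sketch below assumes the interval doubles after every failure, and it uses an initial interval of 20 seconds and a maximum of 5 attempts purely as illustrative values; neither the factor nor these numbers are stated in this description.

```python
def backoff_schedule(initial_interval, max_attempts):
    """Return the waits (in seconds) between consecutive attempts.

    Hypothetical helper: assumes the wait doubles after each failure.
    """
    # There is one wait after every failed attempt except the last,
    # because the final failure re-raises instead of sleeping.
    return [initial_interval * 2 ** i for i in range(max_attempts - 1)]


waits = backoff_schedule(initial_interval=20, max_attempts=5)
print(waits)       # [20, 40, 80, 160]
print(sum(waits))  # 300
```

Under these assumed values, a transport task would have five minutes in total for a dropped connection or an unresponsive scheduler to recover before the exception is re-raised.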

codecov-io · 2018-08-02T08:42:04Z

Codecov Report

Merging #1837 into develop will increase coverage by 0.03%.
The diff coverage is 10%.

@@             Coverage Diff             @@
##           develop    #1837      +/-   ##
===========================================
+ Coverage    66.69%   66.73%   +0.03%     
===========================================
  Files          317      317              
  Lines        32407    32406       -1     
===========================================
+ Hits         21613    21625      +12     
+ Misses       10794    10781      -13

| Impacted Files | Coverage Δ | |
|---|---|---|
| aiida/transport/plugins/local.py | `81.21% <100%> (ø)` | ⬆️ |
| aiida/orm/implementation/sqlalchemy/group.py | `87.62% <100%> (+0.06%)` | ⬆️ |
| aiida/daemon/execmanager.py | `8.6% <5.26%> (+0.88%)` | ⬆️ |
| aiida/backends/djsite/db/models.py | `76.23% <0%> (+0.88%)` | ⬆️ |
| aiida/backends/djsite/globalsettings.py | `86.84% <0%> (+5.26%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db14d57...3bd18b4.

muhrin

Very nice!

sphuber requested a review from muhrin August 1, 2018 17:40

sphuber force-pushed the fix_1834_exponential_backoff_retry_transport_task branch from 2a0713e to 6ac3fb5 Compare August 2, 2018 08:26

sphuber force-pushed the fix_1834_exponential_backoff_retry_transport_task branch from 6ac3fb5 to 3bd18b4 Compare August 2, 2018 09:52

muhrin approved these changes Aug 2, 2018

View reviewed changes

sphuber merged commit 5ed5f6e into aiidateam:develop Aug 2, 2018

sphuber deleted the fix_1834_exponential_backoff_retry_transport_task branch August 2, 2018 12:09
