Implement exponential backoff retry mechanism for transport tasks (#1837)

JobProcesses have various tasks that they need to execute that require a transport, and these can fail because the command executed over the transport raises an exception. Examples are the submission of a job calculation and the updating of its scheduler state. Such failures do not necessarily mean that the job is unrecoverably lost: the internet connection may be temporarily unavailable, or the scheduler may simply not be responding. Instead of putting the process in an excepted state, the engine should automatically retry at a later stage.

Here we implement the exponential_backoff_retry utility, a coroutine that can wrap another function or coroutine. It tries to run the wrapped callable and reruns it whenever an exception is caught. Once an exception has been caught as many times as the maximum number of allowed attempts, the exception is reraised.

The utility is used in the various transport tasks that are called by the Waiting state of the JobProcess class:

 * task_submit_job: submit the calculation
 * task_update_job: update the scheduler state
 * task_retrieve_job: retrieve the files of the completed calculation
 * task_kill_job: kill the job through the scheduler

These are now wrapped in the exponential_backoff_retry coroutine, which gives the process some leeway when they fail for reasons that often resolve themselves, given time.
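
The retry behaviour described in this commit message can be illustrated with a minimal sketch. It is not the code added by this commit: the commit uses tornado coroutines, whereas the sketch uses the standard-library asyncio so that it runs on its own. The keyword names initial_interval and max_attempts are taken from the tests in this commit; the default values, the doubling factor, and the helper flaky are assumptions made for illustration only.

```python
import asyncio


async def exponential_backoff_retry(fct, initial_interval=10.0, max_attempts=5):
    """Run `fct`, retrying with a growing delay when it raises.

    Illustrative sketch: the defaults and the factor of 2 are assumptions,
    not values confirmed by the commit.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            result = fct()
            # Accept both plain functions and coroutine functions.
            if asyncio.iscoroutine(result):
                result = await result
            return result
        except Exception:
            if attempt == max_attempts:
                # All allowed attempts have failed: re-raise the last exception.
                raise
            await asyncio.sleep(interval)
            interval *= 2  # wait twice as long before the next attempt


# Hypothetical task that fails twice before succeeding.
calls = []


async def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError('temporarily unavailable')
    return 'submitted'


outcome = asyncio.run(
    exponential_backoff_retry(flaky, initial_interval=0.01, max_attempts=4))
```

With max_attempts=4 the two initial failures are absorbed and the third call succeeds. With max_attempts=2 the same task would exhaust its attempts and the RuntimeError would propagate to the caller, which is the situation in which the process ends up in an excepted state.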
aiidateam · Aug 2, 2018 · 5ed5f6e
1 parent db14d57
commit 5ed5f6e
Showing 8 changed files with 565 additions and 447 deletions.
diff --git a/aiida/backends/tests/__init__.py b/aiida/backends/tests/__init__.py
@@ -93,7 +93,7 @@
         'work.run': ['aiida.backends.tests.work.run'],
         'work.runners': ['aiida.backends.tests.work.test_runners'],
         'work.test_transport': ['aiida.backends.tests.work.test_transport'],
-        'work.utils': ['aiida.backends.tests.work.utils'],
+        'work.utils': ['aiida.backends.tests.work.test_utils'],
         'work.work_chain': ['aiida.backends.tests.work.work_chain'],
         'work.workfunctions': ['aiida.backends.tests.work.test_workfunctions'],
         'work.job_processes': ['aiida.backends.tests.work.job_processes'],

diff --git a/aiida/backends/tests/work/test_utils.py b/aiida/backends/tests/work/test_utils.py
@@ -0,0 +1,55 @@
+# -*- coding: utf-8 -*-
+from tornado.ioloop import IOLoop
+from tornado.gen import coroutine
+
+from aiida.backends.testbase import AiidaTestCase
+from aiida.work.utils import exponential_backoff_retry
+
+ITERATION = 0
+MAX_ITERATIONS = 3
+
+
+class TestExponentialBackoffRetry(AiidaTestCase):
+    """Tests for the exponential backoff retry coroutine."""
+
+    @classmethod
+    def setUpClass(cls, *args, **kwargs):
+        """Set up a simple authinfo for later use."""
+        super(TestExponentialBackoffRetry, cls).setUpClass(*args, **kwargs)
+        cls.authinfo = cls.backend.authinfos.create(
+            computer=cls.computer,
+            user=cls.backend.users.get_automatic_user())
+        cls.authinfo.store()
+
+    def test_exponential_backoff_success(self):
+        """Test that exponential backoff will successfully catch exceptions as long as max_attempts is not exceeded."""
+        globals()['ITERATION'] = 0  # reset the module-level counter; a bare assignment would only bind a local
+        loop = IOLoop()
+
+        @coroutine
+        def coro():
+            """A function that will raise RuntimeError as long as ITERATION is smaller than MAX_ITERATIONS."""
+            global ITERATION
+            ITERATION += 1
+            if ITERATION < MAX_ITERATIONS:
+                raise RuntimeError
+
+        max_attempts = MAX_ITERATIONS + 1
+        loop.run_sync(lambda: exponential_backoff_retry(coro, initial_interval=0.1, max_attempts=max_attempts))
+
+    def test_exponential_backoff_max_attempts_exceeded(self):
+        """Test that exponential backoff will re-raise the exception once max_attempts is exhausted."""
+        globals()['ITERATION'] = 0  # reset the module-level counter; a bare assignment would only bind a local
+        loop = IOLoop()
+
+        @coroutine
+        def coro():
+            """A function that will raise RuntimeError as long as ITERATION is smaller than MAX_ITERATIONS."""
+            global ITERATION
+            ITERATION += 1
+            if ITERATION < MAX_ITERATIONS:
+                raise RuntimeError
+
+        max_attempts = MAX_ITERATIONS - 1
+        with self.assertRaises(RuntimeError):
+            loop.run_sync(lambda: exponential_backoff_retry(coro, initial_interval=0.1, max_attempts=max_attempts))