Restarting the daemon excepts all jobs (aiida-core 1.6, python 3.7) #4648
thanks for the report! I encounter the timeout problem but not the processes excepting. In your case, does the issue reproduce if you swap steps 1 and 2, i.e. submit the process after the daemon has started? I have a feeling this issue is related to the code that closes all running tasks when calling daemon stop. I'll check it next week.
On Sat, Jan 9, 2021 at 23:31 Giovanni Pizzi <notifications@github.com> wrote:
Steps to reproduce
Steps to reproduce the behavior:
1. Submit a job to the daemon
2. Start the daemon, if not already running
3. Stop or restart the daemon
Expected behavior
The daemon stops in a reasonably short time, and the job is "frozen" and will safely continue when the daemon restarts.
Actual problematic behaviour
Instead, when stopping and/or restarting the daemon, I get a TIMEOUT.
Then, I get things like this from verdi process list:
PK Created Process label Process State Process status
---- --------- --------------- --------------- -------------------------------------
184 20h ago PwCalculation ⨯ Excepted Transport task update was interrupted
189 1h ago PwCalculation ⨯ Excepted Transport task submit was interrupted
and verdi process report shows this:
$ verdi process report 184
*** 184: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 3 LOG MESSAGES:
+-> ERROR at 2021-01-08 18:53:56.557646+00:00
| Traceback (most recent call last):
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 171, in exponential_backoff_retry
| result = await coro()
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 177, in do_update
| job_info = await cancellable.with_interrupt(update_request)
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 87, in with_interrupt
| result = await next(wait_iter)
| File "/usr/lib/python3.7/asyncio/tasks.py", line 560, in _wait_for_one
| return f.result() # May raise f.exception().
| concurrent.futures._base.CancelledError
+-> REPORT at 2021-01-08 18:53:36.744091+00:00
| [184|PwCalculation|on_except]: Traceback (most recent call last):
| File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1072, in step
| next_state = await self._run_task(self._state.execute)
| File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 498, in _run_task
| result = await coro(*args, **kwargs)
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 358, in execute
| job_done = await self._launch_task(task_update_job, node, self.process.runner.job_manager)
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 394, in _launch_task
| result = await self._task
| File "/usr/lib/python3.7/asyncio/tasks.py", line 318, in __wakeup
| future.result()
| concurrent.futures._base.CancelledError
+-> ERROR at 2021-01-08 18:53:36.535279+00:00
| Traceback (most recent call last):
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 171, in exponential_backoff_retry
| result = await coro()
| File "/home/pizzi/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 177, in do_update
| job_info = await cancellable.with_interrupt(update_request)
| File "/home/pizzi/git/aiida-core/aiida/engine/utils.py", line 87, in with_interrupt
| result = await next(wait_iter)
| File "/usr/lib/python3.7/asyncio/tasks.py", line 556, in _wait_for_one
| f = await done.get()
| File "/usr/lib/python3.7/asyncio/queues.py", line 159, in get
| await getter
| File "/usr/lib/python3.7/asyncio/futures.py", line 263, in __await__
| yield self # This tells Task to wait for completion.
| File "/usr/lib/python3.7/asyncio/tasks.py", line 318, in __wakeup
| future.result()
| File "/usr/lib/python3.7/asyncio/futures.py", line 176, in result
| raise CancelledError
| concurrent.futures._base.CancelledError
Further notes
I think this is actually the main problem that also manifested itself in #4595 and #4345
I think that many interruptions that should be "safe" are instead considered exceptions (here in the specific case of stopping the daemon, but also in other cases like SSH errors like maybe in #4345).
@muhrin @sphuber @unkcpz @chrisjsewell
|
Hi @unkcpz - yes, in my case the calculation also crashed if the daemon was already running. If you start the daemon after submitting, instead, it picks up the job correctly. |
hi @giovannipizzi I cannot reproduce this with a simpler example like:
I think it is all the same, right? Moreover, I have run my SSSP workflow on the latest develop branch which also have |
Could you maybe try with an SSH computer/transport, where things are slower and there are non-zero safe intervals etc.? Maybe on a local transport these are all set to zero, which is why the error does not show up there. In the meantime, I'll probably need to retry with the most recent version of |
I've tried to reproduce this error as well, but don't seem to run into any issues. However, trying to restart the daemon, the operation times out. Here are the steps I execute:
$ verdi process list
PK Created Process label Process State Process status
---- --------- ---------------- --------------- ---------------------------------------
1775 11s ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1778
1778 10s ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1783
1783 9s ago PwCalculation ⏵ Waiting Waiting for transport task: upload
$ verdi daemon restart --reset
Profile: dev2
Waiting for the daemon to shut down... TIMEOUT
Starting the daemon... TIMEOUT
So here the connection to the daemon times out. However, it is stopped:
$ verdi status
✔ config dir: /home/mbercx/.virtualenvs/aiida-dev/.aiida
✔ profile: On profile dev2
✔ repository: /home/mbercx/.virtualenvs/aiida-dev/.aiida/repository/dev2
✔ postgres: Connected as mbercx@localhost:5432
✔ rabbitmq: Connected as amqp://guest:guest@127.0.0.1:5672?heartbeat=600
⏺ daemon: The daemon is not running
At this point the daemon can be started again:
$ verdi daemon start
Starting the daemon... RUNNING
After waiting a bit, the daemon picks up the processes again:
$ verdi process list
PK Created Process label Process State Process status
---- --------- ---------------- --------------- ---------------------------------------
1775 13m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1778
1778 13m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1783
1783 13m ago PwCalculation ⏵ Waiting Monitoring scheduler: job state RUNNING
The log also looks fine:
$ verdi process report 1783
*** 1783: CalcJobState.WITHSCHEDULER, scheduler state: JobState.RUNNING
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 0 LOG MESSAGES
EDIT: I tried to reproduce the issue in Quantum Mobile, but no timeout there. |
@mbercx thanks for the report! I have the same timeout issue, but I just cannot be sure the issue comes from the new version, so thanks for comparing and checking!
It doesn't seem necessary for now, I am working on it. I think they have the same root cause; let's see if solving this one will fix the other ;) |
@ramirezfranciscof was having similar issues, so he opened #4667 where you can find some more information on the daemon log (the error there is the same for both of us). If the problems do have the same root, you can now fix two issues in one PR! 😅 |
I was able to reproduce the issue when increasing the safe interval of the localhost computer:
$ verdi computer configure local localhost --safe-interval 10 -n
Then, stop the daemon and run the simple example you showed previously:
from aiida.engine import submit
from aiida.orm import Int, load_code
from aiida.plugins import CalculationFactory
ArithmeticAddCalculation = CalculationFactory('arithmetic.add')
inputs = {
'x': Int(1),
'y': Int(2),
'code': load_code('add@localhost'),
}
submit(ArithmeticAddCalculation, **inputs)
Next, start the daemon, sleep for 2 seconds and restart it:
$ verdi daemon start; sleep 2; verdi daemon restart --reset
Starting the daemon... RUNNING
Profile: generic
Waiting for the daemon to shut down... OK
Starting the daemon... RUNNING
(dev-test) max@7e5d5794c8bd:~/codes/aiida-core$ verdi process list -ap1
PK Created Process label Process State Process status
---- --------- ------------------------ --------------- -------------------------------------
189 38s ago ArithmeticAddCalculation ⨯ Excepted Transport task upload was interrupted
Total results: 1
Info: last time an entry changed state: 8s ago (at 11:27:25 on 2021-01-21)
Now the calcjob has been excepted.
Moreover, very funky stuff is happening after this. It tells me the daemon is not running (without me stopping it):
$ verdi daemon status
Profile: generic
The daemon is not running
But it actually is! When submitting another calculation, it does get picked up:
$ verdi process list -ap1
PK Created Process label Process State Process status
---- --------- ------------------------ --------------- -------------------------------------
189 5m ago ArithmeticAddCalculation ⨯ Excepted Transport task upload was interrupted
192 6s ago ArithmeticAddCalculation ⏵ Waiting Waiting for transport task: upload
Total results: 2
Info: last time an entry changed state: 6s ago (at 11:32:28 on 2021-01-21)
Warning: the daemon is not running
(dev-test) max@7e5d5794c8bd:~/.aiida/daemon/log$ verdi process list -ap1
PK Created Process label Process State Process status
---- --------- ------------------------ --------------- -------------------------------------
189 5m ago ArithmeticAddCalculation ⨯ Excepted Transport task upload was interrupted
192 13s ago ArithmeticAddCalculation ⏵ Waiting Waiting for transport task: submit
Total results: 2
Info: last time an entry changed state: 2s ago (at 11:32:38 on 2021-01-21)
Warning: the daemon is not running
Finally, note that here restarting the daemon did not encounter any TIMEOUT. |
Note: the You can also get |
@giovannipizzi / @mbercx I cannot reproduce the issue in https://github.com/aiidateam/aiida-integration-tests:
$ verdi daemon stop
$ aiida-sleep calc -n 2 --submit -p 100000
$ verdi daemon start
$ verdi daemon stop
$ verdi process list
1776 1m ago SleepCalculation ⏵ Waiting Waiting for transport task: upload
1779 58s ago SleepCalculation ⏵ Waiting Waiting for transport task: upload
$ verdi daemon start
$ verdi daemon stop
$ verdi process list
1776 1m ago SleepCalculation ⏵ Waiting Waiting for transport task: submit
1779 1m ago SleepCalculation ⏵ Waiting Waiting for transport task: submit
I feel it may be related to #4692, but then again I couldn't reproduce it before this commit either. As per #4667, can you confirm that you were not using python 3.6? Then can you try to reproduce it on https://github.com/aiidateam/aiida-integration-tests? |
yeh and just to double-check I added the safe interval as well:
$ verdi process list -ap1
PK Created Process label Process State Process status
---- --------- ------------------------ --------------- -------------------------------------
8517 42s ago ArithmeticAddCalculation ⏵ Waiting Waiting for transport task: upload
so @mbercx, yeh give it a go again with the latest |
Not sure if this is the same issue or if it needs a new one, but I think I am seeing a similar problem using aiida-lammps. A number of jobs seem to not perform the retrieve task correctly if a daemon restart occurs:
Checking on the remote machine, the files corresponding to the error aren't missing, so it implies an issue with the transport rather than the calculation itself. The behaviour differs in that the jobs aren't excepting but instead just failing to retrieve/parse some files. |
@mjclarke94 good to hear you are using aiida-lammps 😄 what do you get if you do |
@chrisjsewell Haha, that and
However the file definitely is there:
Assuming I've hit on the relevant bit of the log:
So lots of messages relating to jobs already being marked as parsing and then others failing to parse. |
😆 so I don't think the @sphuber would you expect to see such warnings in the log? |
The parser checks the contents of the |
No trajectory file in either a |
ah ok thanks @mjclarke94
@sphuber perhaps then this safety mechanism does not account for temporary retrieved files? |
Ah, that must mean the plugin uses the retrieved temporary feature such that the trajectory can be parsed and stored, but the original file is then discarded. This makes sense since these files typically get big and you don't want the content duplicated. Do the |
It should, but since he is on |
@sphuber I haven't seen the issue outside of a daemon restart, but might be conflating it with another issue. Here's a less compact
|
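For context, the "retrieved temporary" mechanism discussed above lets a plugin list files in CalcInfo.retrieve_temporary_list; they are retrieved into a temporary folder, handed to the parser via the retrieved_temporary_folder keyword, and deleted afterwards, so large raw files never get duplicated into the repository. Below is a rough sketch of the parser side only; the file name and exit-code labels are made up for illustration, and this is not the actual aiida-lammps parser.
import os

from aiida.orm import Dict
from aiida.parsers import Parser

class TrajectoryParserSketch(Parser):
    """Sketch of a parser that consumes a file from the retrieved temporary folder."""

    def parse(self, **kwargs):
        # The folder is only passed if the calculation specified a retrieve_temporary_list.
        temporary_folder = kwargs.get('retrieved_temporary_folder', None)
        if temporary_folder is None:
            return self.exit_codes.ERROR_NO_RETRIEVED_TEMPORARY_FOLDER  # hypothetical exit code

        trajectory_path = os.path.join(temporary_folder, 'trajectory.lammpstrj')  # hypothetical file name
        if not os.path.isfile(trajectory_path):
            return self.exit_codes.ERROR_TRAJECTORY_MISSING  # hypothetical exit code

        with open(trajectory_path) as handle:
            number_of_lines = sum(1 for _ in handle)

        # Store only a small parsed summary; the big raw file disappears with the temporary folder.
        self.out('trajectory_summary', Dict(dict={'number_of_lines': number_of_lines}))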
@sphuber it seems here (aiida-core/aiida/engine/processes/calcjobs/tasks.py, lines 390 to 395 in c07e3ef)
the temp_folder is created fresh every time you reach this part of the code. So if the daemon worker is stopped during the parsing, and a new daemon worker picks up the task when the daemon is restarted, it will get a new temp_folder (and the old temp_folder will also not be removed). So perhaps the temp_folder location should be stored on the node, e.g. something like:
temp_folder = node.get_retrieved_temp_folder()
if not temp_folder or not os.path.exists(temp_folder):
    temp_folder = tempfile.mkdtemp()
    node.set_retrieved_temp_folder(temp_folder) |
This definitely used to work at some point and without storing the temporary folder on the node. Instead after the restart it would generate a new temp folder and retrieve again. This is because you cannot guarantee that after the restart the original temp folder still exists so you have to accept the inefficiency of retrieving again. Could you please split this off in a new issue, since it has a different cause than the OP. I am convinced this used to work in previous versions, so I was trying to find the commit that introduced it, but haven't found it so far. Cannot look too much further into it now though unfortunately. |
nah you're just trying to find someone else to blame 😜
indeed |
Hey, I didn't say it was the most recent version that broke it nor that the commit I was looking for wouldn't turn out to be mine ;) |
Looking at the changes in 3.7 -> 3.8, one change that looks like it could cause this is: https://docs.python.org/3/whatsnew/3.8.html#asyncio
perhaps somewhere a CancelledError is being caught by a broad "except Exception" |
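To make the suspected mechanism concrete, here is a minimal standalone script (not aiida-core code): in Python 3.7, asyncio.CancelledError is concurrent.futures.CancelledError and inherits from Exception, so a broad "except Exception" swallows a task cancellation; in Python 3.8+ it inherits from BaseException and propagates past such a handler.
import asyncio

async def worker():
    # long-running coroutine that will be cancelled before it finishes
    await asyncio.sleep(3600)

async def main():
    task = asyncio.ensure_future(worker())
    await asyncio.sleep(0)  # let the worker start and suspend
    task.cancel()
    try:
        await task
    except Exception as exc:
        # reached on Python 3.7, where CancelledError is an Exception subclass
        print('cancellation swallowed by "except Exception":', type(exc))
    except asyncio.CancelledError as exc:
        # reached on Python 3.8+, where CancelledError inherits from BaseException
        print('cancellation propagated past "except Exception":', type(exc))

asyncio.run(main())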
Thanks @chrisjsewell, this does seem likely. I'm looking at this code in aiida-core that is responsible for the exception under python 3.7.9:
The CancelledError caught there (plumpy.futures.CancelledError) is just concurrent.futures.CancelledError. As mentioned here, as of python 3.8 this is a mistake - it should be using asyncio.CancelledError. However, I fear that "fixing" this will actually make the exception happen in python 3.8 as well ;-) |
so if you add |
in that same file (tasks.py) there are similar blocks:
except (plumpy.futures.CancelledError, plumpy.process_states.Interruption):  # pylint: disable=try-except-raise
    raise
except Exception:
which again would appear to have divergent behaviour between python 3.7 and 3.8 |
I've replaced every instance of Weirdly, though, this does not seem to lead to the exception in python 3.8 - the job still goes through fine. |
perhaps now replace all |
I checked; this also doesn't change anything. |
Just to clarify, searching for CancelledError across aiida-core, plumpy and kiwipy gives:
aiida: 9 results - 1 file
aiida/engine/processes/calcjobs/tasks.py:
92 try:
93 logger.info(f'scheduled request to upload CalcJob<{node.pk}>')
94: ignore_exceptions = (plumpy.futures.CancelledError, PreSubmitException, plumpy.process_states.Interruption)
95 skip_submit = await exponential_backoff_retry(
96 do_upload, initial_interval, max_attempts, logger=node.logger, ignore_exceptions=ignore_exceptions
98 except PreSubmitException:
99 raise
100: except (plumpy.futures.CancelledError, plumpy.process_states.Interruption):
101 raise
102 except Exception:
140 try:
141 logger.info(f'scheduled request to submit CalcJob<{node.pk}>')
142: ignore_exceptions = (plumpy.futures.CancelledError, plumpy.process_states.Interruption)
143 result = await exponential_backoff_retry(
144 do_submit, initial_interval, max_attempts, logger=node.logger, ignore_exceptions=ignore_exceptions
145 )
146: except (plumpy.futures.CancelledError, plumpy.process_states.Interruption): # pylint: disable=try-except-raise
147 raise
148 except Exception:
199 try:
200 logger.info(f'scheduled request to update CalcJob<{node.pk}>')
201: ignore_exceptions = (plumpy.futures.CancelledError, plumpy.process_states.Interruption)
202 job_done = await exponential_backoff_retry(
203 do_update, initial_interval, max_attempts, logger=node.logger, ignore_exceptions=ignore_exceptions
204 )
205: except (plumpy.futures.CancelledError, plumpy.process_states.Interruption): # pylint: disable=try-except-raise
206 raise
207 except Exception:
265 try:
266 logger.info(f'scheduled request to retrieve CalcJob<{node.pk}>')
267: ignore_exceptions = (plumpy.futures.CancelledError, plumpy.process_states.Interruption)
268 result = await exponential_backoff_retry(
269 do_retrieve, initial_interval, max_attempts, logger=node.logger, ignore_exceptions=ignore_exceptions
270 )
271: except (plumpy.futures.CancelledError, plumpy.process_states.Interruption): # pylint: disable=try-except-raise
272 raise
273 except Exception:
407 logger.warning(f'killed CalcJob<{node.pk}> but async future was None')
408 raise
409: except (plumpy.process_states.Interruption, plumpy.futures.CancelledError):
410 node.set_process_status(f'Transport task {command} was interrupted')
411 raise
plumpy: 3 results - 1 file
plumpy/futures.py:
8 import kiwipy
9
10: __all__ = ['Future', 'gather', 'chain', 'copy_future', 'CancelledError', 'create_task']
11
12: CancelledError = kiwipy.CancelledError
13
14
kiwipy: 6 results - 4 files
kiwipy/futures.py:
4 import logging
5
6: __all__ = 'Future', 'chain', 'copy_future', 'CancelledError', 'capture_exceptions', 'wait', 'as_completed'
7
8 _LOGGER = logging.getLogger(__name__)
9
10: CancelledError = concurrent.futures.CancelledError
11 wait = concurrent.futures.wait # pylint: disable=invalid-name
12 as_completed = concurrent.futures.as_completed # pylint: disable=invalid-name
kiwipy/rmq/communicator.py:
252 await self._send_response(reply_to, correlation_id, utils.pending_response())
253 future = await future
254: except kiwipy.CancelledError as exc:
255 # Send out a cancelled response
256 await self._send_response(reply_to, correlation_id, utils.cancelled_response(str(exc)))
kiwipy/rmq/tasks.py:
168 # Task was rejected by this subscriber, keep trying
169 continue
170: except kiwipy.CancelledError:
171 # The subscriber has cancelled their processing of the task
172 outcome.cancel()
kiwipy/rmq/threadcomms.py:
259 try:
260 result = kiwi_future.result()
261: except concurrent.futures.CancelledError:
262 self._loop.call_soon_threadsafe(aio_future.cancel)
263 except Exception as exc: # pylint: disable=broad-except |
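For reference, the ignore_exceptions tuples in the tasks.py excerpt above are handed to exponential_backoff_retry. A simplified sketch of what such a helper does follows; it is illustrative only, and the names and defaults are not the actual aiida-core implementation.
import asyncio
import logging

LOGGER = logging.getLogger(__name__)

async def exponential_backoff_retry_sketch(fct, initial_interval=10.0, max_attempts=5,
                                           logger=LOGGER, ignore_exceptions=()):
    """Call ``fct`` repeatedly, doubling the pause between attempts.

    Exceptions listed in ``ignore_exceptions`` (interruptions, cancellations) are
    re-raised immediately instead of being retried, which is what the
    ``except ...: raise`` blocks in tasks.py rely on.
    """
    interval = initial_interval
    for attempt in range(max_attempts):
        try:
            return await fct()
        except ignore_exceptions:
            raise  # interruptions/cancellations must not be swallowed or retried
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller handle the failure
            logger.warning('attempt %d failed, retrying in %s seconds', attempt + 1, interval)
            await asyncio.sleep(interval)
            interval *= 2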
Another potential place is in aiida/engine/utils.py:
def interruptable_task(
    coro: Callable[[InterruptableFuture], Awaitable[Any]],
    loop: Optional[asyncio.AbstractEventLoop] = None
) -> InterruptableFuture:
    """
    Turn the given coroutine into an interruptable task by turning it into an InterruptableFuture and returning it.
    :param coro: the coroutine that should be made interruptable with object of InterutableFuture as last paramenter
    :param loop: the event loop in which to run the coroutine, by default uses asyncio.get_event_loop()
    :return: an InterruptableFuture
    """
    loop = loop or asyncio.get_event_loop()
    future = InterruptableFuture()
    async def execute_coroutine():
        """Coroutine that wraps the original coroutine and sets it result on the future only if not already set."""
        try:
            result = await coro(future)
        except Exception as exception:  # pylint: disable=broad-except
            if not future.done():
                future.set_exception(exception)
            else:
                LOGGER.warning(
                    'Interruptable future set to %s before its coro %s is done. %s', future.result(), coro.__name__,
                    str(exception)
                )
        else:
            # If the future has not been set elsewhere, i.e. by the interrupt call, by the time that the coroutine
            # is executed, set the future's result to the result of the coroutine
            if not future.done():
                future.set_result(result)
    loop.create_task(execute_coroutine())
    return future
whereby an |
I've checked, and adding the I'm wondering whether the change in behavior of the CancelledError is perhaps a red herring... and whether we perhaps rather need to look into
I found the following in the python 3.7.10 changelog
but I tried python 3.7.10 and it does not fix the issue. |
stop the clock, I have a diagnosis + fix... |
ok so (deep breath)
In python 3.8 daemon log:
In python 3.7 daemon log:
From this (and the fix) I'm 90% sure that Task.cancel excepts with
We cancel all the tasks before closing the daemon here: aiida-core/aiida/engine/daemon/runner.py, Lines 31 to 32 in 976876f
For aiida-core/aiida/engine/processes/calcjobs/tasks.py Lines 409 to 411 in 976876f
and so the process status is not changed from "Waiting for transport task: upload" For except asyncio.CancelledError:
raise aiida-core/aiida/engine/transports.py Line 111 in 976876f
aiida-core/aiida/manage/external/rmq.py Line 212 in 976876f
|
Well, as mentioned in the python 3.8 changelog that you linked, in python 3.7 asyncio.CancelledError was simply concurrent.futures.CancelledError (see python/cpython@431b540#diff-5db3920ed2c2a4091ac1f9272983fbfd41df66f0cee777f8715b354102f64e36), i.e. the two names referred to the same class. As for the exception handling, the interesting question is now: which of the |
that's not correct though, the
Oh yes, that's what I'm doing now; #4648 (comment) details exactly what needs to be changed |
Well, slightly before the change linked above you have this: I.e. at that point, |
Ah fair, so:
is pretty misleading then, because it does not inherit from |
but yeh I will be making PRs shortly 👍 |
well in In |
cc @muhrin for comment |
First part of the fix: aiidateam/plumpy#214 |
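A minimal standalone sketch of the idea (independent of the actual plumpy/aiida-core patches): an interrupt handler that lists both exception classes behaves the same on Python 3.7 and 3.8+. The function names below are illustrative.
import asyncio
import concurrent.futures

async def transport_task():
    await asyncio.sleep(3600)

async def main():
    task = asyncio.ensure_future(transport_task())
    await asyncio.sleep(0)  # let the task start and suspend
    task.cancel()
    try:
        await task
    except (concurrent.futures.CancelledError, asyncio.CancelledError):
        # On 3.7 these two names are the same class; on 3.8+ asyncio.CancelledError
        # is a separate BaseException subclass, so listing both keeps the handler
        # working across versions: the process status can be updated here before
        # re-raising, which is what tasks.py intends to do.
        print('interrupt caught, update the process status, then re-raise')
        raise

try:
    asyncio.run(main())
except asyncio.CancelledError:
    pass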
Thanks Chris. I'd have to have a look a bit more deeply but |
Edit (from the issue description): this issue only occurs in python 3.7, so until the issue is fully resolved, a "fix" is to upgrade to python 3.8.