AiiDA scheduler plugin thinks job is finished while it is still running #4326
Following a brief discussion with @giovannipizzi and @sphuber, here follows an inspection of one of the failed calculation nodes and its attributes:

```
In [1]: c = load_node(896549)
In [2]: c.attributes
Out[2]:
{'job_id': '1219409',
'sealed': True,
'version': {'core': '1.3.0', 'plugin': '1.1.0'},
'withmpi': True,
'exception': 'aiida.common.exceptions.OutputParsingError: CP2K did not finish properly.\n',
'resources': {'num_machines': 2, 'num_mpiprocs_per_machine': 36},
'append_text': '',
'parser_name': 'lsmo.cp2k_advanced_parser',
'prepend_text': '',
'last_job_info': {'title': 'aiida-896549',
'job_id': '1219409',
'raw_data': ['1219409',
'R',
'None',
'h078',
'ongari',
'2',
'72',
'h[078,080]',
'parallel',
'3-00:00:00',
'5:03:06',
'2020-08-10T18:06:28',
'aiida-896549',
'2020-08-10T16:17:45'],
'job_owner': 'ongari',
'job_state': 'running',
'annotation': 'None',
'queue_name': 'parallel',
'num_machines': 2,
'num_mpiprocs': 72,
'dispatch_time': {'date': '2020-08-10T18:06:28.000000', 'timezone': None},
'submission_time': {'date': '2020-08-10T16:17:45.000000', 'timezone': None},
'allocated_machines_raw': 'h[078,080]',
'wallclock_time_seconds': 18186,
'requested_wallclock_time_seconds': 259200},
'process_label': 'Cp2kCalculation',
'process_state': 'excepted',
'retrieve_list': ['aiida.out',
'aiida-1.restart',
'aiida-pos-1.dcd',
'_scheduler-stdout.txt',
'_scheduler-stderr.txt'],
'input_filename': 'aiida.inp',
'remote_workdir': '/scratch/ongari/aiida_run/d4/11/b7ad-c962-4997-bab6-0788b3fdec49',
'output_filename': 'aiida.out',
'scheduler_state': 'done',
'scheduler_stderr': '_scheduler-stderr.txt',
'scheduler_stdout': '_scheduler-stdout.txt',
'detailed_job_info': {'retval': 0,
'stderr': '',
'stdout': 'AllocCPUS|Account|AssocID|AveCPU|AvePages|AveRSS|AveVMSize|Cluster|Comment|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|Eligible|End|ExitCode|GID|Group|JobID|JobName|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOSRAW|ReqCPUS|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|Timelimit|TotalCPU|UID|User|UserCPU|\n72|lsmo|882|||||helvetios||15-10:15:36|1332936|0:0|05:08:33|2020-08-10T16:17:45|Unknown|0:0|10697|lsmo|1219409|aiida-896549||||||||||72|2|h[078,080]||5838|parallel|1|72|01:48:43|5-10:27:36|469656|2020-08-10T18:06:28|RUNNING|2020-08-10T16:17:45|00:00:00|00:00:00|3-00:00:00|00:00:00|162182|ongari|00:00:00|\n36|lsmo|882|||||helvetios||7-17:07:48|666468||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.batch|batch||||||||||36|1|h078|1||||36||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|\n72|lsmo|882|||||helvetios||15-10:15:36|1332936||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.extern|extern||||||||||72|2|h[078,080]|2||||72||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|\n72|lsmo|882|||||helvetios||15-10:13:12|1332792||05:08:31|2020-08-10T18:06:30|Unknown|0:0|||1219409.0|cp2k.popt||||||||||72|2|h[078,080]|72||||72||||2020-08-10T18:06:30|RUNNING|2020-08-10T18:06:30|00:00:00|00:00:00||00:00:00|||00:00:00|\n'},
'mpirun_extra_params': [],
'environment_variables': {},
'max_wallclock_seconds': 259200,
'import_sys_environment': True,
'submit_script_filename': '_aiidasubmit.sh',
'retrieve_temporary_list': [],
'scheduler_lastchecktime': '2020-08-10T21:12:31.029024+00:00',
'custom_scheduler_commands': ''}
```

The "smoking gun" would seem to be in the detailed job info, which clearly specifies that, according to slurm, the job is still in state RUNNING.
For context: the detailed job info is recorded in the calculation node's attributes, and for the slurm scheduler it is obtained via `sacct` (see `aiida-core/aiida/schedulers/plugins/slurm.py`, line 233 at commit aa9a2cb).
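As a concrete illustration, here is a minimal sketch (not part of aiida-core; it only assumes the node pk and the attribute layout shown above) that parses the pipe-delimited `sacct` output stored in the `detailed_job_info` attribute and prints the `State` column for each job step:

```python
# Run inside `verdi shell`, or load a profile explicitly as below.
from aiida import load_profile
from aiida.orm import load_node

load_profile()

node = load_node(896549)  # pk of the failed calculation inspected above
detailed = node.get_attribute('detailed_job_info')

lines = detailed['stdout'].splitlines()
header = lines[0].split('|')
jobid_col, state_col = header.index('JobID'), header.index('State')

for line in lines[1:]:
    if line.strip():
        fields = line.split('|')
        # For the node above this prints e.g. "1219409 RUNNING", even though
        # AiiDA had already marked the job as done and started retrieval.
        print(fields[jobid_col], fields[state_col])
```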
When looking at the content of the folder today, despite the failed parsing step, the cp2k calculation on the cluster actually finished without issues. This is compatible with my suspicion that the issue occurs when AiiDA checks squeue (a problem either on the aiida side or, very possibly, on the slurm side).
From our discussion, possible explanations include:
Re 2.: see `aiida-core/aiida/schedulers/plugins/slurm.py`, lines 485 to 499 at commit aa9a2cb.
This seems a dangerous practice to me and would probably be quite straightforward to improve. P.S. Unfortunately, we no longer have the complete daemon logs from the time of the failed calculation, so we cannot check exactly what was printed there.
In these cases, stderr was non-empty and the scheduler error was raised as designed - but perhaps under high load / special circumstances, stderr can remain empty?
@giovannipizzi Can the logic be changed to something like:

```python
if 'invalid job id specified' not in stderr:
    if stderr.strip():
        self.logger.warning("Warning in _parse_joblist_output, non-empty stderr='{}'".format(stderr.strip()))
    if retval != 0:
        raise SchedulerError('Error during squeue parsing (_parse_joblist_output function)')
```

From your comments in the code it is not entirely clear to me whether the error raised by squeue, when specifying job ids of jobs that are no longer known to squeue, also leaves some identifiable trace in stderr.
Same with slurm 20.02.2 at CSCS.
@giovannipizzi I now also tested with slurm 17.11 on Quantum Mobile:
I.e. so far I have encountered only two responses:
Is there a slurm version that has a different behavior for case 2?
I don't remember the exact details of the old behaviour (but the header states which version it was). Also, I don't know whether the answer you get depends on whether the (completed) job has already been removed from the SLURM history. On Daint/Fidis, is the logic the same (a single job gives an error, multiple don't)? Also, maybe what happens is that the string is printed in the output and then squeue crashes? Could this have been the case? I think the old logic would have tried to parse anyway, with the consequences described - but your suggestion would have the same issue. Is there a way to check if this is the case, e.g. by increasing the verbosity? How often does the problem appear?
See also this thread for this and similar timeout issues: mllg/batchtools#201. Maybe we want to say that if there is more stderr, we stop? Is this going to be safe enough? Or do we risk that things will remain queued forever because of some additional stderr on some machines?
I just checked - it does not (would also be very surprising if it did).
Yes (I checked).
Right. I checked that simply appending a comma does not change things.
This results in the same behavior of slurm independently of the length of the list of jobs. Alternatively, we implement a simple logic: if the list of jobs we are querying for has length 1 (I hope there is a way to access this from within `_parse_joblist_output`), we handle that single-job case explicitly.
Well, I was assuming that the exit code in case of a crash is non-zero (hopefully?).
Looking at the squeue manpage, there is an option for verbose output, but I doubt this will help.
There may be a record of squeue crashes at the operating system level (e.g. in the system logs).
Well, first of all I think we should not be ignoring a non-zero exit code (irrespective of whether stderr is empty or not).
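For illustration, here is a minimal sketch of the kind of check being argued for (not the actual aiida-core implementation; `retval`, `stdout` and `stderr` are assumed to be the result of the `squeue` call as in the snippets above, and the `SchedulerError` import path is an assumption based on aiida-core 1.x): a non-zero exit code is only tolerated in the known 'invalid job id' case, and any unexpected stderr is at least logged.

```python
from aiida.schedulers import SchedulerError  # assumed import path

def check_squeue_result(retval, stdout, stderr, logger):
    """Sketch: validate the outcome of an squeue call before parsing stdout."""
    stderr = stderr.strip()
    if retval != 0:
        if 'invalid job id specified' in stderr.lower():
            # Expected when explicitly querying job ids that have already left
            # the queue; parsing the (possibly empty) stdout is still fine.
            logger.warning("squeue reported unknown job ids: '%s'", stderr)
        else:
            # Never silently ignore an unexplained non-zero exit code,
            # even if stderr happens to be empty.
            raise SchedulerError(
                "squeue failed with exit code {}, stderr: '{}'".format(retval, stderr))
    elif stderr:
        logger.warning("squeue returned 0 but with non-empty stderr: '%s'", stderr)
    return stdout
```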
If we go this route, make sure to include an extensive comment 😅
Very clever! I agree it's a bit of a trick, though. Maybe better to just do if/else then (and hope that our inference of the behaviour of slurm is correct, also for other versions...).
Are you sure? I think you wrote:

```python
if 'invalid job id specified' not in stderr:
    if stderr.strip():
        self.logger.warning("Warning in _parse_joblist_output, non-empty stderr='{}'".format(stderr.strip()))
    if retval != 0:
        raise SchedulerError('Error during squeue parsing (_parse_joblist_output function)')
```

So, if the stderr includes 'invalid job id specified', a non-zero exit code would still be ignored.
I agree. At the time of the first implementation, I couldn't find a better way.
You might want to write to the admins - you have the job ID and a quite precise time of when this happened. At least we could confirm the issue and prepare a proper fix. Feel free to prepare a PR improving on the current situation - ideally, however, it would be great if we managed to reproduce the problem at least once, with some extensive logging that shows the exact stderr and stdout, so that we can confirm that we have a good fix.
Concerning this, my question was whether I can access the original joblist from within `_parse_joblist_output`; if not, that makes it a bit more fragile.
Perhaps I misunderstood what you meant by "this issue" - I was referring to the fact that a non-zero exit code with empty stderr would previously go undetected (which would be fixed by the logic I wrote). While it's true that it may still be problematic to ignore the exit code simply because we find 'invalid job id specified' in the stderr, if you want to solve this remaining problem as well, I don't really see another solution than the modification I proposed for the joblist passed to squeue, i.e. making sure (in one way or another) that all the squeue commands we issue are expected to return exit code 0.
I will contact them - let's see whether we can find something out.
Ok - I don't think you can access the list of jobs from within `_parse_joblist_output`, so I think it's ok to proceed with the trick of putting the last job id twice (and explaining why).
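For illustration, a minimal sketch of that trick (not the change that eventually went into aiida-core; the function name is made up): by repeating the last job id, `squeue` is always asked about at least two jobs, so - per the behaviour observed above - it never produces the hard error that a single unknown job id triggers.

```python
def build_squeue_job_selector(job_ids):
    """Sketch: build the --jobs argument for squeue.

    With a single job id that has already left the queue, squeue exits with
    'invalid job id specified'; with two or more ids it simply omits the
    unknown ones. Repeating the last id makes both cases behave the same.
    """
    job_ids = list(job_ids)
    if len(job_ids) == 1:
        job_ids.append(job_ids[0])
    return '--jobs={}'.format(','.join(job_ids))

# Both calls now behave the same way on the squeue side.
print(build_squeue_job_selector(['1219409']))             # --jobs=1219409,1219409
print(build_squeue_job_selector(['1219409', '1219410']))  # --jobs=1219409,1219410
```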
Unfortunately, we need to reopen this issue, since #4332 was evidently not enough to resolve it - we are still encountering this in AiiDA 1.5.2. One new piece of information: I have encountered this issue on a local slurm installation when submitting a medium-sized number of jobs (say, 100), which should really not be enough to overload slurm to the point that it starts behaving outside its specifications. The connection cooldown time was 0s (local computer); I did not touch the minimum scheduler poll interval, so I assume it is 10s. I have access to the slurm logs and I can see that:
Edit: Sorry, I got a bit ahead of myself there, comparing different timings. I've edited the list now so that it correctly reflects what I understand now. In particular, I don't seem to have any cases of jobs being retrieved before their submission to slurm had completed, so my original idea of the cause may have been wrong.
So let's see if we can get the timeline correct here:
Even though you struck out your initial suspicion, just to answer your question: AiiDA decides that a job is done either when its job id no longer appears in the job list returned by the scheduler query, or when the scheduler reports it in a 'done' state. That being said, as you stated, in your case none of these seem to apply, because the jobs are actually confirmed to be scheduled by SLURM before the last check time of AiiDA. Somehow though, at some query iteration, AiiDA still decides they are completed and starts to retrieve, even though they are actually still queued in SLURM. Here we should really just see what the exact response of the squeue call was.
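As a rough illustration of that decision logic (a simplified sketch, not the actual aiida-core engine code): the job ids AiiDA is still tracking are compared against the job list parsed from the scheduler response, and anything missing from the response is treated as terminated and handed over to retrieval - which is why an incomplete squeue answer leads directly to premature retrieval.

```python
def find_terminated_jobs(tracked_job_ids, joblist_response):
    """Sketch: decide which tracked jobs are considered terminated.

    `tracked_job_ids` is the set of scheduler job ids AiiDA is still waiting
    on; `joblist_response` maps job id -> parsed job info (e.g. from squeue).
    Any tracked id absent from the response is assumed to have finished.
    """
    return {job_id for job_id in tracked_job_ids if job_id not in joblist_response}

# Job '1219409' is still running, but if the scheduler response comes back
# incomplete it is (wrongly) flagged as terminated and gets retrieved.
print(find_terminated_jobs({'1219409', '1219410'}, {'1219410': {'state': 'RUNNING'}}))
```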
As referenced on the mailing list, I've been encountering some issues with the scheduler plugin mistaking jobs for completed while they are still queued or in progress. Note that it's somewhat difficult to isolate the issue, since it usually only occurs when queuing multiple processes. I'll try to paste the relevant information here:
The following is from the log file (verbosity = DEBUG), grepped by process PK:
The tracejob output for a similar failed job:
The log file entries surrounding an error:
@sphuber to address your comment posted to the aiida group: initial tests indicate that no extra jobs are spawned (the count of jobs on the scheduler matches the count of processes submitting jobs), but I'll have to investigate this further in order to get a better idea of what's going on.
From time to time, AiiDA will retrieve the files in a calculation folder before the calculation has finished.
When checking the calculation folder using `gotocomputer` at a later point in time, one finds that the calculation actually finished fine. From a user perspective, this issue is very similar to #3404, but we've now established (see below) that there are two different failure modes.
In the case discussed here, AiiDA gets an incorrect job list (which no longer contains the job in question) and therefore starts retrieving it.
===========
@sphuber We ran into this issue on helvetios/fidis.
Now, following up on your suggestion to investigate the failure mode, I have looked at the historical job records of both clusters during that time using `sacct`. In particular, one can query for the job ids and work directories of past jobs (see the sketch below). It turned out that the list of work directories is unique.
If the reason for the failure was that AiiDA submitted the same job twice (causing one or both of them to fail), then we should see the corresponding work directories appear twice in the list. I.e. in this particular case, I believe the issue occurs at the point where AiiDA queries the scheduler for the job status and gets an incomplete/incorrect answer.
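A sketch of such a query (the exact flags and the time window are assumptions, not the original command that was used): list job ids and work directories of past jobs with `sacct` and check whether any work directory appears more than once, which would point to a double submission.

```python
import subprocess
from collections import Counter

# Assumed flags: -X (allocations only), --parsable2 (pipe-delimited output),
# --noheader, and a start time covering the period of interest.
cmd = [
    'sacct', '-X', '--noheader', '--parsable2',
    '--starttime', '2020-08-10',
    '--format', 'JobID,WorkDir',
]
output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

workdirs = [line.split('|')[1] for line in output.splitlines() if line.strip()]
duplicates = {wd: count for wd, count in Counter(workdirs).items() if count > 1}
# An empty result means no work directory was used twice, i.e. a double
# submission by AiiDA can be ruled out.
print('duplicate work directories:', duplicates or 'none')
```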
As mentioned during our discussion, one solution to this issue could be to add a fail-safe mechanism for when AiiDA gets the information that a job has completed. This could be any of:

- running `squeue` again, just to be sure
- `scontrol show job <id>`
- `sacct -j <id>`

and only accepting that the job is no longer running if the output agrees (a sketch of such a cross-check follows below).
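A minimal sketch of what such a fail-safe could look like (an illustration only, not aiida-core code; the set of terminal states and the `sacct` flags are assumptions): before accepting that a job has disappeared from the `squeue` output, ask `sacct` for a second opinion and only proceed to retrieval if it reports a terminal state.

```python
import subprocess

# States that sacct reports for jobs that really have stopped running
# (assumed list; adjust for the slurm version in use).
TERMINAL_STATES = {'COMPLETED', 'FAILED', 'CANCELLED', 'TIMEOUT', 'NODE_FAIL', 'OUT_OF_MEMORY'}

def job_really_finished(job_id):
    """Return True only if sacct confirms the job is in a terminal state."""
    result = subprocess.run(
        ['sacct', '-X', '--noheader', '--parsable2', '--format', 'State', '-j', str(job_id)],
        capture_output=True, text=True)
    if result.returncode != 0 or not result.stdout.strip():
        # No confirmation available: err on the side of caution and keep
        # treating the job as still running instead of retrieving early.
        return False
    state = result.stdout.splitlines()[0].split('|')[0].split()[0]  # e.g. 'CANCELLED by 123'
    return state in TERMINAL_STATES

# Only hand the job over to retrieval if the second opinion agrees.
if job_really_finished(1219409):
    print('safe to retrieve')
else:
    print('job still known to slurm as active - postpone retrieval')
```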
Mentioning @danieleongari for info
Originally posted by @ltalirz in #3404 (comment)