Job status report inconsistency #1058

Open
gaow opened this issue Sep 21, 2018 · 3 comments

Comments

gaow (Member) commented Sep 21, 2018

I've got a multi-step workflow running on a PBS system. Only the first step completed, with an error message:

ERROR: 47 jobs completed, 2 jobs aborted

I know that I submitted 49 jobs, but I do see 49 output files from this step.

I then tried:

sos status | grep $name | grep completed | wc -l
49

and sos status | grep aborted returns nothing.

So according to sos status, all 49 were successful.
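
One way to cross-check is to tally every status value instead of grepping for a single one. This is just a quick sketch, and it assumes the status is the last field on each line of the sos status output:

sos status | grep $name | awk '{print $NF}' | sort | uniq -c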

Sorry, I cannot give an MWE (I am not sure how I could provide one); I am just making a note of it here. Maybe the new zmq branch will address it through its general improvements? (You can see I have a lot of faith in that branch.)

gaow (Member, Author) commented Sep 21, 2018

BTW, all my stderr files showed no problems. It is just a bit inconvenient that my intended overnight workflow did not go through because of the false alarm.

BoPeng (Contributor) commented Sep 21, 2018

That branch is progressing step by step to keep the current tests passing. Right now the slot manager is gone and signatures are written by a single thread, with #1056 as a blocker for concurrent substeps. You are right that the reporting part is being reconsidered, so it might end up fixing this problem.

However, using zmq in a more sophisticated way, as ipyparallel (#1057) does, would be difficult enough that we might want to just use ipyparallel, even though our DAG execution pattern does not really fit their model. On the other hand, they say their model is flexible enough to accommodate all task models, so I will read more about ipyparallel before I attempt more sophisticated worker management patterns. This is why I was asking whether such patterns are needed.

So I suppose I will fix #1056 and stop there before major problems with the DAG execution part are found.

BoPeng (Contributor) commented Sep 21, 2018

The false alarm was likely caused by a filesystem hiccup that prevented sos from detecting the activity of the tasks (no change to the pulse file after 10+ seconds). The workflow thought the tasks were dead, but the tasks continued to run and completed without problems.

The pulse mechanism is used to tell the task engine that a task is still alive. It is file based because tasks can be executed on remote machines over a shared file system (such as a cluster). This is one of the major differences between sos and other systems: sos relies on the tasks themselves to report that they are alive, while other systems rely on live workers (each executing more than one task) to report on them. That is to say, sos execute task executes a single task without contacting the task engine; the task engine checks changes to the task files to determine task status. On other systems, tasks are executed by workers that keep a live connection to the task engine.
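
As a rough illustration of what such a file-based liveness check amounts to (the pulse file path, the 10-second threshold, and GNU stat are assumptions for illustration, not the actual sos internals):

# does the pulse file show recent activity for this task?
task="$1"                                      # task ID to check (hypothetical usage)
pulse="$HOME/.sos/tasks/${task}.pulse"         # hypothetical location of the pulse file
now=$(date +%s)
mtime=$(stat -c %Y "$pulse" 2>/dev/null || echo 0)   # last modification time (GNU stat)
if [ $(( now - mtime )) -gt 10 ]; then
    echo "no pulse update for $(( now - mtime ))s -- the engine would mark the task as aborted"
else
    echo "task appears alive"
fi

If the shared file system delays updates to the pulse file, a check like this reports a task as dead even though it is still running, which matches the false alarm described above.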
