Job status report inconsistency #1058
That branch is progressing step by step to keep the current tests passing. Right now the slot manager is gone and signatures are written by a single thread, with #1056 as a blocker for concurrent substeps. You are right that the reporting part is being reconsidered, so it might fix this problem. However, using zmq in a more sophisticated way, as ipyparallel does (#1057), would be very difficult, to the point that we might want to just use ipyparallel, although our DAG execution pattern does not really fit their model. On the other hand, they say their model is flexible enough to accommodate all task models, so I will read more about ipyparallel before I attempt more sophisticated worker-management patterns. This is why I was asking whether such patterns are needed. So I suppose I will fix #1056 and stop there before major problems are found with the DAG execution part.
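To make the "signatures are written by a single thread" idea concrete, here is a minimal sketch of the pattern: concurrent workers enqueue signature records and a single dedicated writer thread persists them, so no two threads ever write at once. All names (`sig_queue`, `signature_writer`, the dict standing in for a file) are illustrative assumptions, not sos internals.

```python
import queue
import threading

# Shared queue: workers produce (task_id, signature) records,
# exactly one writer thread consumes and persists them.
sig_queue: "queue.Queue" = queue.Queue()
written = {}  # stand-in for the on-disk signature store

def signature_writer():
    # Drain the queue until a None sentinel arrives.
    while True:
        item = sig_queue.get()
        if item is None:
            break
        task_id, signature = item
        # Only this thread touches the store, so writes never race.
        written[task_id] = signature

writer = threading.Thread(target=signature_writer)
writer.start()

# Several "workers" submit signatures; Queue.put is thread-safe.
for i in range(5):
    sig_queue.put((f"task_{i}", f"sha:{i:04d}"))

sig_queue.put(None)  # sentinel: no more signatures
writer.join()
print(len(written))  # 5 signatures, all written by one thread
```

The appeal of the pattern is that it serializes file writes without locks around every write site; the trade-off is that the writer thread becomes a bottleneck if signature volume is very high.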
The false alarm is likely caused by some filesystem hiccup that prevents sos from detecting the activity of the tasks (no change to the pulse file after 10+ seconds). The workflow thinks the tasks were dead, but the tasks continued to run and completed without problem. The pulse mechanism is used to tell the task engine that a task is still alive. It is file based because tasks can be executed on a remote machine on a shared file system (such as a cluster). This is one of the major differences between sos and other systems: sos relies on the tasks themselves to report that they are alive, while others rely on live workers (each executing more than one task) to report on them.
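The failure mode described above can be sketched with a toy version of a file-based heartbeat: the task periodically touches a pulse file, and the engine treats a stale mtime as a dead task, so a lagging shared filesystem can produce exactly this kind of false alarm. The 10-second threshold and all function names here are assumptions for illustration, not sos's actual implementation.

```python
import os
import tempfile
import time

PULSE_TIMEOUT = 10  # seconds without a pulse update => presumed dead

def touch_pulse(pulse_file: str) -> None:
    """Called by the running task to report it is still alive."""
    with open(pulse_file, "a"):
        pass
    os.utime(pulse_file, None)  # refresh the mtime

def task_seems_alive(pulse_file: str, timeout: float = PULSE_TIMEOUT) -> bool:
    """Engine-side check: if the pulse file's mtime stops advancing
    for longer than `timeout`, the task *looks* dead -- even when it
    is actually running and the filesystem is merely slow to show
    the update."""
    try:
        age = time.time() - os.path.getmtime(pulse_file)
    except FileNotFoundError:
        return False
    return age < timeout

pulse = os.path.join(tempfile.mkdtemp(), "task.pulse")
touch_pulse(pulse)
print(task_seems_alive(pulse))  # True right after a pulse
```

Note that the check can only observe the file's metadata: if NFS attribute caching or a storage stall delays the mtime update past the timeout, the engine marks the task failed while the process itself keeps running to completion, which matches the status inconsistency reported in this issue.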
I've got a multi-step workflow running on a PBS system. Only the first step completed, with an error message:

I know that I submitted 49 jobs, but I do see 49 outputs from this step.
I then tried:

and

`sos status | grep aborted`

returns nothing. So according to `sos status`, all 49 were successful. Sorry I cannot give an MWE (how could I provide one?). Just making a note here. Maybe the new `zmq` branch will have it addressed thanks to its general improvements? (You see, I have a lot of faith in that branch.)