Job status report inconsistency #1058

Open
gaow opened this issue Sep 21, 2018 · 3 comments

Comments

gaow (Member) commented Sep 21, 2018

I've got a multi-step workflow running on a PBS system. Only the first step completed, with an error message:

ERROR: 47 jobs completed, 2 jobs aborted

I know that I submitted 49 jobs, but I do see 49 output files from this step.

I then tried:

sos status | grep $name | grep completed | wc -l
49

and sos status | grep aborted returns nothing.

So according to sos status, all 49 were successful.
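
One way to cross-check is to tally every status value instead of grepping for a single one. This is just a quick sketch, and it assumes the status is the last field on each line of the sos status output:

sos status | grep $name | awk '{print $NF}' | sort | uniq -c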

Sorry, I cannot give an MWE (I am not sure how I could provide one); I am just making a note of it here. Maybe the new zmq branch will address it through its general improvements? (You can see I have a lot of faith in that branch.)

gaow (Member, Author) commented Sep 21, 2018

BTW, all my stderr files showed no problems. It is just a bit inconvenient that my intended overnight workflow did not go through because of the false alarm.

BoPeng (Contributor) commented Sep 21, 2018

That branch is progressing step by step to keep the current tests passing. Right now the slot manager is gone and signatures are written by a single thread, with #1056 as a blocker for concurrent substeps. You are right that the reporting part is being reconsidered, so it might end up fixing this problem.

However, using zmq in a more sophisticated way, as ipyparallel (#1057) does, would be difficult enough that we might want to just use ipyparallel, even though our DAG execution pattern does not really fit their model. On the other hand, they say their model is flexible enough to accommodate all task models, so I will read more about ipyparallel before I attempt more sophisticated worker management patterns. This is why I was asking whether such patterns are needed.

So I suppose I will fix #1056 and stop there before major problems with the DAG execution part are found.

BoPeng (Contributor) commented Sep 21, 2018

The false alarm was likely caused by a filesystem hiccup that prevented sos from detecting the activity of the tasks (no change to the pulse file after 10+ seconds). The workflow thought the tasks were dead, but the tasks continued to run and completed without problems.

The pulse mechanism is used to tell the task engine that a task is still alive. It is file based because tasks can be executed on remote machines over a shared file system (such as a cluster). This is one of the major differences between sos and other systems: sos relies on the tasks themselves to report that they are alive, while other systems rely on live workers (each executing more than one task) to report on them. That is to say, sos execute task executes a single task without contacting the task engine; the task engine checks changes to the task files to determine task status. On other systems, tasks are executed by workers that keep a live connection to the task engine.
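
As a rough illustration of what such a file-based liveness check amounts to (the pulse file path, the 10-second threshold, and GNU stat are assumptions for illustration, not the actual sos internals):

# does the pulse file show recent activity for this task?
task="$1"                                      # task ID to check (hypothetical usage)
pulse="$HOME/.sos/tasks/${task}.pulse"         # hypothetical location of the pulse file
now=$(date +%s)
mtime=$(stat -c %Y "$pulse" 2>/dev/null || echo 0)   # last modification time (GNU stat)
if [ $(( now - mtime )) -gt 10 ]; then
    echo "no pulse update for $(( now - mtime ))s -- the engine would mark the task as aborted"
else
    echo "task appears alive"
fi

If the shared file system delays updates to the pulse file, a check like this reports a task as dead even though it is still running, which matches the false alarm described above.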
