New option to sos execute for multi-node task execution #1277

Closed
BoPeng opened this issue Jul 23, 2019 · 18 comments

@BoPeng
Contributor

BoPeng commented Jul 23, 2019

#1276

sos execute needs to know what executor to use. Given that most tasks are submitted by the sos-pbs module with a template-generated bash file similar to

%PBS ..

sos execute {tasks}

it is most natural to change the template to something like

%PBS...

sos execute -e pbs {tasks}

so that sos execute can look for an executor called pbs and use it to execute the specified tasks.

It is possible to let the pbs task engine do some magic, but making the option explicit seems better and allows easier testing.

@BoPeng
Contributor Author

BoPeng commented Jul 24, 2019

Option -e executor is added to the command sos execute. However, it could have the following uses:

  1. sos execute task_id -e executor can execute a single master task with id task_id on multiple nodes.
  2. sos execute task1 task2 task3 -e executor can execute multiple tasks on multiple nodes. In this case tasks can be single or master tasks.

Since 2 is more general, I think the interface should allow for both cases, in which case task_id itself should not know the number of nodes (so the trunk_workers=(2,4) idea should be discarded).

However, the PBS task manager currently submits one job for each task, with a template in the format of

%PBS -l nodes={nodes}:ppn={cores}

sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode} -e pbs

We should relax it to allow the submission of multiple tasks in one shell script, something like

%PBS -l nodes={nodes}:ppn={cores}

sos execute {' '.join(tasks)} -v {verbosity} -s {sig_mode} -m {run_mode} -e pbs

However, it is unclear how to control how many tasks should go into one PBS job. Should we add another option, such as batch_size to control that?

In the end, assuming the interface is:

  1. trunk_size, trunk_workers are kept intact.
  2. New batch_size option to control the number of tasks per job.
  3. New nodes option to control the number of nodes for the job.

With the aforementioned template, we have the following scenarios:

| syntax | effect | pros | cons |
| --- | --- | --- | --- |
|  | One task per job, running on one node |  |  |
| trunk_size=10 | Group tasks into master tasks | Fewer tasks |  |
| trunk_size=0 | Group all tasks into a master task | Fewer tasks |  |
| batch_size=10 | Group tasks into jobs | Fewer jobs | Same number of tasks, single node per job |
| batch_size=0 | Execute all tasks in one job | Fewer jobs | Same number of tasks, single-node execution |
| cores=5 | error? ignore? |  |  |
| cores=5 with trunk_size | Spread subtasks to nodes |  |  |
| cores=5 with batch_size | Spread tasks to nodes |  |  |
| cores=5 with trunk_size and batch_size | Spread subtasks to nodes |  |  |
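For illustration, a task under this proposal might be declared as follows (batch_size is the proposed option from the list above, not an existing one, and run_analysis is a hypothetical command):

task: trunk_size=10, batch_size=5, cores=5, mem='4G'
sh: expand=True
    run_analysis --threads 5 {_input}

With 100 substeps, trunk_size=10 would produce 10 master tasks, and batch_size=5 would then pack them into 2 PBS jobs.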

@gaow
Member

gaow commented Jul 25, 2019

Sorry, I'm traveling until the end of this week ... also, sorry, but I want to clarify some basic stuff:

sos execute needs to know what executor to use.

How is this relevant to multi-node task execution?

It is possible to let the pbs task engine do some magic

What magic -- could you give an example?

Option -e executor is added to command sos execute.

Cool! I'll now try to digest the newly proposed interface and see if I can understand it easily.

@gaow
Member

gaow commented Jul 25, 2019

trunk_size = 0

This is awkward ... trunk_size = None is a bit better?

I'm trying to simplify the interface from a user's perspective. The previous interface with trunk_size and trunk_workers is already a bit too much in my view. I'm not sure why a user has to understand the difference between a subtask and a master task. The smallest unit of execution, to an SoS user, should be substeps. cores and mem specify the resources needed to run each substep. With that taken care of, users will ask the question of how to run substeps on a job queue system. trunk_size essentially controls how many substeps to group into one job script, and trunk_workers controls the extent of parallelization for substeps in each trunk. All of this happens on one node.
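For instance (a minimal sketch; input_files is assumed to be a list of, say, 100 files defined earlier, and analyze is a hypothetical command), the settings below would pack the 100 substeps into 10 job scripts, each running its 10 substeps with 4 parallel workers on one node:

[10]
input: input_files, group_by=1
task: trunk_size=10, trunk_workers=4, cores=1, mem='2G'
sh: expand=True
    analyze {_input}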

Now, to a user, the only difference is that the above can happen on multiple nodes. That is, by changing trunk_workers = n_threads to trunk_workers = (n_nodes, n_threads) (or maybe trunk_workers = 2x3 instead of (2,3)), users will be able to increase trunk_size and thus reduce the number of jobs submitted. This is more complicated, but still intuitive to me, and well motivated. I'm not seeing a place for distinguishing subtasks from master tasks -- that sounds like a technical detail we can somehow hide?

@BoPeng
Contributor Author

BoPeng commented Jul 25, 2019

I'll now try to digest the newly proposed interface and see if I can understand it easily.

Basically, right now the task executor inside sos executes on one node, with optional multiprocessing (trunk_workers). If we define another task executor inside the sos-pbs package, or something else in the sos-kubernetes package, we need some way for the command

sos execute

to know which executor to use. Currently the solution is to use

sos execute -e pbs

which as an entry point refers to sos_pbs.task_executor:PBSTaskExecutor.
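Roughly, the lookup could work like the sketch below (the entry point group name sos_executors is hypothetical here, not necessarily what SoS actually uses):

import pkg_resources

def get_executor(name):
    # Find the executor class registered under the given name, e.g. 'pbs'
    # resolving to sos_pbs.task_executor:PBSTaskExecutor.
    for entry_point in pkg_resources.iter_entry_points(group='sos_executors'):
        if entry_point.name == name:
            return entry_point.load()
    raise ValueError('Unknown task executor: {}'.format(name))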

As I said, we could be more clever and let sos execute use the task-engine-specific task executor, but -e is a bit more straightforward.

Anyway, none of this is fixed, and the entire batch_size / nodes business is up for discussion. Personally, I sometimes want to avoid master tasks because it can be tricky to figure out what went wrong with a particular subtask, but perhaps we should address that problem directly instead of introducing batch_size.

@BoPeng
Contributor Author

BoPeng commented Jul 25, 2019

cores and mem specify the resources needed to run each substep.

The problem here is that right now all substeps are single-node jobs and cores are not used. It is possible for us to allow multi-node single tasks later (#476), but we are not there yet, and this could conflict with trunk_workers=(n_node, n_proc).

I do agree with your single-task concept. That is to say, we can remove the concept of master task (not showing any subtask ID), so basically a task can have "work" for one or more (trunk_size) substeps, and the work can be spread to multiple workers on one or more nodes (trunk_workers).

@gaow
Member

gaow commented Jul 25, 2019

I see. So -e executor is more like an internal or intermediate feature, at least at this point while we are experimenting with new execution models. I was confused earlier because I thought the current execution model is a subset, or at least a special case from the user's perspective, of multi-node task execution, so I was not expecting a separate user interface even though under the hood they are quite different for now. sos-kubernetes is a different story, which justifies the -e option better. Still, the whole sos execute business is somewhat of an intermediate feature, and eventually I think the interface to switch to kubernetes will have to end up in either the sos run command, or some templates such as config.yml or a job script template where sos execute will become relevant (I say it is an intermediate feature because it seems to mostly appear in job templates).

right now all substeps are single-node jobs and cores are not used

I understand all substeps are now single-node jobs. But cores is used to specify the number of CPUs used per substep -- is that right? Then cores is directly used to reserve CPUs for a substep. Not sure what is meant by "cores are not used".

It is possible for us to allow for multi-node single-task later

I think I now understand what you meant -- similar to using cores directly to specify the CPUs requested, we can also use something like nodes to specify the nodes used. But cores can be confused with trunk_workers = n_proc, because then the actual number of CPUs used on a single node would be cores * n_proc, right? So it is not so much a conflict as it is just harder to interpret, in the single-node, multiple-threads case. I'm not sure whether the problem for multiple nodes is of a similar nature, or whether there will actually be some conflict in that case.

@BoPeng
Contributor Author

BoPeng commented Jul 26, 2019

So with #1278, let us clarify that

  1. nodes, similar to cores, mem, and walltime, specifies the resources used by one task. That is to say, until Support for single multi-node task #1278 is resolved, nodes will remain unimplemented.

  2. trunk_workers=[n_proc]*n_nodes will be used to specify the number of nodes. The advantage of this format is not to allow different numbers of processes on different nodes, but to avoid confusion between (n_proc, n_nodes) and (n_nodes, n_proc).

  3. If one of these two options is specified, {nodes} in the template file will be correctly determined.

  4. If both of these two options are specified, things become complicated, so it makes sense to disallow that. There should rarely be a need to use trunk_workers for multi-node jobs.

Implementation-wise, the task executor needs to differentiate between the two cases:

  1. In the nodes case, the tasks will have no trunk_workers, so the same task will be executed on multiple nodes in single-node mode. The underlying MPI job will be responsible for communication between the nodes.

  2. In the trunk_workers case, the tasks will know the number of nodes, so sos will run in multi-node mode.
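A hypothetical sketch of how {nodes} and {ppn} could be determined from these two options (not the actual implementation):

def nodes_and_ppn(nodes=1, cores=1, trunk_workers=None):
    # Case 2: trunk_workers=[4]*3 means 3 nodes with 4 processes each.
    if isinstance(trunk_workers, (list, tuple)):
        if nodes > 1:
            raise ValueError('nodes and multi-node trunk_workers cannot be combined')
        if len(set(trunk_workers)) != 1:
            raise ValueError('different worker counts per node are not supported')
        return len(trunk_workers), trunk_workers[0]
    # Case 1: a multi-node (e.g. MPI) task requests the nodes itself.
    return nodes, cores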

@gaow
Member

gaow commented Jul 28, 2019

nodes, similar to cores, mem, and walltime, specifies the resources used by one task. That is to say, until #1278 is resolved, nodes will remain unimplemented.

nodes, if similar to cores etc., should go directly into the job template without impacting the behavior of sos execute. It specifies the computational requirement of a substep. In most cases for daily computations we do not use MPI applications, and that makes nodes=1 most of the time. Is that true? My point is that unless users have an MPI application, nodes will not do any magic. It is just a parameter in the template. This is why I'm not sure I understand #1278 -- for a single task (or substep), SoS should not have to do anything to execute it on multiple nodes; the substep determines whether the job is multi-node or single-node.

trunk_workers=[n_proc]*n_nodes

What does [ ] mean? Is it for allowing different numbers of processes on different nodes? Not sure why that is needed ... except that the syntax does (awkwardly) help avoid confusion.

it makes sense to disallow it. There should be rarely a need to use trunk_worker for multi-node jobs.

So this seems to confirm my first response in this post that nodes simply specifies the number of nodes for a multi-node substep ...?

Implementation wise ...

I think we are on the same page, but I just want to confirm. The use of the term task is a bit confusing because I keep thinking of sos execute implementation details. Would it be nice to focus on the concepts of "substep" and "job" and hide the notion of "task" when we introduce this to users in the documentation? All they need to worry about is how substeps are scheduled and executed in a job queue system. I always consider sos execute task an internal feature that users do not have to bother with.

@BoPeng
Contributor Author

BoPeng commented Jul 28, 2019

This is why I'm not sure I understand #1278

#1278 implements a nodes parameter that allows the execution of multi-node tasks, which would essentially allow the execution of

task: nodes=5
sh:
    mpirun mpi_job

through a template

%PBS -l {nodes}:{cores}

sos execute {task_id}

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

Here is the proposed interface: make trunk_workers equivalent to -j from the command line so that it accepts:

  1. trunk_workers=4 means 4 processes on localhost.

  2. trunk_workers=None, or by default, gets -j from the cluster. That is to say, if a user creates a shell script with

#PBS nodes=2:ppn=4
sos execute task_id

It will be translated to

sos execute task_id -j node1:4 node2:4

and the task will be processed by 8 processes on two nodes. Either for testing or with a real application,

sos execute -j 4 node1:4 node2:4 task_id

will be allowed to process the task using 12 processes on three computers (non-cluster). Here the usage of -j is identical to that of sos run.

A drawback of this usage is that SoS uses trunk_workers to estimate total walltime, mem, etc. For example, if cores=2 and trunk_workers=2, SoS would say 4 cores are needed. We will not be able to do this with trunk_workers=None. For case 2 this is ok since users are specifying a fixed template or using -j directly without walltime constraints etc., but walltime etc. for case 3 (the next case) will cause problems. We might have to detect this case and warn users.

  3. A more common usage, however, is the use of templates. In this case, trunk_workers=[4, 4, 4] means three nodes, each with 4 processes. In theory it could accept trunk_workers=['node1:4', 'node2:4', 'node3:4']. However, SoS will only derive {nodes} and {ppn} from this parameter and treat other values (e.g. trunk_workers=[4, 3]) as errors, which is why the trunk_workers=[4]*3 expression makes more sense here.

In this case, our template

#PBS nodes={nodes}:ppn={cores}

sos execute task_id -e pbs

would generate a -j that matches trunk_workers.

  4. An error will be raised for non-matching -j and trunk_workers, regardless of whether -j is specified on the command line or derived from $PBS_NODEFILE (a sketch of this derivation is at the end of this comment).

All of this assumes nodes=1, meaning that all tasks are single-node tasks. In the case of a multi-node single task (#1278), nodes will populate {nodes} in the template, which will lead to a correct -j. However, because such tasks do not have trunk_workers, we will execute the task in the master worker, and the task will spread itself to multiple nodes by itself.
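For case 4 above, a sketch of how the per-node -j values could be derived from $PBS_NODEFILE (a hypothetical helper, not actual SoS code; $PBS_NODEFILE lists one hostname per allocated processor slot):

import os
from collections import Counter

def workers_from_nodefile():
    # Count the slots each node provides, e.g. {'node1': 4, 'node2': 4},
    # and turn them into the ['node1:4', 'node2:4'] form expected by -j.
    nodefile = os.environ.get('PBS_NODEFILE')
    if not nodefile:
        return None
    with open(nodefile) as nf:
        slots = Counter(line.strip() for line in nf if line.strip())
    return ['{}:{}'.format(host, count) for host, count in slots.items()]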

@gaow
Member

gaow commented Aug 8, 2019

Make trunk_workers equivalent to -j from command line

Previously, -j controlled how many processes to use on the machine that manages tasks, when -J is also used. That usage was more or less an administrative feature not related to job specifications. I wonder what happens to that (the resources to manage jobs) when we make -j = trunk_workers.

A more common usage, is however the use of templates.

I"m still confused in this case. Yes either ['node1:4', 'node2:4', 'node3:4'] or [4]*3 makes sense. The latter perhaps more so, because typically we do not specify explicitly what core to use on which node . So why [4] * 3 not 4 * 3 (which is cleaner)?

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

why [4] * 3 not 4 * 3 (which is cleaner)?

[4]*3 in Python is just [4, 4, 4], which is what we would get from -j 4 4 4 (actually ['4', '4', '4']). 4*3 will be evaluated to 12 if we do task: trunk_workers=4*3.

@gaow
Member

gaow commented Aug 8, 2019

I mean we should write a function to parse 4*3 to [4] * 3, to make it easier for users.

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

Previously, -j controls how many processes to use on the machine that manages tasks,

-j specifies the number of workers for steps, substeps, and subworkflows for the execution of sos workflows, NOT tasks, as tasks are executed outside of sos instances. In the case of sos execute task_id -j, we are specifying the number of workers for (sub)tasks.

@gaow
Member

gaow commented Aug 8, 2019

Yes, I was aware that -j is not for tasks. But I thought that when used with -J (for tasks) in sos run, it controls how many processes are used to communicate with and manage tasks. That is why previously the default of -j max_cpu/2 meant -j 14 on my computer and thus used too much memory. We fixed that by setting -j 4 -- do you recall that? Then I was left with the impression that we need to set this "admin" parameter, until this proposed new usage.

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

I mean we should write a function to parse 4*3 to [4] * 3, to make it easier for users.

I see, that has to be

trunk_workers='4*3'

not

trunk_workers=4*3

However, because we support

trunk_workers=4

(not a string) and we are, technically speaking, allowing

trunk_workers=['node1:4', 'node2:4']

I felt that extending from

trunk_workers=4

to

trunk_workers=[4, 4, 4]

and then to

trunk_workers=['node1:4', 'node2:4', 'node3:4']

makes more sense.
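If we did want to support the string form, the parsing could be as simple as the following sketch (a hypothetical helper, not part of SoS):

def parse_trunk_workers(value):
    # Expand the suggested string form '4*3' into [4, 4, 4]; leave the
    # integer, list, and 'host:n' forms untouched.
    if isinstance(value, str) and '*' in value:
        per_node, n_nodes = value.split('*', 1)
        return [int(per_node)] * int(n_nodes)
    return value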

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

But I thought when used with -J (for tasks) in sos run, it controls how many processes used to communicate and manage tasks.

-J controls the number of active tasks in a queue, and is basically only used as -J 1 to override the max_running_jobs setting of a queue for debugging purposes. I would remove it if I could find an alternative way to achieve this particular usage.

@gaow
Member

gaow commented Aug 8, 2019

Okay, I agree that using numeric values consistently for trunk_workers is a good idea. Maybe I've got a misconception about -j, but the new interface makes sense. I'll reserve my doubts about resource usage in "admin" threads until after the new interface is implemented, and I'll test it then.
