New option to sos execute for multi-node task execution #1277

Closed
BoPeng opened this issue Jul 23, 2019 · 18 comments

@BoPeng
Contributor

BoPeng commented Jul 23, 2019

#1276

sos execute needs to know what executor to use. Given that most tasks are submitted by the sos-pbs module with a template-generated bash file similar to

%PBS ..

sos execute {tasks}

it is most natural to change the template to something like

%PBS...

sos execute -e pbs {tasks}

so that sos execute can look for an executor called pbs and use it to execute the specified tasks.

It is possible to let the pbs task engine do some magic, but making the option explicit seems better and allows easier testing.

@BoPeng
Contributor Author

BoPeng commented Jul 24, 2019

Option -e executor is added to the command sos execute. However, it could have the following uses:

  1. sos execute task_id -e executor can execute a single master task with id task_id on multiple nodes.
  2. sos execute task1 task2 task3 -e executor can execute multiple tasks on multiple nodes. In this case tasks can be single or master tasks.

Since 2 is more general, I think the interface should allow for both cases, in which case task_id itself should not know the number of nodes (so the trunk_workers=(2,4) idea should be discarded).

However, the PBS task manager currently submits one job for each task, with a template in the format of

%PBS -l nodes={nodes}:ppn={cores}

sos execute {task} -v {verbosity} -s {sig_mode} -m {run_mode} -e pbs

We should relax it to allow the submission of multiple tasks in one shell script, something like

%PBS -l nodes={nodes}:ppn={cores}

sos execute {' '.join(tasks)} -v {verbosity} -s {sig_mode} -m {run_mode} -e pbs

However, it is unclear how to control how many tasks should go into one PBS job. Should we add another option, such as batch_size to control that?

In the end, assuming the interface is:

  1. trunk_size, trunk_workers are kept intact.
  2. New batch_size option to control the number of tasks per job.
  3. New nodes option to control the number of nodes for the job.

With the aforementioned template, we have the following scenarios:

| syntax | effect | pros | cons |
| --- | --- | --- | --- |
|  | One task per job, running on one node |  |  |
| trunk_size=10 | Group tasks into master tasks | Fewer tasks |  |
| trunk_size=0 | Group all tasks into a master task | Fewer tasks |  |
| batch_size=10 | Group tasks into jobs | Fewer jobs | Same number of tasks, single node per job |
| batch_size=0 | Execute all tasks in one job | Fewer jobs | Same number of tasks, single-node execution |
| cores=5 | error? ignore? |  |  |
| cores=5 with trunk_size | Spread subtasks to nodes |  |  |
| cores=5 with batch_size | Spread tasks to nodes |  |  |
| cores=5 with trunk_size and batch_size | Spread subtasks to nodes |  |  |
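For illustration, a task under this proposal might be declared as follows (batch_size is the proposed option from the list above, not an existing one, and run_analysis is a hypothetical command):

task: trunk_size=10, batch_size=5, cores=5, mem='4G'
sh: expand=True
    run_analysis --threads 5 {_input}

With 100 substeps, trunk_size=10 would produce 10 master tasks, and batch_size=5 would then pack them into 2 PBS jobs.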

@gaow
Member

gaow commented Jul 25, 2019

Sorry, I'm traveling until the end of this week ... also, sorry, but I want to clarify some basic stuff:

sos execute needs to know what executor to use.

How is this relevant to multi-node task execution?

It is possible to let the pbs task engine do some magic

What magic -- could you give an example?

Option -e executor is added to command sos execute.

Cool! I'll now try to digest the newly proposed interface and see if I can understand it easily.

@gaow
Member

gaow commented Jul 25, 2019

trunk_size = 0

This is awkward ... trunk_size = None is a bit better?

I'm trying to simplify the interface from a user's perspective. The previous interface with trunk_size and trunk_workers is already a bit too much in my view. I'm not sure why a user has to understand the difference between a subtask and a master task. The smallest unit of execution, to an SoS user, should be substeps. cores and mem specify the resources needed to run each substep. With that taken care of, users will ask the question of how to run substeps on a job queue system. trunk_size essentially controls how many substeps to group into one job script, and trunk_workers controls the extent of parallelization for substeps in each trunk. All of this happens on one node.
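For instance (a minimal sketch; input_files is assumed to be a list of, say, 100 files defined earlier, and analyze is a hypothetical command), the settings below would pack the 100 substeps into 10 job scripts, each running its 10 substeps with 4 parallel workers on one node:

[10]
input: input_files, group_by=1
task: trunk_size=10, trunk_workers=4, cores=1, mem='2G'
sh: expand=True
    analyze {_input}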

Now, to a user, the only difference is that the above can happen on multiple nodes. That is, by changing trunk_workers = n_threads to trunk_workers = (n_nodes, n_threads) (or maybe trunk_workers = 2x3 instead of (2,3)), users will be able to increase trunk_size and thus reduce the number of jobs submitted. This is more complicated, but still intuitive to me, and well motivated. I'm not seeing a place for distinguishing subtasks from master tasks -- that sounds like a technical detail we can somehow hide?

@BoPeng
Contributor Author

BoPeng commented Jul 25, 2019

I'll now try to digest the newly proposed interface and see if I can understand it easily.

Basically, right now the task executor inside sos executes on one node, with optional multiprocessing (trunk_workers). If we define another task executor inside the sos-pbs package, or something else in the sos-kubernetes package, we need some way for the command

sos execute

to know which executor to use. Currently the solution is to use

sos execute -e pbs

which as an entry point refers to sos_pbs.task_executor:PBSTaskExecutor.
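Roughly, the lookup could work like the sketch below (the entry point group name sos_executors is hypothetical here, not necessarily what SoS actually uses):

import pkg_resources

def get_executor(name):
    # Find the executor class registered under the given name, e.g. 'pbs'
    # resolving to sos_pbs.task_executor:PBSTaskExecutor.
    for entry_point in pkg_resources.iter_entry_points(group='sos_executors'):
        if entry_point.name == name:
            return entry_point.load()
    raise ValueError('Unknown task executor: {}'.format(name))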

As I said, we could be more clever and let sos execute use the task-engine-specific task executor, but -e is a bit more straightforward.

Anyway, none of this is fixed, and the entire batch_size / nodes business is up for discussion. Personally, I sometimes want to avoid master tasks because it can be tricky to figure out what went wrong with a particular subtask, but perhaps we should address that problem directly instead of introducing batch_size.

@BoPeng
Contributor Author

BoPeng commented Jul 25, 2019

cores and mem specify the resources needed to run each substep.

The problem here is that right now all substeps are single-node jobs and cores are not used. It is possible for us to allow multi-node single tasks later (#476), but we are not there yet, and this could conflict with trunk_workers=(n_node, n_proc).

I do agree with your single-task concept. That is to say, we can remove the concept of master task (not showing any subtask ID), so basically a task can have "work" for one or more (trunk_size) substeps, and the work can be spread to multiple workers on one or more nodes (trunk_workers).

@gaow
Member

gaow commented Jul 25, 2019

I see. So -e executor is more like an internal or intermediate feature, at least at this point while we are experimenting with new execution models. I was confused earlier because I thought the current execution model is a subset, or at least a special case from the user's perspective, of multi-node task execution, so I was not expecting a separate user interface even though under the hood they are quite different for now. sos-kubernetes is a different story, which justifies the -e option better. Still, the whole sos execute business is somewhat of an intermediate feature, and eventually I think the interface to switch to kubernetes will have to end up in either the sos run command, or some templates such as config.yml or a job script template where sos execute will become relevant (I say it is an intermediate feature because it seems to mostly appear in job templates).

right now all substeps are single-node jobs and cores are not used

I understand all substeps are now single-node jobs. But cores is used to specify the number of CPUs used per substep -- is that right? Then cores is directly used to reserve CPUs for a substep. Not sure what is meant by "cores are not used".

It is possible for us to allow for multi-node single-task later

I think I now understand what you meant -- similar to using cores directly to specify the CPUs requested, we can also use something like nodes to specify the nodes used. But cores can be confused with trunk_workers = n_proc, because then the actual number of CPUs used on a single node would be cores * n_proc, right? So it is not so much a conflict as it is just harder to interpret, in the single-node, multiple-threads case. I'm not sure whether the problem for multiple nodes is of a similar nature, or whether there will actually be some conflict in that case.

@BoPeng
Contributor Author

BoPeng commented Jul 26, 2019

So with #1278, let us clarify that

  1. nodes, similar to cores, mem, and walltime, specifies the resources used by one task. That is to say, until Support for single multi-node task #1278 is resolved, nodes will remain unimplemented.

  2. trunk_workers=[n_proc]*n_nodes will be used to specify the number of nodes. The advantage of this format is not to allow different numbers of processes on different nodes, but to avoid confusion between (n_proc, n_nodes) and (n_nodes, n_proc).

  3. If one of these two options is specified, {nodes} in the template file will be correctly determined.

  4. If both of these two options are specified, things become complicated, so it makes sense to disallow that. There should rarely be a need to use trunk_workers for multi-node jobs.

Implementation-wise, the task executor needs to differentiate between the two cases:

  1. In the nodes case, the tasks will have no trunk_workers, so the same task will be executed on multiple nodes in single-node mode. The underlying MPI job will be responsible for communication between the nodes.

  2. In the trunk_workers case, the tasks will know the number of nodes, so sos will run in multi-node mode.
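A hypothetical sketch of how {nodes} and {ppn} could be determined from these two options (not the actual implementation):

def nodes_and_ppn(nodes=1, cores=1, trunk_workers=None):
    # Case 2: trunk_workers=[4]*3 means 3 nodes with 4 processes each.
    if isinstance(trunk_workers, (list, tuple)):
        if nodes > 1:
            raise ValueError('nodes and multi-node trunk_workers cannot be combined')
        if len(set(trunk_workers)) != 1:
            raise ValueError('different worker counts per node are not supported')
        return len(trunk_workers), trunk_workers[0]
    # Case 1: a multi-node (e.g. MPI) task requests the nodes itself.
    return nodes, cores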

@gaow
Member

gaow commented Jul 28, 2019

nodes, similar to cores, mem, and walltime, specifies the resources used by one task. That is to say, until #1278 is resolved, nodes will remain unimplemented.

nodes, if similar to cores etc., should go directly into the job template without impacting the behavior of sos execute. It specifies the computational requirement of a substep. In most cases for daily computations we do not use MPI applications, and that makes nodes=1 most of the time. Is that true? My point is that unless users have an MPI application, nodes will not do any magic. It is just a parameter in the template. This is why I'm not sure I understand #1278 -- for a single task (or substep), SoS should not have to do anything to execute it on multiple nodes; the substep determines whether the job is multi-node or single-node.

trunk_workers=[n_proc]*n_nodes

What does [ ] mean? Is it for allowing different numbers of processes on different nodes? Not sure why that is needed ... except that the syntax does (awkwardly) help avoid confusion.

it makes sense to disallow it. There should be rarely a need to use trunk_worker for multi-node jobs.

So this seems to confirm my first response in this post that nodes simply specifies the number of nodes for a multi-node substep ...?

Implementation wise ...

I think we are on the same page, but I just want to confirm. The use of the term task is a bit confusing because I keep thinking of sos execute implementation details. Would it be nice to focus on the concepts of "substep" and "job" and hide the notion of "task" when we introduce this to users in the documentation? All they need to worry about is how substeps are scheduled and executed in a job queue system. I always consider sos execute task an internal feature that users do not have to bother with.

@BoPeng
Contributor Author

BoPeng commented Jul 28, 2019

This is why I'm not sure I understand #1278

#1278 implements a nodes parameter that allows the execution of multi-node tasks, which would essentially allow the execution of

task: nodes=5
sh:
    mpirun mpi_job

through a template

%PBS -l {nodes}:{cores}

sos execute {task_id}

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

Here is the proposed interface: make trunk_workers equivalent to -j from the command line so that it accepts:

  1. trunk_workers=4 means 4 processes on localhost.

  2. trunk_workers=None, or by default, gets -j from the cluster. That is to say, if a user creates a shell script with

#PBS nodes=2:ppn=4
sos execute task_id

It will be translated to

sos execute task_id -j node1:4 node2:4

and the task will be processed by 8 processes on two nodes. Either for testing or with a real application,

sos execute -j 4 node1:4 node2:4 task_id

will be allowed to process the task using 12 processes on three computers (non-cluster). Here the usage of -j is identical to that of sos run.

A drawback of this usage is that SoS uses trunk_workers to estimate total walltime, mem, etc. For example, if cores=2 and trunk_workers=2, SoS would say 4 cores are needed. We will not be able to do this with trunk_workers=None. For case 2 this is ok since users are specifying a fixed template or using -j directly without walltime constraints etc., but walltime etc. for case 3 (the next case) will cause problems. We might have to detect this case and warn users.

  3. A more common usage, however, is the use of templates. In this case, trunk_workers=[4, 4, 4] means three nodes, each with 4 processes. In theory it could accept trunk_workers=['node1:4', 'node2:4', 'node3:4']. However, SoS will only derive {nodes} and {ppn} from this parameter and treat other values (e.g. trunk_workers=[4, 3]) as errors, which is why the trunk_workers=[4]*3 expression makes more sense here.

In this case, our template

#PBS nodes={nodes}:ppn={cores}

sos execute task_id -e pbs

would generate a -j that matches trunk_workers.

  4. An error will be raised for non-matching -j and trunk_workers, regardless of whether -j is specified on the command line or derived from $PBS_NODEFILE (a sketch of this derivation is at the end of this comment).

All of this assumes nodes=1, meaning that all tasks are single-node tasks. In the case of a multi-node single task (#1278), nodes will populate {nodes} in the template, which will lead to a correct -j. However, because such tasks do not have trunk_workers, we will execute the task in the master worker, and the task will spread itself to multiple nodes by itself.
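For case 4 above, a sketch of how the per-node -j values could be derived from $PBS_NODEFILE (a hypothetical helper, not actual SoS code; $PBS_NODEFILE lists one hostname per allocated processor slot):

import os
from collections import Counter

def workers_from_nodefile():
    # Count the slots each node provides, e.g. {'node1': 4, 'node2': 4},
    # and turn them into the ['node1:4', 'node2:4'] form expected by -j.
    nodefile = os.environ.get('PBS_NODEFILE')
    if not nodefile:
        return None
    with open(nodefile) as nf:
        slots = Counter(line.strip() for line in nf if line.strip())
    return ['{}:{}'.format(host, count) for host, count in slots.items()]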

@gaow
Member

gaow commented Aug 8, 2019

Make trunk_workers equivalent to -j from command line

Previously, -j controlled how many processes to use on the machine that manages tasks, when -J is also used. That usage was more or less an administrative feature not related to job specifications. I wonder what happens to that (the resources to manage jobs) when we make -j = trunk_workers.

A more common usage, is however the use of templates.

I"m still confused in this case. Yes either ['node1:4', 'node2:4', 'node3:4'] or [4]*3 makes sense. The latter perhaps more so, because typically we do not specify explicitly what core to use on which node . So why [4] * 3 not 4 * 3 (which is cleaner)?

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

why [4] * 3 not 4 * 3 (which is cleaner)?

[4]*3 in Python is just [4, 4, 4], which is what we would get from -j 4 4 4 (actually ['4', '4', '4']). 4*3 will be evaluated to 12 if we do task: trunk_workers=4*3.

@gaow
Member

gaow commented Aug 8, 2019

I mean we should write a function to parse 4*3 to [4] * 3, to make it easier for users.

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

Previously, -j controls how many processes to use on the machine that manages tasks,

-j specifies the number of workers for steps, substeps, and subworkflows for the execution of sos workflows, NOT tasks, as tasks are executed outside of sos instances. In the case of sos execute task_id -j, we are specifying the number of workers for (sub)tasks.

@gaow
Member

gaow commented Aug 8, 2019

Yes, I was aware that -j is not for tasks. But I thought that when used with -J (for tasks) in sos run, it controls how many processes are used to communicate with and manage tasks. That is why previously the default of -j max_cpu/2 meant -j 14 on my computer and thus used too much memory. We fixed that by setting -j 4 -- do you recall that? Then I was left with the impression that we need to set this "admin" parameter, until this proposed new usage.

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

I mean we should write a function to parse 4*3 to [4] * 3, to make it easier for users.

I see, that has to be

trunk_workers='4*3'

not

trunk_workers=4*3

However, because we support

trunk_workers=4

(not a string) and we are, technically speaking, allowing

trunk_workers=['node1:4', 'node2:4']

I felt that extending from

trunk_workers=4

to

trunk_workers=[4, 4, 4]

and then to

trunk_workers=['node1:4', 'node2:4', 'node3:4']

makes more sense.
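If we did want to support the string form, the parsing could be as simple as the following sketch (a hypothetical helper, not part of SoS):

def parse_trunk_workers(value):
    # Expand the suggested string form '4*3' into [4, 4, 4]; leave the
    # integer, list, and 'host:n' forms untouched.
    if isinstance(value, str) and '*' in value:
        per_node, n_nodes = value.split('*', 1)
        return [int(per_node)] * int(n_nodes)
    return value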

@BoPeng
Contributor Author

BoPeng commented Aug 8, 2019

But I thought when used with -J (for tasks) in sos run, it controls how many processes used to communicate and manage tasks.

-J controls the number of active tasks in a queue, and is basically only used as -J 1 to override the max_running_jobs setting of a queue for debugging purposes. I would remove it if I could find an alternative way to achieve this particular usage.

@gaow
Member

gaow commented Aug 8, 2019

Okay, I agree that using numeric values consistently for trunk_workers is a good idea. Maybe I've got a misconception about -j, but the new interface makes sense. I'll reserve my doubts about resource usage in "admin" threads until after the new interface is implemented, and I'll test it then.
