New option for the specification of hostfile for sos run and sos execute #1279

Closed
BoPeng opened this issue Jul 28, 2019 · 3 comments

@BoPeng
Contributor

BoPeng commented Jul 28, 2019

#1278

It seems clear that a hostfile is needed for multi-node execution. Although PBS systems can generate a host file automatically, and commands such as sos execute and sos run can pick it up automatically, we still need an option that lets users specify one manually for multi-node execution of workflows and tasks.

This option should work like this:

  1. Without it, everything is run locally.
  2. With it, it should be the name of a host file, similar to the --hostfile option of SCOOP, with a similar or identical format. The workers will be created on these hosts.
  3. Under a cluster system with appropriate environment variables, the host file will be picked up automatically, similar to what SCOOP does.
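
The three rules above could be sketched roughly as follows. This is only an illustration, not the sos implementation: the function name `resolve_hosts` is hypothetical, the host file format (`host [count]` per line) is an assumption, and `PBS_NODEFILE` is the environment variable that PBS systems commonly use to point at the generated node file, in which a host is typically listed once per allocated slot.

```python
import os
from collections import Counter


def resolve_hosts(hostfile=None, environ=None):
    """Return a {host: n_workers} mapping following the three rules above.

    Hypothetical helper: the name and the file format are assumptions.
    """
    environ = os.environ if environ is None else environ
    # Rule 2: an explicitly specified host file takes precedence.
    # Rule 3: otherwise fall back to the node file advertised by the scheduler.
    path = hostfile or environ.get("PBS_NODEFILE")
    if path is None:
        # Rule 1: nothing specified, so everything runs locally.
        # None stands for "use the default number of workers".
        return {"localhost": None}
    hosts = Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0].startswith("#"):
                continue  # skip blank lines and comments
            # Assumed format: "host [count]"; a bare host name counts as one slot,
            # so a host repeated on several lines accumulates slots.
            hosts[fields[0]] += int(fields[1]) if len(fields) > 1 else 1
    return dict(hosts)
```

Counting repeated lines lets the same reader handle both an explicit `host count` file and a node file that repeats each host name once per slot.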

The problem is that sos run does not support -- options, so we will have to reuse an existing option or find another one.

Once this option is specified, users can use

sos run -j hostfile

to run the workflow on multiple hosts.

Use

%PBS ...
sos run workflow

to run the entire workflow on a cluster system.

The same mechanism will be used for the execution of tasks, something like

%PBS
sos execute task
@BoPeng
Contributor Author

BoPeng commented Jul 28, 2019

We could reuse -j and say

  1. -j 4 is 4 processes on the local host
  2. -j 4 some_machine:4 is 4 on localhost and 4 on some_machine
  3. -j @file

For the last usage, we do not have to say that it is a file; we can instead rely on the @ syntax provided by the fromfile_prefix_chars feature of argparse.
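
For reference, here is a minimal sketch of how `fromfile_prefix_chars` behaves with a multi-valued `-j`. The parser below is a stand-in written for illustration (the program name, the `workers` destination, and the temporary file are assumptions), not the actual sos command line parser.

```python
import argparse
import os
import tempfile

# Stand-in parser: any argument starting with "@" is replaced by the
# contents of the named file, one argument per line.
parser = argparse.ArgumentParser(prog="sos-run-sketch", fromfile_prefix_chars="@")
parser.add_argument("-j", nargs="*", default=[], dest="workers")

# Plain values are passed through unchanged.
print(parser.parse_args(["-j", "4", "some_machine:4"]).workers)

# Write a small file with one worker specification per line, then refer to it as @file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("node1:4\nnode2:4\n")
args_from_file = parser.parse_args(["-j", "@" + f.name]).workers
os.unlink(f.name)
print(args_from_file)
```

Because argparse expands the file before the arguments are parsed, `-j @file` ends up equivalent to listing the file's lines after `-j` directly.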

@gaow
Member

gaow commented Jul 28, 2019

This interface reads intuitively. I'm not sure I understand 2: when running remote tasks, the -j option specifies the resources needed to manage the remote tasks on the machine from which the tasks are submitted. I'm not sure why 2 is necessary: don't we always manage it from localhost?

@BoPeng
Contributor Author

BoPeng commented Jul 28, 2019

This interface mimics the execution model of SCOOP, namely the non-cluster multi-node execution of workflows. It is added because PBS systems generate node files (albeit in different formats) that specify the nodes to use for execution on a cluster, and sos is supposed to read the node files and start workers on remote nodes.

That means there is no need to differentiate cluster and non-cluster multi-node execution, and we can say

  1. -j 4 is the same as -j localhost:4, for local execution with the specified number of workers.
  2. -j node1 node2 node3 starts workers on node1, node2, and node3, using a default number of workers that depends on the number of cores on each node.
  3. -j node1:4 node2:4 node3:4 specifies the number of worker processes on each node.
  4. -j @file uses the parameters in file.
  5. Without -j, we use something like ncores/2 processes (we have a more complex formula) on the local host, or the nodes listed in the nodefile provided by the PBS system if we are on a cluster system.

In all cases, the first node should be the "master", where the master process will be executed.
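
A rough sketch of how these -j values could be interpreted is shown below. The function name `parse_worker_specs` and the use of `None` for "default number of workers" are assumptions made for illustration; the nodefile fallback for the no-`-j` case is left out.

```python
def parse_worker_specs(specs):
    """Turn -j values into an ordered list of (host, n_workers) pairs.

    Hypothetical helper. n_workers is None when the count should be
    chosen by default (for example from the number of cores).
    """
    if not specs:
        # No -j given: local execution with a default worker count.
        return [("localhost", None)]
    pairs = []
    for spec in specs:
        if spec.isdigit():
            # "-j 4" is shorthand for "-j localhost:4".
            pairs.append(("localhost", int(spec)))
            continue
        # "node1:4" gives an explicit count, a bare "node1" leaves it to the default.
        host, sep, count = spec.partition(":")
        pairs.append((host, int(count) if sep else None))
    return pairs


pairs = parse_worker_specs(["node1:4", "node2", "node3:2"])
master = pairs[0][0]  # the first node is where the master process runs
```

Keeping the result as an ordered list rather than a dictionary preserves the "first node is the master" rule.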

This option will then be used by both sos run and sos execute. For example, use

sos run script -j node1 node2

to execute the entire workflow on multiple nodes. We can also submit this to a PBS system, in which case the syntax would most likely be

%PBS nodes=5:ppn=5
sos run -q none

where the nodefile is used implicitly.

The -j option for sos execute should mostly be kept hidden (perhaps used only for debugging), and the command should be used as

%PBS nodes=5:ppn=5
sos execute task_id

to execute a single multi-node task, or a single multi-task master task.

BoPeng pushed a commit that referenced this issue Jul 29, 2019
@BoPeng BoPeng closed this as completed Aug 10, 2019