
Refactor directives. #785

Open
joaander opened this issue Nov 3, 2023 · 13 comments

joaander commented Nov 3, 2023

Feature description

  • Refactor the job directives for logical consistency, standard terminology, and better support for modern batch schedulers.
  • Complete the ongoing refactor of the job templates and move as much as possible into the base SLURM template. System-specific templates should be minimal, if needed at all.

Proposed solution

Replace directives with the new schema:

  • executable: Same as the previous executable.
  • walltime: Same as the previous walltime.
  • launcher: Which launcher to use: None (the default) or 'mpi'.
  • processes: The number of processes to execute. Equivalent to the previous np when launcher is None and to nranks when launcher == 'mpi'.
  • threads_per_process: Replaces the previous omp_num_threads with a more general term. Flow will always set OMP_NUM_THREADS when threads_per_process is greater than 1.
  • gpus_per_process: The number of GPUs to schedule per process. Replaces the previous aggregate gpu.
  • memory_per_cpu: The amount of memory to request per CPU thread. Replaces the previous aggregate memory with a quantity that is more naturally expressible and easier to set appropriately for a given machine configuration.

processor_fraction is not present in the new schema. It is not implementable in any batch scheduler currently in production use. If users desire to oversubscribe resources with many short tasks, they can use an interactive job and run --parallel.

fork should also be removed. Flow automatically decides to fork when needed.
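
To make the schema concrete, here is a minimal sketch of how an operation might declare the proposed directives. It assumes the existing @FlowProject.operation(directives=...) decorator form carries over unchanged; only the directive keys are new.

# Hypothetical usage of the proposed directive keys. Assumes the current
# @FlowProject.operation(directives=...) decorator form is unchanged.
from flow import FlowProject


class Project(FlowProject):
    pass


# Hybrid MPI/OpenMP operation: 8 ranks with 64 OpenMP threads each.
@Project.operation(directives={
    "launcher": "mpi",
    "processes": 8,
    "threads_per_process": 64,
})
def compute(job):
    pass


if __name__ == "__main__":
    Project().main()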

Additional context

This design would solve #777, provide a more understandable schema for selecting resources, and reduce the effort needed to develop future cluster job templates.

When launcher is None:

  • Flow will request the defined resources, but distributing processes, threads, memory, and GPUs to the appropriate resources is left to the application.
  • Flow will error when more than 1 node is requested. A launcher is required to distribute processes to multiple nodes.
  • Both serial and parallel bundles are typically supported. Flow will error when asked to launch parallel bundles on machines known to enable aggressive core binding in the job's main shell process (unless we can disable that binding in the template script).

When launcher == 'mpi':

  • Flow will request the defined resources and ask srun, mpirun, or the appropriate machine-specific MPI launcher to distribute processes, threads, memory, and GPUs to the appropriate resources (a sketch of this mapping follows this list).
  • Serial bundles are supported when all operations in the bundle have identical values for launcher, processes, threads_per_process, gpus_per_process, and memory_per_cpu. Flow will raise an error for any invocation of --bundle --parallel.
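
A rough sketch (not flow's actual template code) of how the proposed directives could map onto an srun prefix on a SLURM machine. The srun options used are standard; the helper function and the exact mapping are assumptions of this proposal.

# Hypothetical mapping from the proposed directives to an srun prefix.
# The srun options (--ntasks, --cpus-per-task, --gpus-per-task,
# --mem-per-cpu) are standard SLURM; the helper itself is illustrative only.
def srun_prefix(directives):
    d = {
        "processes": 1,
        "threads_per_process": 1,
        "gpus_per_process": 0,
        "memory_per_cpu": None,
        **directives,
    }
    args = ["srun", f"--ntasks={d['processes']}"]
    if d["threads_per_process"] > 1:
        args.append(f"--cpus-per-task={d['threads_per_process']}")
    if d["gpus_per_process"] > 0:
        args.append(f"--gpus-per-task={d['gpus_per_process']}")
    if d["memory_per_cpu"] is not None:
        args.append(f"--mem-per-cpu={d['memory_per_cpu']}")
    return " ".join(args)


# e.g. "srun --ntasks=8 --cpus-per-task=64"
print(srun_prefix({"launcher": "mpi", "processes": 8, "threads_per_process": 64}))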

launcher is a string to allow for potential future expansion to some non-MPI launcher capable of distributing processes to multiple nodes: see #220.

This refactor solves issues discussed in #777, #455, #115, #235.

joaander commented Nov 3, 2023

With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs and use that information to provide the user with an error.

Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Currently, users typically only need to set memory on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make that default be the maximum on systems where it is not.
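
A hedged sketch of both options discussed above: defaulting to the partition maximum and erroring when a request exceeds it. The partition table and helper names are hypothetical; real limits would come from the per-environment data discussed in #784.

# Hypothetical per-partition limits (GB per CPU); real values would come
# from the environment metadata proposed in #784.
MAX_MEMORY_PER_CPU = {"standard": 7, "largemem": 41}


def resolve_memory_per_cpu(requested, partition):
    """Default to the partition maximum; error when the request exceeds it."""
    maximum = MAX_MEMORY_PER_CPU[partition]
    if requested is None:
        return maximum
    if requested > maximum:
        raise ValueError(
            f"memory_per_cpu={requested}g exceeds the {maximum}g available per "
            f"CPU on partition '{partition}' without allocating extra CPUs."
        )
    return requested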

joaander commented Nov 3, 2023

Here are some example directives.

# serial
directives = {'processes': 1}

# multiprocessing and/or threaded app on a single node
directives = {'processes': 8}
directives = {'processes': 1, 'threads_per_process': 8}

# OpenMP application on a single node
directives = {'processes': 1, 'threads_per_process': 8}

# GPU application on a single node
directives = {'processes': 1, 'gpus_per_process': 1}

# MPI application on 1 or more nodes
directives = {'processes': 512, 'launcher': 'mpi'}
directives = {'processes': 512, 'gpus_per_process': 1, 'launcher': 'mpi'}

# Hybrid MPI/OpenMP application on 1 or more nodes
directives = {'processes': 8, 'threads_per_process': 64, 'launcher': 'mpi'}

@b-butler b-butler self-assigned this Nov 3, 2023
bdice commented Nov 4, 2023

@joaander I want to give my vote of support for this idea. The landscape of HPC clusters has continued to evolve since I was last actively involved in signac-flow's cluster templates. It seems things have solidified a bit more around core concepts and "directives" that are aligned with the above proposal. I am also generally appreciative and supportive of you proposing and pursuing significant changes like this. 👍

joaander commented Nov 6, 2023

@bdice Thank you for reviewing the proposal and your positive comments.

joaander commented Nov 6, 2023

> With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs and use that information to provide the user with an error.

To support this on GPU partitions we would also need a memory_per_gpu directive and to know the total memory per GPU on each GPU partition. The alternative would be to not attempt to warn users about memory usage and instead expect users to correctly set memory_per_cpu on GPU partitions in a way commensurate with their usage. For example with 64 GB available per GPU:

directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_cpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_cpu': '8g'}

vs.

directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}

In SLURM, --mem-per-cpu and --mem-per-gpu are mutually exclusive.
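
To illustrate that constraint, here is a small hypothetical helper that emits one sbatch memory option or the other, never both. The directive names follow this proposal; the function is not part of flow.

# Emit either --mem-per-cpu or --mem-per-gpu, never both, because SLURM
# treats the two options as mutually exclusive. Illustrative helper only.
def memory_option(directives):
    per_cpu = directives.get("memory_per_cpu")
    per_gpu = directives.get("memory_per_gpu")
    if per_cpu is not None and per_gpu is not None:
        raise ValueError("Set only one of memory_per_cpu or memory_per_gpu.")
    if per_cpu is not None:
        return f"#SBATCH --mem-per-cpu={per_cpu}"
    if per_gpu is not None:
        return f"#SBATCH --mem-per-gpu={per_gpu}"
    return ""


print(memory_option({"processes": 1, "gpus_per_process": 1, "memory_per_gpu": "64g"}))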

bcrawford39GT commented Nov 6, 2023

@joaander Thanks for adding this, as it will help support the Georgia Tech HPCs!

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive. I am not aware that we have a --gpus_per_process option, but if it can simply be omitted when needed, that would also be great!

b-butler commented Nov 6, 2023

> @joaander Thanks for adding this, as it will help support the Georgia Tech HPCs!
>
> For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive. I am not aware that we have a --gpus_per_process option, but if it can simply be omitted when needed, that would also be great!

@bcrawford39GT gpus_per_process isn't expected to be a scheduler setting here. It is an abstract request that flow instantiates into the appropriate commands for the submission script.

@b-butler b-butler mentioned this issue Nov 6, 2023
cbkerr commented Nov 9, 2023

Thank you! I was always confused by the differences between how flow does things and how the user guides for SLURM etc. describe things: processes, threads, ranks, oh my!

> Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Currently, users typically only need to set memory on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make that default be the maximum on systems where it is not.

If there is usually no cost to the amount of memory requested, I highly support this change to make flow easier to use. I know people who have had jobs confusingly canceled due to running out of memory.

Flow could print that it automatically selected the maximum allowed for the allocation, for instance:

Using environment configuration: Bridges2Environment
Selected max allowable memory $N GB per CPU.

bcrawford39GT commented Nov 10, 2023

@cbkerr Signac should likely support memory_per_cpu, as it is available in SLURM, and it will minimize maintenance later. Georgia Tech's system allows it. People may run heterogeneous processes with different CPU core counts and want to specify their RAM on a per-CPU basis.

There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC system or Cloud compute system you are using.

For Georgia Tech HPCs the --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive.

@joaander

> There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC system or Cloud compute system you are using.

We discussed this offline. We plan to make memory_per_cpu and memory_per_gpu available but, by default, set them to the maximum allowed by the selected partition without incurring extra charges. This default behavior should suit the vast majority of users. Specific environments may choose not to set these defaults if desired (e.g. a memory request is unnecessary for whole-node jobs).

Users that request more than the maximum will not only incur extra charges but may also end up with broken SLURM scripts. For example, I recently tested Purdue Anvil with --ntasks=16, --mem-per-cpu=2g. It turns out that the maximum memory is 1918m, so SLURM assigned my job 18 cores. In this configuration, neither mpirun -n 16 nor srun -n 16 was able to bind ranks to the expected 16 cores, and both threw errors.

Note that because Anvil automatically scales the CPU request with the memory request, there is no reason to ever request anything less than the maximum; doing so only risks out-of-memory errors in your job. The same goes for Bridges-2, which errors at submission time when you request more than the maximum.

On systems that both default to less than the maximum and allow users to oversubscribe memory and undersubscribe CPUs (Georgia Tech, UMich Great Lakes, the Expanse shared queue), users may wish to request less than the maximum (without incurring extra charges). However, the best a user can ever hope to achieve by doing so is to gain some goodwill with the rest of the system's user community, especially with those that request more than the maximum.

> For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive.

Yes, this is standard SLURM behavior. I do not recommend using --mem at all in signac-flow. It is a per-node quantity, and flow does not know (in all cases) at submission time exactly how many nodes SLURM will eventually schedule the job on.

@joaander

> Flow could print that it automatically selected the maximum allowed for the allocation, for instance:
>
> Using environment configuration: Bridges2Environment
> Selected max allowable memory $N GB per CPU.

Please, only when verbose output is requested, if at all. This information is in the --pretend output and can be verified there when needed.

@bcrawford39GT

@joaander Yeah, I think what you are saying makes sense. We just need to get rid of Signac's automatic printing of --mem, or provide a way to remove it from the output, and it should be good.

@joaander

The syntax described here informed the design of row.

Work on implementing this in flow was started in #819. I have no plans to finish that work myself.
