Added multi-GPU recipe. #27

Closed · wants to merge 1 commit



berceanu:

Added recipe for running operations in parallel on multiple GPUs.

vyasr (Contributor) commented Oct 3, 2019:

@joaander I believe CUDA_VISIBLE_DEVICES will only work for certain configurations, correct? Is it the recommended solution on modern GPU configurations? Are there caveats we should include here?

joaander (Member) left a review:

Unfortunately, there is no general way to solve the problem of GPU scheduling.

There are some specific cases where one can schedule bundled signac-flow operations to GPUs in a mostly general way:

  • If you use the SLURM scheduler, you can use srun to schedule tasks to individual GPUs within a job (see the sketch after this list)
  • If you run on Summit, use jsrun
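
For the SLURM case, a minimal sketch of what this can look like inside a batch script; the resource flags and the placeholder commands ``task_0`` … ``task_3`` are illustrative, and exact flag names vary by SLURM version and site configuration:

    #!/bin/bash
    #SBATCH --ntasks=4
    #SBATCH --gres=gpu:4

    # Ask SLURM to hand each job step exactly one of the allocated GPUs,
    # so the steps run concurrently without sharing a device.
    for task in task_0 task_1 task_2 task_3; do
        srun --ntasks=1 --gres=gpu:1 --exclusive "$task" &
    done
    wait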

The recipe proposed in this PR:

How to run parallel tasks on a multi-GPU machine
================================================
When running **signac-flow** on a multi-GPU system via ``python project.py run --parallel``, all parallel tasks are sent to the same GPU. To run a single task per GPU while using all GPUs at the same time, use
``python project.py submit --bundle=N --parallel --test | /bin/bash``, where ``N`` is the number of (free) GPUs on the machine. The ``--test`` switch generates the submission script, which is then piped to the ``bash`` interpreter for execution. To inspect the script before running it, redirect it to a file instead.
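
For example, with four GPUs one might write the generated script to a file for inspection first (``script_preview.sh`` is just an illustrative name):

    python project.py submit --bundle=4 --parallel --test > script_preview.sh
    # review the script, then execute it:
    bash script_preview.sh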

joaander (inline review comment on the paragraph above):

This implementation will not choose among free GPUs; it will always choose GPUs 0, 1, ..., N-1 whether they are free or not.

For this recipe to work, your project folder must contain a ``templates/script.sh`` file with the following contents:

    {% set cmd_suffix = cmd_suffix|default('') ~ (' &' if parallel else '') %}
    {% for operation in operations %}
    export CUDA_VISIBLE_DEVICES={{ loop.index0 }}
    {{ operation.cmd }}{{ cmd_suffix }}
    {% endfor %}
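
For two bundled operations, the generated script body would expand to roughly the following shape (the angle-bracket placeholders stand in for the actual operation command lines):

    export CUDA_VISIBLE_DEVICES=0
    <command for operation 1> &
    export CUDA_VISIBLE_DEVICES=1
    <command for operation 2> &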

joaander (inline review comment on the ``CUDA_VISIBLE_DEVICES`` line):

CUDA_VISIBLE_DEVICES should not be used on systems that already make use of it for scheduling (i.e. SDSC Comet). It is less general, but a workable solution is to pass the loop index into the operation and select that GPU with the API of whatever tool the operation invokes.
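
A minimal sketch of that alternative, assuming a made-up environment variable ``SIGNAC_GPU_INDEX`` (not a signac-flow feature) that each operation reads and forwards to its tool's device-selection API:

    {% set cmd_suffix = cmd_suffix|default('') ~ (' &' if parallel else '') %}
    {% for operation in operations %}
    {# Pass the loop index to the operation instead of masking devices;
       the operation itself selects the GPU through the tool's API. #}
    SIGNAC_GPU_INDEX={{ loop.index0 }} {{ operation.cmd }}{{ cmd_suffix }}
    {% endfor %}

The operation would then read the index (e.g. from ``os.environ``) and pass it to the simulation engine's device argument, leaving any cluster-managed ``CUDA_VISIBLE_DEVICES`` untouched.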

bdice (Member) commented Oct 10, 2019:

Hey @berceanu, thanks for the PR. Unfortunately, there isn't a very general solution for this problem.

In the future, it might be possible to suggest using a cluster-specific run utility where supported, e.g. srun on SLURM or jsrun on Summit. Those runner utilities are aware of the resources they have available and will make appropriate decisions about GPU arrangements. Also, @joaander's advice to defer to the GPU application and allow it to select a GPU (based on the loop index) is another possible solution.

Since CUDA_VISIBLE_DEVICES can cause problems (or override clusters' intended behavior), I am hesitant to give potentially bad advice to the user. We've run into a lot of problems over time with issues related to this on various clusters. For now, I recommend that we close this PR. Feel free to re-open if you have other ideas.
