Added multi-GPU recipe. #27

Closed · wants to merge 1 commit



berceanu:

Added recipe for running operations in parallel on multiple GPUs.

vyasr (Contributor) commented Oct 3, 2019:

@joaander I believe CUDA_VISIBLE_DEVICES will only work for certain configurations, correct? Is it the recommended solution on modern GPU configurations? Are there caveats we should include here?

joaander (Member) left a review:

Unfortunately, there is no general way to solve the problem of GPU scheduling.

There are some specific cases where one can schedule bundled signac-flow operations to GPUs in a mostly general way:

  • If you use the SLURM scheduler, you can use srun to schedule tasks to individual GPUs within a job (see the sketch after this list)
  • If you run on Summit, use jsrun
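
For the SLURM case, a minimal sketch of what this can look like inside a batch script; the resource flags and the placeholder commands ``task_0`` … ``task_3`` are illustrative, and exact flag names vary by SLURM version and site configuration:

    #!/bin/bash
    #SBATCH --ntasks=4
    #SBATCH --gres=gpu:4

    # Ask SLURM to hand each job step exactly one of the allocated GPUs,
    # so the steps run concurrently without sharing a device.
    for task in task_0 task_1 task_2 task_3; do
        srun --ntasks=1 --gres=gpu:1 --exclusive "$task" &
    done
    wait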

The recipe proposed in this PR:

How to run parallel tasks on a multi-GPU machine
================================================
When running **signac-flow** on a multi-GPU system via ``python project.py run --parallel``, all parallel tasks are sent to the same GPU. To run a single task per GPU while using all GPUs at the same time, use
``python project.py submit --bundle=N --parallel --test | /bin/bash``, where ``N`` is the number of (free) GPUs on the machine. The ``--test`` switch generates the submission script, which is then piped to the ``bash`` interpreter for execution. To inspect the script before running it, redirect it to a file instead.
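
For example, with four GPUs one might write the generated script to a file for inspection first (``script_preview.sh`` is just an illustrative name):

    python project.py submit --bundle=4 --parallel --test > script_preview.sh
    # review the script, then execute it:
    bash script_preview.sh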

joaander (inline review comment on the paragraph above):

This implementation will not choose among free GPUs; it will always choose GPUs 0, 1, ..., N-1 whether they are free or not.

For this recipe to work, your project folder must contain a ``templates/script.sh`` file with the following contents:

    {% set cmd_suffix = cmd_suffix|default('') ~ (' &' if parallel else '') %}
    {% for operation in operations %}
    export CUDA_VISIBLE_DEVICES={{ loop.index0 }}
    {{ operation.cmd }}{{ cmd_suffix }}
    {% endfor %}
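
For two bundled operations, the generated script body would expand to roughly the following shape (the angle-bracket placeholders stand in for the actual operation command lines):

    export CUDA_VISIBLE_DEVICES=0
    <command for operation 1> &
    export CUDA_VISIBLE_DEVICES=1
    <command for operation 2> &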

joaander (inline review comment on the ``CUDA_VISIBLE_DEVICES`` line):

CUDA_VISIBLE_DEVICES should not be used on systems that already make use of it for scheduling (i.e. SDSC Comet). It is less general, but a workable solution is to pass the loop index into the operation and select that GPU with the API of whatever tool the operation invokes.
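
A minimal sketch of that alternative, assuming a made-up environment variable ``SIGNAC_GPU_INDEX`` (not a signac-flow feature) that each operation reads and forwards to its tool's device-selection API:

    {% set cmd_suffix = cmd_suffix|default('') ~ (' &' if parallel else '') %}
    {% for operation in operations %}
    {# Pass the loop index to the operation instead of masking devices;
       the operation itself selects the GPU through the tool's API. #}
    SIGNAC_GPU_INDEX={{ loop.index0 }} {{ operation.cmd }}{{ cmd_suffix }}
    {% endfor %}

The operation would then read the index (e.g. from ``os.environ``) and pass it to the simulation engine's device argument, leaving any cluster-managed ``CUDA_VISIBLE_DEVICES`` untouched.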

bdice (Member) commented Oct 10, 2019:

Hey @berceanu, thanks for the PR. Unfortunately, there isn't a very general solution for this problem.

In the future, it might be possible to suggest using a cluster-specific run utility where supported, e.g. srun on SLURM or jsrun on Summit. Those runner utilities are aware of the resources they have available and will make appropriate decisions about GPU arrangements. Also, @joaander's advice to defer to the GPU application and allow it to select a GPU (based on the loop index) is another possible solution.

Since CUDA_VISIBLE_DEVICES can cause problems (or override clusters' intended behavior), I am hesitant to give potentially bad advice to the user. We've run into a lot of problems over time with issues related to this on various clusters. For now, I recommend that we close this PR. Feel free to re-open if you have other ideas.
