42 multi gpu #49
Conversation
…ion with multiple GPUs using data parallelism
Since the GPU-based CI is in place, we could remove the non-GPU CI previously configured here: MONAI/.github/workflows/pythonapp.yml, lines 28 to 30 in 7b2d73f
```python
if devices is None:
    devices = [torch.device('cuda:%i' % d) for d in range(torch.cuda.device_count())]
elif len(devices) == 0:
```
You should change this to "if" instead of "elif".
Consider the case where a machine doesn't have a GPU and devices was None in the call: your statement at L52 will return an empty list, which will cause a problem at L59 with "devices[0]".
As this file is "multi_gpu_supervised_trainer.py", I think we should only support multi-GPU here.
Just assert something like:

```python
if devices is None:
    devices = [torch.device('cuda:%i' % d) for d in range(torch.cuda.device_count())]
assert len(devices) > 1, 'must have more than 1 GPU device.'
...
```

What do you think?
Thanks.
In the case where there are no GPUs but they were requested, the list will be empty and the failure will occur on line 59. Leaving an error to propagate from elsewhere is sort of the Pythonic way, but instead we should raise an error here. An assert isn't appropriate, as we're essentially checking input rather than an internal property our own logic should enforce if correct. I've committed changes to reflect this. I've also added code to the multi-GPU unit test to suppress the warning about GPU memory imbalance; I'd say unit tests should be silent unless they fail, but maybe the warning should still be allowed, or at least logged.
@Nic-Ma It's harmless to support only 1 GPU. If people have code parametrized by the number of GPUs to use, they'd have to choose which function to call based on a check of whether that number was 1 or not; it's easier if this isn't restricted in that way.
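For illustration, a minimal sketch of how such a test-time suppression might look, assuming the standard warnings module; the message pattern filtered here is a guess at PyTorch's imbalance warning, not code taken from this PR:

```python
import warnings

import torch.nn as nn


def run_parallel_forward(net, batch):
    # silence PyTorch's GPU memory imbalance warning for the duration of the test
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=".*imbalance between your GPUs.*")
        parallel_net = nn.DataParallel(net)  # replicates across all visible GPUs
        return parallel_net(batch)
```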
Hi @ericspod,
Thanks for your explanation.
If you want to support both CPU and single/multi-GPU logic here, why not rename the function to some general-purpose trainer? Maybe:
```python
def create_supervised_trainer(devices):
    if devices is None:
        devices = [torch.device('cuda:%i' % d) for d in range(torch.cuda.device_count())]
    if len(devices) == 0:
        devices = [torch.device("cpu")]
    ...
```
And about your new util function get_devices_spec(): I don't think it's a good idea to use an empty list as the parameter for the CPU device, it's confusing. Maybe add another flag for CPU/GPU directly?
Thanks.
@Nic-Ma What do you suggest for the parameter instead?
We could use strings like "multi-gpu", "gpu", and "cpu", but if the user passes "cuda:0" what do we do? We could enforce using only our defined names, but that seems too restrictive.
Thanks for your comments. I'd prefer this strategy:
```python
def create_supervised_trainer(devices=None):
    if devices is None:
        devices = [torch.device('cuda:%i' % d) for d in range(torch.cuda.device_count())]
        if len(devices) == 0:
            devices = [torch.device("cpu")]
    else:
        # use devices parameter directly.
        ...

# use cases:
trainer = create_supervised_trainer()  # automatically select devices
trainer = create_supervised_trainer(devices=[torch.device("cpu")])
trainer = create_supervised_trainer(devices=[torch.device("cuda:0")])
trainer = create_supervised_trainer(devices=[torch.device("cuda:0"), torch.device("cuda:1")])
```
- If the user passes something through the "devices" parameter, let's use it directly. We can add some sanity checks in a later version.
- If no devices are provided, we try to use all GPUs first; if no GPU is found, use the CPU instead.
What do you think?
Thanks.
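To make the proposal concrete, here is a runnable sketch of that selection order; the helper name choose_devices is hypothetical (the util function discussed in this thread is get_devices_spec) and the body is illustrative rather than the committed code:

```python
import torch


def choose_devices(devices=None):
    # hypothetical helper implementing the fallback order proposed above
    if devices is None:
        # try all visible GPUs first, then fall back to the CPU
        devices = [torch.device("cuda:%i" % d) for d in range(torch.cuda.device_count())]
        if len(devices) == 0:
            devices = [torch.device("cpu")]
    # an explicit devices argument is used exactly as given
    return devices


print(choose_devices())                       # automatic selection
print(choose_devices([torch.device("cpu")]))  # explicit CPU
```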
@Nic-Ma A little verbose but seems fine. However, the last one becomes redundant: PyTorch will create a parallel context on all GPUs. The responsibility is on the user to select GPUs through CUDA_VISIBLE_DEVICES.
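As a hedged illustration of that responsibility (not code from this PR): CUDA_VISIBLE_DEVICES has to be set before CUDA is initialised, so in a script it must come before the first use of torch.cuda:

```python
import os

# expose only physical GPUs 0 and 2 to this process; PyTorch will see them
# as cuda:0 and cuda:1 (this must run before CUDA is initialised)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"

import torch

print(torch.cuda.device_count())  # 2 on a machine where GPUs 0 and 2 exist
```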
For the name, I used a different one to avoid clashing with the one from Ignite; it should be obvious which is being used when looking at the source code.
Choosing to use CPU computation silently when no GPU is present is going to cause people to use the CPU when they didn't expect it. There should be a loud and clear error when something requested isn't possible.
```python
if devices is None:
    devices = [torch.device('cuda:%i' % d) for d in range(torch.cuda.device_count())]
elif len(devices) == 0:
```
@anfeng's comment for L53 applies here too. Perhaps we should move this logic to some util function?
Added comments inline.
Thanks.
```python
if devices is None:
    devices = [torch.device('cuda:%i' % d) for d in range(torch.cuda.device_count())]

if len(devices) == 0:
```
If no GPU is found, we should default to CPU.
As I mentioned elsewhere, defaulting to the CPU like that will cause silent errors when people expect to use GPUs. If people want GPUs they should get a loud and clear error that they can't get them; otherwise they'll think everything is fine, just super slow.
```python
        raise ValueError("No GPU devices available")

elif len(devices) == 0:
    devices = [torch.device("cpu")]
```
I'd just suggest printing a warning here if the CPU is used instead, since this code file is for "multi_gpu_trainer". What do you think?
People may not know that "devices = empty list" means "the CPU device".
Everything else looks good to me.
Thanks.
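A small sketch of that suggestion, assuming the standard warnings module; the helper name and message wording are illustrative, not the PR's code:

```python
import warnings

import torch


def devices_or_cpu(devices):
    # hypothetical helper: an explicit empty list falls back to the CPU, loudly
    if len(devices) == 0:
        warnings.warn("Empty device list given; falling back to the CPU device.")
        devices = [torch.device("cpu")]
    return devices
```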
This adds a function for creating a supervised learner and evaluator in the same style as Ignite. An example notebook and unit tests are provided. The tests will not execute a multi-GPU test correctly if the host has only one GPU; a way to emulate multiple GPUs would be helpful.
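Putting the thread together, a rough sketch of the overall shape being described; the function name follows the file name quoted above, Ignite's standard create_supervised_trainer factory is assumed, and the body is a guess rather than the PR's committed code:

```python
import torch
import torch.nn as nn
from ignite.engine import create_supervised_trainer


def create_multigpu_supervised_trainer(net, optimizer, loss_fn, devices=None):
    # choose all visible GPUs when no devices are given, as discussed above
    if devices is None:
        devices = [torch.device("cuda:%i" % d) for d in range(torch.cuda.device_count())]
        if len(devices) == 0:
            raise ValueError("No GPU devices available")
    elif len(devices) == 0:
        devices = [torch.device("cpu")]
    if len(devices) > 1:
        # replicate the network across the chosen GPUs with data parallelism
        net = nn.DataParallel(net, device_ids=devices)
    # delegate the rest to Ignite's single-device factory
    return create_supervised_trainer(net, optimizer, loss_fn, device=devices[0])
```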