
running on multiple GPU is very slow #572

Closed
201power opened this issue Feb 9, 2016 · 37 comments

Comments

@201power

201power commented Feb 9, 2016

I am trying to run a 50-layer residual network on 4 K40m GPUs and it's very slow (same batch_size of 16 as on a single GPU); it takes 6 hours for 1 epoch. However, if I run it on 1 GPU the speed is normal.

System: CentOS, DIGITS v3, NVcaffe 0.14

BTW, I tried using GoogLeNet and it was fine on 4 GPUs.

Any suggestion or potential issue?

@lukeyeager
Member

Hi @201power, there are several things that could be going on:

  1. cuDNN v4 is significantly faster than v3. That makes scaling across multiple GPUs less effective because a single GPU is just a lot quicker than it used to be and the GPU communication becomes more of a bottleneck. When NCCL gets integrated into Caffe, that should help some to speed up the cross-GPU communication.
  2. Some networks may be more conducive to multi-GPU than others. It's possible that the architecture of your ResNet requires more communication and less computation than other networks like GoogLeNet.
  3. Or this could be a bug. How much of a slowdown are we talking about? If you use one GPU, how long does 1 epoch take? (A quick way to compare is sketched below.)
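One way to get a DIGITS-independent number is to time a short run with the caffe binary directly; a minimal sketch, with the solver path as a placeholder and assuming `caffe` is on your PATH:

```bash
# Minimal timing sketch (solver path is a placeholder; keep max_iter small
# in the solver so each run finishes quickly).
SOLVER=/path/to/solver.prototxt

time caffe train -solver $SOLVER -gpu 0        # single-GPU baseline
time caffe train -solver $SOLVER -gpu 0,1,2,3  # all four K40m
```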

@201power
Author

201power commented Feb 9, 2016

I tried cuDNN v3 and got the same speed: 1 epoch takes 4 minutes on 1 GPU and 6 hours on 4 GPUs.
I used the prototxt here:
https://github.com/201power/ResNet-Generator-for-caffe/blob/master/resnet50_trainval.prototxt

It contains two new layer types, batch_norm and scale, which are not used in the default GoogLeNet.

@lukeyeager
Member

When I try to run that network I get an error about Message type "caffe.LayerParameter" has no field named "scale_param".

NVcaffe 0.14 has the batch_norm layer type, but I guess you wrote this scale layer yourself?
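For what it's worth, a quick way to check whether a given Caffe tree defines that field is to grep its proto definition (the checkout path below is just a placeholder):

```bash
# Does this Caffe source tree define scale_param on LayerParameter?
# CAFFE_ROOT is a placeholder for wherever your checkout lives.
CAFFE_ROOT=~/caffe
grep -n "scale_param" "$CAFFE_ROOT/src/caffe/proto/caffe.proto"
```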

@lukeyeager
Member

Looks like you're getting the scale layer from BVLC/caffe#3591, right?

I'm assuming you hacked up DIGITS to allow BVLC/caffe? This is definitely not a supported configuration. Still happy to help if I can, though!

@201power
Author

201power commented Feb 9, 2016

Yes, I merged NVcaffe 0.14 with the Caffe master branch, which has the scale layer:
https://github.com/201power/caffe

I think it's an issue with my Caffe build. I'll debug. Thanks!

BTW, does NVcaffe plan to integrate the scale layer?

@lukeyeager
Member

> I think it's an issue with my Caffe build. I'll debug. Thanks!

Let me know when you resolve this issue - I hope it will be a simple configuration problem, but I'd like to know for sure that it's not a cuDNN or multi-GPU bug.

> BTW, does NVcaffe plan to integrate the scale layer?

NVcaffe will pick up the new scale layer (among other things) when we start working on our next release (see comment on release methodology here). We don't really pull in new features in-between releases - just major bugfixes.

@201power
Author

201power commented Feb 9, 2016

Thanks. Now I am using the original caffe-0.14 with GoogLeNet; however, it has become very slow as well.
The exact same job took 52 min using 4 GPUs last time with exactly the same software (DIGITS/Caffe/cuDNN).
[screenshot]

GPU status:
[screenshot]

Also, I experience a soft lockup every time I try to abort a multi-GPU job in DIGITS. Do you get the same message as well?

@lukeyeager
Member

Hmm, that looks like trouble. How recent is your NVIDIA driver?

What happens when you try 2 GPUs instead of 1 or 4?

@201power
Author

201power commented Feb 9, 2016

I tried 2 GPUs and it's also very slow. The NVIDIA driver is 352.79.
1 GPU always works fine.

Anything else I can try/check?

@lukeyeager
Member

I can go ask some people who would know more about this. A few questions before I do:

  1. You're using NVIDIA/caffe@caffe-0.14, right?
  2. You're using cuDNN v4 (4.0.4), right?
  3. Do you see the soft lockup error every time? Or does it only show up intermittently?

Can you try building without cuDNN at all and see if that makes a difference? I'm just spitballing here.
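Another cheap check, if you haven't already: look at the GPU topology and PCIe link status, assuming your nvidia-smi build supports the topo subcommand:

```bash
# Inter-GPU connection matrix (P2P / PCIe hops between the K40m boards);
# `topo -m` may not be available on every driver version.
nvidia-smi topo -m

# Current vs. max PCIe link width for each GPU
nvidia-smi -q | grep -A 2 "Link Width"
```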

@201power
Author

201power commented Feb 9, 2016

  1. Yes.
  2. Yes.
  3. Yes, every time I abort a multi-GPU job I see a soft lockup (not for single-GPU jobs).

I tried building Caffe without cuDNN and it has the same issue; still very slow.
Thanks.

@thatguymike

In the multi-GPU path there is synchronization between the GPUs that can cause delays, but I haven't seen those messages before. Do you have enough disk IO to feed all the GPUs? E.g., do you see messages about waiting for data in the log output?

@201power
Author

I do see one "waiting for data" message in the Caffe log, during network loading. However, it does not seem to cause delays.
[screenshot]

I have 377 GB of disk space available for the user, so it should be fine.

@thatguymike

It's not a question of disk space, but of IO throughput. As you add GPUs, the pressure on the disk goes up. However, I haven't seen soft lockup errors before. What motherboard chipset is this, e.g. Intel X99?
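For example, something along these lines (the dataset path is a placeholder):

```bash
# Rough sequential-read throughput against the drive holding the dataset.
# Drop the page cache first so the number is honest.
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
dd if=/path/to/dataset/data.mdb of=/dev/null bs=1M

# Motherboard / chipset identification
sudo dmidecode -s baseboard-product-name
lspci | grep -i bridge | head
```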

@thatguymike

What version of Linux and kernel revision? Anything in system logs from NVRM?
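For reference, the relevant bits can be pulled with:

```bash
# Kernel and distro version
uname -r
cat /etc/centos-release

# Driver (NVRM) messages and any soft-lockup traces
dmesg | grep -iE 'nvrm|soft lockup'
sudo grep -iE 'nvrm|soft lockup' /var/log/messages | tail -n 50
```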

@201power
Author

The motherboard is a Supermicro X10DRG-0T.
Linux kernel: 3.10.0-229.el7.x86_64

@thatguymike

K40s?

@201power
Author

K40m. I checked /var/log/messages and here is part of what I found related to NVRM:

@lukeyeager
Member

Can you try rolling back to your previously installed driver?

@201power
Author

It's a fresh computer, so this is the only driver version we have installed.
We can try a previous driver version, though.

When running DIGITS with 4 GPUs, 2 of the GPUs always sit at 0% utilization. Is this normal?
[screenshot]

@lukeyeager
Member

Don't worry about it then. So this was working for you earlier this week and you didn't uninstall or reinstall a driver or anything like that?

I'm working on trying to reproduce this bug from my end, but it may take me a while to get Caffe built on CentOS since I haven't done that before...

@201power
Author

Yup, there have been no changes to the software or hardware since the last working job.
Yes, installing Caffe on CentOS is a little time-consuming; make sure you use OpenBLAS.

@lukeyeager
Member

And when you go back to your original build, GoogLeNet still works?

@lukeyeager
Member

Quick smoketest on Ubuntu passed. The scaling is bad but I'm not seeing any NVRM lines in dmesg.

OS: Ubuntu 14.04
GPUs: 2 x K40c
Driver: 352.79
Caffe: NVcaffe v0.14.2
cuDNN: 4.0.7

1 GPU: 1min 53sec
2 GPU: 2min 29sec

Time to try it on CentOS...

@201power
Author

GoogLeNet does not work anymore when I go back to the original build.

@lukeyeager
Member

Smoketest on CentOS also passed.

OS: CentOS 7
GPUs: 2 x K40c
Driver: 352.79
Caffe: NVcaffe v0.14.2

1 GPU: 1min 4sec
2 GPU: 50sec

(It's faster because I chose a different dataset and it's processing fewer images)

@201power
Author

Is it possible that the Supermicro chipset is causing the issue?

@lukeyeager
Member

Let's investigate whether you might have a bad GPU. Can you try this:

  1. Do single-GPU training on each of your 4 GPUs
  2. Do double-GPU training for each pair among your 4 GPUs (6 combinations total); a loop like the sketch below could automate this
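If it helps, something along these lines would cover the pairwise runs (the solver path is a placeholder; keep max_iter small so each run is short):

```bash
# Short training run on every pair of the 4 GPUs to spot a bad board or slot.
SOLVER=/path/to/solver.prototxt

for pair in 0,1 0,2 0,3 1,2 1,3 2,3; do
    echo "=== GPUs $pair ==="
    time caffe train -solver "$SOLVER" -gpu "$pair"
done
```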

@lukeyeager
Member

> Is it possible that the Supermicro chipset is causing the issue?

Sure, that's possible too.

@201power
Author

I actually have 8x K40m; training on a single GPU is fine:

However, training on two GPUs becomes very slow, whichever GPUs I use.

@lukeyeager
Member

And you say that multi-GPU was running at a reasonable speed just a few days ago? Can you think of anything relevant that you might have changed on the machine since then?

@lukeyeager
Member

FYI, this Dockerfile is enough to get Caffe built on CentOS7 (I had to do a bunch more hackery to build pycaffe and DIGITS, though).

https://gist.github.com/lukeyeager/fc0b21a62fca7b3edb24

@201power
Author

Nothing I can think of changed... just a reboot.
When I first installed Caffe, it ran slow on multiple GPUs. There was only one time when it ran fast on 4 GPUs.

Thanks for the Dockerfile.

@lukeyeager
Member

Let's take this offline for now. I sent you an email.

@201power
Author

I did not receive the email. Did you send it to ?

@lukeyeager
Member

In the end, this was a duplicate of NVIDIA/caffe#10.

@TimZaman
Contributor

@lukeyeager Does the fact that we have to use an NVIDIA fork of Caffe mean that we cannot use vanilla ResNets with Caffe? Has anyone got experience with that? (ResNets work fine with the latest Torch built from source.)
