
running on multiple GPU is very slow #572

Closed
201power opened this issue Feb 9, 2016 · 37 comments

Comments

@201power

201power commented Feb 9, 2016

I am trying to run a 50-layer residual network on 4 K40m GPUs and it's very slow (same batch_size of 16 as on a single GPU); it takes 6 hours for 1 epoch. However, if I run it on 1 GPU the speed is normal.

System: CentOS, DIGITS v3, NVcaffe 0.14

BTW, I tried using GoogLeNet and it was fine on 4 GPUs.

Any suggestion or potential issue?

@lukeyeager
Member

Hi @201power, there are several things that could be going on:

  1. cuDNN v4 is significantly faster than v3. That makes scaling across multiple GPUs less effective because a single GPU is just a lot quicker than it used to be and the GPU communication becomes more of a bottleneck. When NCCL gets integrated into Caffe, that should help some to speed up the cross-GPU communication.
  2. Some networks may be more conducive to multi-GPU than others. It's possible that the architecture of your ResNet requires more communication and less computation than other networks like GoogLeNet.
  3. Or this could be a bug. How much of a slowdown are we talking about? If you use one GPU, how long does 1 epoch take? (A quick way to compare is sketched below.)
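One way to get a DIGITS-independent number is to time a short run with the caffe binary directly; a minimal sketch, with the solver path as a placeholder and assuming `caffe` is on your PATH:

```bash
# Minimal timing sketch (solver path is a placeholder; keep max_iter small
# in the solver so each run finishes quickly).
SOLVER=/path/to/solver.prototxt

time caffe train -solver $SOLVER -gpu 0        # single-GPU baseline
time caffe train -solver $SOLVER -gpu 0,1,2,3  # all four K40m
```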

@201power
Author

201power commented Feb 9, 2016

I tried cuDNN v3 and got the same speed: 1 epoch takes 4 minutes on 1 GPU and 6 hours on 4 GPUs.
I used the prototxt here:
https://github.com/201power/ResNet-Generator-for-caffe/blob/master/resnet50_trainval.prototxt

It contains two new layer types, batch_norm and scale, which are not used in the default GoogLeNet.

@lukeyeager
Member

When I try to run that network I get an error about Message type "caffe.LayerParameter" has no field named "scale_param".

NVcaffe 0.14 has the batch_norm layer type, but I guess you wrote this scale layer yourself?
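For what it's worth, a quick way to check whether a given Caffe tree defines that field is to grep its proto definition (the checkout path below is just a placeholder):

```bash
# Does this Caffe source tree define scale_param on LayerParameter?
# CAFFE_ROOT is a placeholder for wherever your checkout lives.
CAFFE_ROOT=~/caffe
grep -n "scale_param" "$CAFFE_ROOT/src/caffe/proto/caffe.proto"
```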

@lukeyeager
Member

Looks like you're getting the scale layer from BVLC/caffe#3591, right?

I'm assuming you hacked up DIGITS to allow BVLC/caffe? This is definitely not a supported configuration. Still happy to help if I can, though!

@201power
Author

201power commented Feb 9, 2016

Yes, I merged NVcaffe 0.14 with the Caffe master branch, which has the scale layer:
https://github.com/201power/caffe

I think it's an issue with my Caffe build. I'll debug. Thanks!

BTW, does NVcaffe plan to integrate the scale layer?

@lukeyeager
Member

> I think it's an issue with my Caffe build. I'll debug. Thanks!

Let me know when you resolve this issue - I hope it will be a simple configuration problem, but I'd like to know for sure that it's not a cuDNN or multi-GPU bug.

> BTW, does NVcaffe plan to integrate the scale layer?

NVcaffe will pick up the new scale layer (among other things) when we start working on our next release (see comment on release methodology here). We don't really pull in new features in-between releases - just major bugfixes.

@201power
Author

201power commented Feb 9, 2016

Thanks. Now I am using the original caffe-0.14 with GoogLeNet; however, it has become very slow as well.
The exact same job took 52 min using 4 GPUs last time with exactly the same software (DIGITS/Caffe/cuDNN).
[screenshot]

GPU status:
[screenshot]

Also, I experience a soft lockup every time I try to abort a multi-GPU job in DIGITS. Do you get the same message as well?

@lukeyeager
Member

Hmm, that looks like trouble. How recent is your NVIDIA driver?

What happens when you try 2 GPUs instead of 1 or 4?

@201power
Author

201power commented Feb 9, 2016

I tried 2 GPUs and it's also very slow. The NVIDIA driver is 352.79.
1 GPU always works fine.

Anything else I can try/check?

@lukeyeager
Member

I can go ask some people who would know more about this. A few questions before I do:

  1. You're using NVIDIA/caffe@caffe-0.14, right?
  2. You're using cuDNN v4 (4.0.4), right?
  3. Do you see the soft lockup error every time? Or does it only show up intermittently?

Can you try building without cuDNN at all and see if that makes a difference? I'm just spitballing here.
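Another cheap check, if you haven't already: look at the GPU topology and PCIe link status, assuming your nvidia-smi build supports the topo subcommand:

```bash
# Inter-GPU connection matrix (P2P / PCIe hops between the K40m boards);
# `topo -m` may not be available on every driver version.
nvidia-smi topo -m

# Current vs. max PCIe link width for each GPU
nvidia-smi -q | grep -A 2 "Link Width"
```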

@201power
Author

201power commented Feb 9, 2016

  1. Yes.
  2. Yes.
  3. Yes, every time I abort a multi-GPU job I see a soft lockup (not for single-GPU jobs).

I tried building Caffe without cuDNN and it has the same issue; still very slow.
Thanks.

@thatguymike

In the multi-GPU path there is synchronization between the GPUs that can cause delays, but I haven't seen those messages before. Do you have enough disk IO to feed all the GPUs? E.g., do you see messages about waiting for data in the log output?

@201power
Author

I do see one "waiting for data" message in the Caffe log, during network loading. However, it does not seem to cause delays.
[screenshot]

I have 377 GB of disk space available for the user, so it should be fine.

@thatguymike

It's not a question of disk space, but of IO throughput. As you add GPUs, the pressure on the disk goes up. However, I haven't seen soft lockup errors before. What motherboard chipset is this, e.g. Intel X99?
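For example, something along these lines (the dataset path is a placeholder):

```bash
# Rough sequential-read throughput against the drive holding the dataset.
# Drop the page cache first so the number is honest.
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
dd if=/path/to/dataset/data.mdb of=/dev/null bs=1M

# Motherboard / chipset identification
sudo dmidecode -s baseboard-product-name
lspci | grep -i bridge | head
```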

@thatguymike

What version of Linux and kernel revision? Anything in system logs from NVRM?
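For reference, the relevant bits can be pulled with:

```bash
# Kernel and distro version
uname -r
cat /etc/centos-release

# Driver (NVRM) messages and any soft-lockup traces
dmesg | grep -iE 'nvrm|soft lockup'
sudo grep -iE 'nvrm|soft lockup' /var/log/messages | tail -n 50
```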

@201power
Author

The motherboard is a Supermicro X10DRG-0T.
Linux kernel: 3.10.0-229.el7.x86_64

@thatguymike

K40s?

@201power
Author

K40m. I checked /var/log/messages and here is part of what I found related to NVRM:

@lukeyeager
Member

Can you try rolling back to your previously installed driver?

@201power
Author

It's a fresh computer, so this is the only driver version we have installed.
We can try a previous driver version, though.

When running DIGITS with 4 GPUs, 2 of the GPUs always sit at 0% utilization. Is this normal?
[screenshot]

@lukeyeager
Member

Don't worry about it then. So this was working for you earlier this week and you didn't uninstall or reinstall a driver or anything like that?

I'm working on trying to reproduce this bug from my end, but it may take me a while to get Caffe built on CentOS since I haven't done that before...

@201power
Author

Yup, there have been no changes to the software or hardware since the last working job.
Yes, installing Caffe on CentOS is a little time-consuming; make sure you use OpenBLAS.

@lukeyeager
Member

And when you go back to your original build, GoogLeNet still works?

@lukeyeager
Member

Quick smoketest on Ubuntu passed. The scaling is bad but I'm not seeing any NVRM lines in dmesg.

OS: Ubuntu 14.04
GPUs: 2 x K40c
Driver: 352.79
Caffe: NVcaffe v0.14.2
cuDNN: 4.0.7

1 GPU: 1min 53sec
2 GPU: 2min 29sec

Time to try it on CentOS...

@201power
Author

GoogLeNet does not work anymore when I go back to the original build.

@lukeyeager
Member

Smoketest on CentOS also passed.

OS: CentOS 7
GPUs: 2 x K40c
Driver: 352.79
Caffe: NVcaffe v0.14.2

1 GPU: 1min 4sec
2 GPU: 50sec

(It's faster because I chose a different dataset and it's processing fewer images)

@201power
Author

Is it possible that the Supermicro chipset is causing the issue?

@lukeyeager
Member

Let's investigate whether you might have a bad GPU. Can you try this:

  1. Do single-GPU training on each of your 4 GPUs
  2. Do double-GPU training for each pair among your 4 GPUs (6 combinations total); a loop like the sketch below could automate this
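If it helps, something along these lines would cover the pairwise runs (the solver path is a placeholder; keep max_iter small so each run is short):

```bash
# Short training run on every pair of the 4 GPUs to spot a bad board or slot.
SOLVER=/path/to/solver.prototxt

for pair in 0,1 0,2 0,3 1,2 1,3 2,3; do
    echo "=== GPUs $pair ==="
    time caffe train -solver "$SOLVER" -gpu "$pair"
done
```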

@lukeyeager
Member

> Is it possible that the Supermicro chipset is causing the issue?

Sure, that's possible too.

@201power
Author

I actually have 8x K40m; training on a single GPU is fine:

However, training on two GPUs becomes very slow, whichever GPUs I use.

@lukeyeager
Member

And you say that multi-GPU was running at a reasonable speed just a few days ago? Can you think of anything relevant that you might have changed on the machine since then?

@lukeyeager
Member

FYI, this Dockerfile is enough to get Caffe built on CentOS7 (I had to do a bunch more hackery to build pycaffe and DIGITS, though).

https://gist.github.com/lukeyeager/fc0b21a62fca7b3edb24

@201power
Author

Nothing I can think of changed... just a reboot.
When I first installed Caffe, it ran slow on multiple GPUs. There was only one time when it ran fast on 4 GPUs.

Thanks for the Dockerfile.

@lukeyeager
Member

Let's take this offline for now. I sent you an email.

@201power
Author

I did not receive the email. Did you send it to ?

@lukeyeager
Member

In the end, this was a duplicate of NVIDIA/caffe#10.

@TimZaman
Contributor

@lukeyeager Does the fact that we have to use an NVIDIA fork of Caffe mean that we cannot use vanilla ResNets with Caffe? Has anyone got experience with that? (ResNets work fine with the latest Torch built from source.)
