Running on multiple GPUs is very slow #572
Comments
Hi @201power, there are several things that could be going on.
Tried cuDNN v3, still the same speed: 1 epoch on 1 GPU takes 4 minutes, on 4 GPUs it takes 6 hours. It contains two new layers, batch_norm and scale, which are not used in the default GoogLeNet.
When I try to run that network I get an error about the scale layer, which NVcaffe 0.14 doesn't have.
Looks like you're getting the scale layer from BVLC/caffe#3591, right? I'm assuming you hacked up DIGITS to allow BVLC/caffe? This is definitely not a supported configuration. Still happy to help if I can, though!
Yes, I merged NVcaffe 0.14 with the caffe master branch, which has the scale layer. I think it's an issue with my caffe build. I'll debug. Thanks! BTW, does NVcaffe plan to integrate the scale layer?
Let me know when you resolve this issue - I hope it will be a simple configuration problem, but I'd like to know for sure that it's not a cuDNN or multi-GPU bug.
NVcaffe will pick up the new scale layer (among other things) when we start working on our next release (see comment on release methodology here). We don't really pull in new features in between releases - just major bugfixes.
Hmm, that looks like trouble. How recent is your NVIDIA driver? What happens when you try 2 GPUs instead of 1 or 4?
I tried 2 GPUs and it's also very slow. The NVIDIA driver is 352.79. Anything else I can try/check?
I can go ask some people who would know more about this. A few questions before I do:
Can you try building without cuDNN at all and see if that makes a difference? I'm just spitballing here.
I tried building Caffe without cuDNN and it has the same issue, still very slow.
In the multi-GPU path there is synchronization between the GPUs that can cause delays, but I haven't seen those messages before. Do you have enough disk I/O to feed all the GPUs? E.g. do you see messages about waiting for data in the log output?
It's not a question of disk space, but I/O throughput. As you add GPUs, the pressure on the disk goes up. However, I haven't seen soft-lockup errors before. What motherboard chipset is this, e.g. Intel X99?
What version of Linux and kernel revision? Anything in system logs from NVRM? |
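For example, a minimal sketch of one way to check for those entries, assuming the syslog lives at /var/log/messages (CentOS default; usually needs root) and grepping for the keywords mentioned in this thread:

```python
#!/usr/bin/env python3
# Minimal sketch: pull NVRM / soft-lockup lines out of the syslog.
# The log path and keywords are assumptions based on this thread,
# not output captured from the poster's machine.
LOG_PATH = "/var/log/messages"          # CentOS default; usually requires root
KEYWORDS = ("NVRM", "soft lockup")

with open(LOG_PATH, errors="replace") as log:
    hits = [line.rstrip() for line in log if any(k in line for k in KEYWORDS)]

print("%d matching lines" % len(hits))
for line in hits[-20:]:                 # show only the most recent matches
    print(line)
```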
The motherboard is a Supermicro X10DRG-0T.
K40s? |
K40m. I checked /var/log/messages and here is (part of) what I found related to NVRM:
Can you try rolling back to your previously installed driver? |
Don't worry about it then. So this was working for you earlier this week and you didn't uninstall or reinstall a driver or anything like that? I'm working on trying to reproduce this bug from my end, but it may take me a while to get Caffe built on CentOS since I haven't done that before... |
Yup, there has been no change to the software/hardware since the last working job.
And when you go back to your original build, GoogLeNet still works? |
Quick smoketest on Ubuntu passed. The scaling is bad but I'm not seeing any NVRM lines in the logs.
OS: Ubuntu 14.04
1 GPU: 1min 53sec
Time to try it on CentOS...
GoogLeNet does not work anymore when I go back to the original build.
Smoketest on CentOS also passed.
OS: CentOS 7
1 GPU: 1min 4sec
(It's faster because I chose a different dataset and it's processing fewer images.)
Is it possible that the Supermicro chipset is causing the issue?
Let's investigate whether you might have a bad GPU. Can you try this:
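For example (a hedged sketch, not the exact command from this exchange): benchmark the same network on each GPU in isolation with `caffe time`, so a single bad GPU or PCIe slot stands out. The model path, iteration count, and GPU count below are assumptions.

```python
#!/usr/bin/env python3
# Hedged sketch: run `caffe time` on each GPU separately so one bad GPU stands out.
# The model path, iteration count, and number of GPUs are assumptions,
# not values taken from the original thread.
import subprocess

MODEL = "train_val.prototxt"   # hypothetical network definition
NUM_GPUS = 8                   # the poster mentions 8x K40m
ITERATIONS = 50

for gpu_id in range(NUM_GPUS):
    print("=== GPU %d ===" % gpu_id)
    subprocess.call([
        "caffe", "time",
        "--model=%s" % MODEL,
        "--gpu=%d" % gpu_id,
        "--iterations=%d" % ITERATIONS,
    ])
```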
Sure, that's possible too. |
I actually have 8x K40m, and training on a single GPU is fine. However, training on two GPUs becomes very slow for all GPUs.
And you say that multi-GPU was running at a reasonable speed just a few days ago? Can you think of anything relevant that you might have changed on the machine since then? |
FYI, this Dockerfile is enough to get Caffe built on CentOS7 (I had to do a bunch more hackery to build pycaffe and DIGITS, though). |
Nothing changed that I can think of... just a reboot. Thanks for the Dockerfile.
Let's take this offline for now. I sent you an email. |
I did not receive the email. Did you send it to ... ?
In the end, this was a duplicate of NVIDIA/caffe#10. |
@lukeyeager Does the fact that we have to use an NVIDIA fork of Caffe mean that we cannot use vanilla ResNets with Caffe? Anyone got experience with that? (ResNets work fine with the latest Torch built from source.)
I am trying to run a 50-layer residual network with 4 K40m GPUs and it's very slow (same batch_size of 16 as when running on a single GPU); 1 epoch takes 6 hours. However, if I run it on 1 GPU the speed is normal.
System: CentOS, DIGITS v3, NVcaffe 0.14
BTW, I tried GoogLeNet and it was OK on 4 GPUs.
Any suggestions or potential issues?