Out-of-memory on g2.8xlarge #34

Closed
lukeyeager opened this issue Sep 15, 2015 · 4 comments

@lukeyeager
Member

See NVIDIA/DIGITS#310.

/cc @ajsander

I've trained a couple of models (AlexNet and GoogLeNet) successfully using DIGITS, with statistics shown for test and validation accuracy, but when I try to classify a single image using the web interface I get the following error:

WARNING: Logging before InitGoogleLogging() is written to STDERR
F0915 14:10:45.809661 98789 common.cpp:266] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***

When I check nvidia-smi, it appears that the amount of memory in use increases by around 100 MB, but it's still nowhere near the card's full 3 GB memory capacity.
NVIDIA/DIGITS#310 (comment)
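
As an aside (my addition, not from the original report): one convenient way to watch per-GPU memory while reproducing this is to poll nvidia-smi's standard query flags; the one-second interval below is arbitrary.

import subprocess
import time

# Print used/total memory for each GPU once per second.
while True:
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=index,memory.used,memory.total',
        '--format=csv,noheader',
    ]).decode()
    print(out.strip())
    time.sleep(1)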

Here is some information about his system:

Running on an Amazon g2.8xlarge
GPU[s]: 4x GRID K520
CUDA 7.0
cuDNN 7.0
Caffe version 0.12 NVIDIA fork
DIGITS 2.1

Both AlexNet and GoogLeNet experienced the same problem.
NVIDIA/DIGITS#310 (comment)

Here's how I reproduced it:

  1. Start up an Ubuntu 14.04 g2.8xlarge EC2 instance
  2. Install the 346 driver
  3. Install DIGITS 2.0 and Caffe 0.13.1 (with CNMeM) using the web installer
  4. Create a small dataset of 256x256 images
  5. Train AlexNet on it
  6. Try to classify an image (see the sketch after this list for roughly what that step does)
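
For reference, here is a rough pycaffe sketch of what the failing classification step amounts to. The file names are placeholders and this is only an approximation, not DIGITS' actual code path; GPU memory is allocated when the net is constructed and during the forward pass, which is where the reported OOM check fires.

import caffe

caffe.set_mode_gpu()
caffe.set_device(0)

# Placeholder paths: a deploy prototxt and trained weights from a DIGITS job
net = caffe.Net('deploy.prototxt', 'snapshot_iter_1000.caffemodel', caffe.TEST)

# Preprocess a single image into the network's input blob (HWC -> CHW)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))

image = caffe.io.load_image('test.jpg')  # float image in [0, 1], HWC layout
net.blobs['data'].data[...] = transformer.preprocess('data', image)

# The CUDA allocations behind this forward pass are where the OOM surfaces
out = net.forward()
print(out[net.outputs[0]].argmax())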

The big question

Why would we run out of memory during inference but not while training?

@kklemon

kklemon commented Sep 19, 2015

I also get the same issue when trying to classify an image at very low memory load.

My card is a GTX 660 Ti with 2 GB, but the memory usage when running into the error is only about 10%.

The system is Ubuntu 15.04, and I'm using the most recent versions of DIGITS and NVIDIA's Caffe fork. Both are compiled manually, without the web installer.

The error is the same as described above:

F0919 14:01:17.078284 16231 math_functions.cu:81] Check failed: error == cudaSuccess (2 vs. 0)  out of memory

After getting the error, the model also disappears completely. It doesn't matter whether it was trained to the end or not; it just disappears from the model list, and I get 404 errors when trying to open it. It might also be interesting that when I try to classify an image during training and hit the error, the model is no longer updated or listed in the web interface, but the GPU and CPU still seem to be busy. So Caffe is probably still running in the background.

A log message that also appears every time is:

Caught PicklingError while saving job: Can't pickle <class 'caffe_pb2.NetParameter'>: it's not found as caffe_pb2.NetParameter

but I don't know whether it's somehow related to this bug.
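
As a side note (my own toy illustration, not DIGITS' code): pickle stores a class by its module and name and must be able to look it up again under that name, so a class that doesn't resolve back to caffe_pb2.NetParameter can't be pickled, which is the failure mode in that log line.

import pickle

class NetParameter(object):
    pass

# Pretend this class lives in caffe_pb2, the way the dynamically generated
# protobuf class does; pickle tries to resolve caffe_pb2.NetParameter and
# raises PicklingError when the lookup doesn't lead back to this class.
NetParameter.__module__ = 'caffe_pb2'

try:
    pickle.dumps(NetParameter)
except pickle.PicklingError as exc:
    print(exc)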

@drozdvadym

+1, I get the same error

  • GPU: GeForce GTX 660 Ti (#0)
  • OS: Linux 3.13.0-63-generic x86_64 GNU/Linux
  • Caffe: NVIDIA fork 0.13, DIGITS 2.2.1

@lukeyeager
Member Author

See here also:
https://groups.google.com/d/msg/digits-users/8-Bqik4nECI/HDQmqzSuBQAJ

Oddly, his problem went away after upgrading to the latest Caffe and DIGITS.

@lukeyeager
Member Author

Seems to be solved for this guy with v0.14:
http://www.learnopencv.com/nvidia-digits-3-on-ec2/
