
CUDA problem when training ImageNet model #2809

Closed
7oud opened this issue Jul 23, 2015 · 9 comments

7oud commented Jul 23, 2015

I got an error when training the ImageNet model. I just followed the "ImageNet tutorial" step by step.

I0723 10:32:54.606801 11097 net.cpp:247] Network initialization done.
I0723 10:32:54.606806 11097 net.cpp:248] Memory required for data: 343607608
I0723 10:32:54.606875 11097 solver.cpp:42] Solver scaffolding done.
I0723 10:32:54.606909 11097 solver.cpp:250] Solving CaffeNet
I0723 10:32:54.606917 11097 solver.cpp:251] Learning Rate Policy: step
I0723 10:32:54.608069 11097 solver.cpp:294] Iteration 0, Testing net (#0)
I0723 10:35:43.717468 11097 solver.cpp:343] Test net output #0: accuracy = 0.001
I0723 10:35:43.738884 11097 solver.cpp:343] Test net output #1: loss = 7.13172 (* 1 = 7.13172 loss)
I0723 10:35:45.025822 11097 solver.cpp:214] Iteration 0, loss = 7.55976
I0723 10:35:45.025863 11097 solver.cpp:229] Train net output #0: loss = 7.55976 (* 1 = 7.55976 loss)
I0723 10:35:45.042021 11097 solver.cpp:486] Iteration 0, lr = 0.01
I0723 10:35:59.898746 11097 solver.cpp:214] Iteration 20, loss = 7.16701
I0723 10:35:59.898790 11097 solver.cpp:229] Train net output #0: loss = 7.16701 (* 1 = 7.16701 loss)
I0723 10:35:59.898805 11097 solver.cpp:486] Iteration 20, lr = 0.01
I0723 10:36:14.717177 11097 solver.cpp:214] Iteration 40, loss = 7.03349
I0723 10:36:14.717278 11097 solver.cpp:229] Train net output #0: loss = 7.03349 (* 1 = 7.03349 loss)
I0723 10:36:14.717293 11097 solver.cpp:486] Iteration 40, lr = 0.01
F0723 10:36:21.642776 11097 math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
@ 0x7f8577c29378 google::LogMessage::Fail()
@ 0x7f8577c292c4 google::LogMessage::SendToLog()
@ 0x7f8577c28cb2 google::LogMessage::Flush()
@ 0x7f8577c2bbb1 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f85780eabff caffe::caffe_gpu_gemm<>()
@ 0x7f8577fd2a58 caffe::BaseConvolutionLayer<>::weight_gpu_gemm()
@ 0x7f85780d1a97 caffe::ConvolutionLayer<>::Backward_gpu()
@ 0x7f8577f9f306 caffe::Net<>::BackwardFromTo()
@ 0x7f8577f9f511 caffe::Net<>::Backward()
@ 0x7f8577fb8967 caffe::Solver<>::Step()
@ 0x7f8577fb9447 caffe::Solver<>::Solve()
@ 0x408e39 train()
@ 0x40566b main
@ 0x7f857714576d (unknown)
@ 0x405af1 (unknown)

Is the problem with my CUDA config or something else?


7oud commented Jul 23, 2015

When I ran "make runtest", 16 tests failed. Could this be the reason for the above error?

[ PASSED ] 1340 tests.
[ FAILED ] 16 tests, listed below:
[ FAILED ] MemoryDataLayerTest/0.TestSetBatchSize, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/0.AddMatVectorDefaultTransform, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/1.TestSetBatchSize, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/1.AddMatVectorDefaultTransform, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/2.TestSetBatchSize, where TypeParam = caffe::GPUDevice
[ FAILED ] MemoryDataLayerTest/2.AddMatVectorDefaultTransform, where TypeParam = caffe::GPUDevice
[ FAILED ] MemoryDataLayerTest/3.TestSetBatchSize, where TypeParam = caffe::GPUDevice
[ FAILED ] MemoryDataLayerTest/3.AddMatVectorDefaultTransform, where TypeParam = caffe::GPUDevice
[ FAILED ] IOTest.TestDecodeDatumToCVMatContent
[ FAILED ] IOTest.TestDecodeDatumToCVMatNative
[ FAILED ] IOTest.TestDecodeDatumToCVMatNativeGray
[ FAILED ] IOTest.TestDecodeDatumToCVMatContentNative
[ FAILED ] IOTest.TestDecodeDatumToCVMat
[ FAILED ] IOTest.TestDecodeDatumNativeGray
[ FAILED ] IOTest.TestDecodeDatum
[ FAILED ] IOTest.TestDecodeDatumNative

16 FAILED TESTS
YOU HAVE 2 DISABLED TESTS


7oud commented Jul 23, 2015

My configuration:
GPU: 2 × 780 Ti with 3 GB VRAM
Batch size: 128
Ubuntu: 12.04 (Wubi installation)

@smartbitcoin

All test cases should pass. But does your "ImageNet" model use any of the layers covered by the 16 failed test cases? (I am not familiar with ImageNet.)
If a test fails for a layer you actually use, that may be the reason your training crashes.


7oud commented Jul 24, 2015

It seems that the 16 failed test cases would not cause an error like "CUBLAS_...". Also, why did the error appear at iteration 40 rather than at the beginning?

@smartbitcoin

Looks like it's an out-of-memory crash.

I0723 10:32:54.606806 11097 net.cpp:248] Memory required for data: 343607608

The log shows about 300 MB allocated for data, and real usage is perhaps 10× that. So how much VRAM do you have? If it's a 780 Ti with 3 GB, could you run nvidia-smi to confirm there is still VRAM available?
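
A quick way to keep an eye on GPU memory while training runs (assuming the standard NVIDIA driver utilities are installed; the 1-second refresh interval is just an example):

watch -n 1 nvidia-smi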


7oud commented Jul 24, 2015

Thanks. I'll check the memory usage.
Caffe ran out of memory when the batch size was 256, so I adjusted it to 128. Unfortunately it still crashes, and the log shows no explicit memory problem.


7oud commented Jul 27, 2015

Error fixed with batch size 64; VRAM consumption is about 1.6 GB. It may have been running out of memory with the larger batch size.
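
For reference, the batch size is set in the data layer of the train/val prototxt. A minimal sketch of the relevant layer, assuming the standard BVLC CaffeNet example layout (layer names, paths, and crop size follow that example and may differ in your setup):

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"
    batch_size: 64   # reduced from 256 so training fits in 3 GB of VRAM
    backend: LMDB
  }
}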

7oud closed this as completed Jul 27, 2015
@shelhamer (Member)

Note gradient accumulation #1977 for working with reduced memory. See the iter_size solver field.
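
A minimal sketch of how iter_size combines with the reduced batch size, assuming the data layer keeps batch_size: 64; the effective batch is batch_size × iter_size, so iter_size: 4 restores the tutorial's effective batch of 256. The other values follow the reference CaffeNet solver and may differ in your setup:

net: "models/bvlc_reference_caffenet/train_val.prototxt"
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
max_iter: 450000
momentum: 0.9
weight_decay: 0.0005
iter_size: 4        # accumulate gradients over 4 forward/backward passes per weight update
solver_mode: GPU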


asa008 commented Mar 5, 2019

math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED

The above error can also occur with CUDA 9.0.
Installing Patch 2 (released Mar 5, 2018) for CUDA 9.0 solves it.
