
CUDA problem when training ImageNet model #2809

Closed
7oud opened this issue Jul 23, 2015 · 9 comments

7oud commented Jul 23, 2015

I got an error when training the ImageNet model. I just followed the "ImageNet tutorial" step by step.

I0723 10:32:54.606801 11097 net.cpp:247] Network initialization done.
I0723 10:32:54.606806 11097 net.cpp:248] Memory required for data: 343607608
I0723 10:32:54.606875 11097 solver.cpp:42] Solver scaffolding done.
I0723 10:32:54.606909 11097 solver.cpp:250] Solving CaffeNet
I0723 10:32:54.606917 11097 solver.cpp:251] Learning Rate Policy: step
I0723 10:32:54.608069 11097 solver.cpp:294] Iteration 0, Testing net (#0)
I0723 10:35:43.717468 11097 solver.cpp:343] Test net output #0: accuracy = 0.001
I0723 10:35:43.738884 11097 solver.cpp:343] Test net output #1: loss = 7.13172 (* 1 = 7.13172 loss)
I0723 10:35:45.025822 11097 solver.cpp:214] Iteration 0, loss = 7.55976
I0723 10:35:45.025863 11097 solver.cpp:229] Train net output #0: loss = 7.55976 (* 1 = 7.55976 loss)
I0723 10:35:45.042021 11097 solver.cpp:486] Iteration 0, lr = 0.01
I0723 10:35:59.898746 11097 solver.cpp:214] Iteration 20, loss = 7.16701
I0723 10:35:59.898790 11097 solver.cpp:229] Train net output #0: loss = 7.16701 (* 1 = 7.16701 loss)
I0723 10:35:59.898805 11097 solver.cpp:486] Iteration 20, lr = 0.01
I0723 10:36:14.717177 11097 solver.cpp:214] Iteration 40, loss = 7.03349
I0723 10:36:14.717278 11097 solver.cpp:229] Train net output #0: loss = 7.03349 (* 1 = 7.03349 loss)
I0723 10:36:14.717293 11097 solver.cpp:486] Iteration 40, lr = 0.01
F0723 10:36:21.642776 11097 math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
@ 0x7f8577c29378 google::LogMessage::Fail()
@ 0x7f8577c292c4 google::LogMessage::SendToLog()
@ 0x7f8577c28cb2 google::LogMessage::Flush()
@ 0x7f8577c2bbb1 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f85780eabff caffe::caffe_gpu_gemm<>()
@ 0x7f8577fd2a58 caffe::BaseConvolutionLayer<>::weight_gpu_gemm()
@ 0x7f85780d1a97 caffe::ConvolutionLayer<>::Backward_gpu()
@ 0x7f8577f9f306 caffe::Net<>::BackwardFromTo()
@ 0x7f8577f9f511 caffe::Net<>::Backward()
@ 0x7f8577fb8967 caffe::Solver<>::Step()
@ 0x7f8577fb9447 caffe::Solver<>::Solve()
@ 0x408e39 train()
@ 0x40566b main
@ 0x7f857714576d (unknown)
@ 0x405af1 (unknown)

Is the problem with my CUDA config or something else?


7oud commented Jul 23, 2015

When I ran "make runtest", 16 tests failed. Could this be the reason for the above error?

[ PASSED ] 1340 tests.
[ FAILED ] 16 tests, listed below:
[ FAILED ] MemoryDataLayerTest/0.TestSetBatchSize, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/0.AddMatVectorDefaultTransform, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/1.TestSetBatchSize, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/1.AddMatVectorDefaultTransform, where TypeParam = caffe::CPUDevice
[ FAILED ] MemoryDataLayerTest/2.TestSetBatchSize, where TypeParam = caffe::GPUDevice
[ FAILED ] MemoryDataLayerTest/2.AddMatVectorDefaultTransform, where TypeParam = caffe::GPUDevice
[ FAILED ] MemoryDataLayerTest/3.TestSetBatchSize, where TypeParam = caffe::GPUDevice
[ FAILED ] MemoryDataLayerTest/3.AddMatVectorDefaultTransform, where TypeParam = caffe::GPUDevice
[ FAILED ] IOTest.TestDecodeDatumToCVMatContent
[ FAILED ] IOTest.TestDecodeDatumToCVMatNative
[ FAILED ] IOTest.TestDecodeDatumToCVMatNativeGray
[ FAILED ] IOTest.TestDecodeDatumToCVMatContentNative
[ FAILED ] IOTest.TestDecodeDatumToCVMat
[ FAILED ] IOTest.TestDecodeDatumNativeGray
[ FAILED ] IOTest.TestDecodeDatum
[ FAILED ] IOTest.TestDecodeDatumNative

16 FAILED TESTS
YOU HAVE 2 DISABLED TESTS


7oud commented Jul 23, 2015

My configuration:
GPU: 2 × 780 Ti with 3 GB VRAM
Batch size: 128
Ubuntu: 12.04 (Wubi installation)

@smartbitcoin

All test cases should pass. But does your "ImageNet" model use any of the layers covered by the 16 failed test cases? (I am not familiar with ImageNet.)
If a test fails for a layer you actually use, that may be the reason your training crashes.


7oud commented Jul 24, 2015

It seems that the 16 failed test cases would not cause an error like "CUBLAS_...". Also, why did the error appear at iteration 40 rather than at the beginning?

@smartbitcoin

Looks like it's an out-of-memory crash.

I0723 10:32:54.606806 11097 net.cpp:248] Memory required for data: 343607608

The log shows about 300 MB allocated for data, and real usage is perhaps 10× that. So how much VRAM do you have? If it's a 780 Ti with 3 GB, could you run nvidia-smi to confirm there is still VRAM available?
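
A quick way to keep an eye on GPU memory while training runs (assuming the standard NVIDIA driver utilities are installed; the 1-second refresh interval is just an example):

watch -n 1 nvidia-smi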


7oud commented Jul 24, 2015

Thanks. I'll check the memory usage.
Caffe ran out of memory when the batch size was 256, so I adjusted it to 128. Unfortunately it still crashes, and the log shows no explicit memory problem.


7oud commented Jul 27, 2015

Error fixed with batch size 64; VRAM consumption is about 1.6 GB. It may have been running out of memory with the larger batch size.
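
For reference, the batch size is set in the data layer of the train/val prototxt. A minimal sketch of the relevant layer, assuming the standard BVLC CaffeNet example layout (layer names, paths, and crop size follow that example and may differ in your setup):

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"
    batch_size: 64   # reduced from 256 so training fits in 3 GB of VRAM
    backend: LMDB
  }
}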

7oud closed this as completed Jul 27, 2015
@shelhamer (Member)

Note gradient accumulation #1977 for working with reduced memory. See the iter_size solver field.
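
A minimal sketch of how iter_size combines with the reduced batch size, assuming the data layer keeps batch_size: 64; the effective batch is batch_size × iter_size, so iter_size: 4 restores the tutorial's effective batch of 256. The other values follow the reference CaffeNet solver and may differ in your setup:

net: "models/bvlc_reference_caffenet/train_val.prototxt"
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
max_iter: 450000
momentum: 0.9
weight_decay: 0.0005
iter_size: 4        # accumulate gradients over 4 forward/backward passes per weight update
solver_mode: GPU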


asa008 commented Mar 5, 2019

math_functions.cu:28] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) CUBLAS_STATUS_EXECUTION_FAILED

The above error can also occur with CUDA 9.0.
Installing Patch 2 (released Mar 5, 2018) for CUDA 9.0 solves it.
