image_classification cifar-10 train error #95
Hi, can you paste your GPU model number? For instance, GTX TITAN X?
GTX 970.
I updated cuDNN to libcudnn.so.5.1.3 but still get the same issue.
Several users have already reported the same problem, but we still cannot reproduce it. We will try to reproduce and fix it as soon as possible.
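For anyone else hitting this: a quick way to confirm which cuDNN library the trainer actually loads (a minimal check; the binary path is taken from the log below, adjust it to your install):

# Show which libcudnn the dynamic linker resolves for the trainer binary
ldd /opt/paddle/bin/paddle_trainer | grep cudnn
# List every libcudnn in the linker cache; a stale copy can shadow a newly installed 5.1.x
ldconfig -p | grep libcudnn
# Refresh the linker cache after replacing the library
sudo ldconfig

If ldd still shows libcudnn.so.5.0.5 after the update, the new library is not the one actually being loaded.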
Following the guide, I run train.sh in Paddle/demo/image_classification; after around 1 minute, I get a cuDNN error.
My env:
OS: Ubuntu 14.04.4
CUDA: 7.5
cuDNN: libcudnn.so.5.0.5
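As a sanity check on the environment above, the header and library versions can be compared directly (a sketch; paths assume the default CUDA install location):

# Version macros from the cuDNN header (CUDNN_MAJOR/MINOR/PATCHLEVEL in v5 headers)
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h
# Installed library files and their sonames
ls -l /usr/local/cuda/lib64/libcudnn*

A mismatch between the header the binary was built against and the runtime .so can lead to errors such as the CUDNN_STATUS_INVALID_VALUE in the log below.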
yu@baymax:~/workspace/Paddle/demo/image_classification$ ./train.sh
I0919 23:32:31.849189 32690 Util.cpp:138] commandline: /opt/paddle/bin/paddle_trainer --config=vgg_16_cifar.py --dot_period=10 --log_period=100 --test_all_data_in_one_period=1 --use_gpu=1 --trainer_count=1 --num_passes=200 --save_dir=./cifar_vgg_model
I0919 23:32:32.094377 32690 Util.cpp:113] Calling runInitFunctions
I0919 23:32:32.094532 32690 Util.cpp:126] Call runInitFunctions done.
[INFO 2016-09-19 23:32:32,122 layers.py:1499] channels=3 size=3072
[INFO 2016-09-19 23:32:32,123 layers.py:1499] output size for conv_0 is 32
[INFO 2016-09-19 23:32:32,123 layers.py:1499] channels=64 size=65536
[INFO 2016-09-19 23:32:32,124 layers.py:1499] output size for conv_1 is 32
[INFO 2016-09-19 23:32:32,125 layers.py:1560] output size for pool_0 is 16*16
[INFO 2016-09-19 23:32:32,125 layers.py:1499] channels=64 size=16384
[INFO 2016-09-19 23:32:32,125 layers.py:1499] output size for conv_2 is 16
[INFO 2016-09-19 23:32:32,126 layers.py:1499] channels=128 size=32768
[INFO 2016-09-19 23:32:32,126 layers.py:1499] output size for conv_3 is 16
[INFO 2016-09-19 23:32:32,127 layers.py:1560] output size for pool_1 is 8*8
[INFO 2016-09-19 23:32:32,127 layers.py:1499] channels=128 size=8192
[INFO 2016-09-19 23:32:32,128 layers.py:1499] output size for conv_4 is 8
[INFO 2016-09-19 23:32:32,128 layers.py:1499] channels=256 size=16384
[INFO 2016-09-19 23:32:32,129 layers.py:1499] output size for conv_5 is 8
[INFO 2016-09-19 23:32:32,129 layers.py:1499] channels=256 size=16384
[INFO 2016-09-19 23:32:32,130 layers.py:1499] output size for conv_6 is 8
[INFO 2016-09-19 23:32:32,130 layers.py:1560] output size for pool_2 is 4*4
[INFO 2016-09-19 23:32:32,131 layers.py:1499] channels=256 size=4096
[INFO 2016-09-19 23:32:32,131 layers.py:1499] output size for conv_7 is 4
[INFO 2016-09-19 23:32:32,132 layers.py:1499] channels=512 size=8192
[INFO 2016-09-19 23:32:32,132 layers.py:1499] output size for conv_8 is 4
[INFO 2016-09-19 23:32:32,133 layers.py:1499] channels=512 size=8192
[INFO 2016-09-19 23:32:32,133 layers.py:1499] output size for conv_9 is 4
[INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_3 is 2*2
[INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_4 is 1*1
[INFO 2016-09-19 23:32:32,136 networks.py:1122] The input order is [image, label]
[INFO 2016-09-19 23:32:32,136 networks.py:1129] The output order is [cost_0]
I0919 23:32:32.141587 32690 Trainer.cpp:169] trainer mode: Normal
I0919 23:32:32.147541 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData
[INFO 2016-09-19 23:32:32,168 image_provider.py:52] Image size: 32
[INFO 2016-09-19 23:32:32,168 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta
[INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished
I0919 23:32:32.169145 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData
[INFO 2016-09-19 23:32:32,169 image_provider.py:52] Image size: 32
[INFO 2016-09-19 23:32:32,169 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta
[INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished
I0919 23:32:32.169431 32690 GradientMachine.cpp:134] Initing parameters..
I0919 23:32:32.531777 32690 GradientMachine.cpp:141] Init parameters done.
.........
I0919 23:32:53.036734 32690 TrainerInternal.cpp:162] Batch=100 samples=12800 AvgCost=2.38877 CurrentCost=2.38877 Eval: classification_error_evaluator=0.834219 CurrentEval: classification_error_evaluator=0.834219
.........
I0919 23:33:03.698333 32690 TrainerInternal.cpp:162] Batch=200 samples=25600 AvgCost=2.17996 CurrentCost=1.97115 Eval: classification_error_evaluator=0.786719 CurrentEval: classification_error_evaluator=0.739219
.........
I0919 23:33:14.348435 32690 TrainerInternal.cpp:162] Batch=300 samples=38400 AvgCost=2.02001 CurrentCost=1.7001 Eval: classification_error_evaluator=0.7425 CurrentEval: classification_error_evaluator=0.654062
.........I0919 23:33:23.978550 32690 TrainerInternal.cpp:179] Pass=0 Batch=391 samples=50048 AvgCost=1.90913 Eval: classification_error_evaluator=0.70658
F0919 23:33:26.893599 32690 hl_cuda_cudnn.cc:779] Check failed: CUDNN_STATUS_SUCCESS == cudnnStat (0 vs. 5) Cudnn Error: CUDNN_STATUS_INVALID_VALUE
*** Check failure stack trace: ***
@ 0x7efcce9e2daa (unknown)
@ 0x7efcce9e2ce4 (unknown)
@ 0x7efcce9e26e6 (unknown)
@ 0x7efcce9e5687 (unknown)
@ 0x8affc4 hl_convolution_forward()
@ 0x64bb3c paddle::CudnnConvLayer::forward()
@ 0x5a2130 paddle::NeuralNetwork::forward()
@ 0x6bb43f paddle::Tester::testOneBatch()
@ 0x6bbd52 paddle::Tester::testOnePeriod()
@ 0x6a052c paddle::Trainer::trainOnePass()
@ 0x6a3927 paddle::Trainer::train()
@ 0x53bfd3 main
@ 0x7efccdbeef45 (unknown)
@ 0x547795 (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 81: 32690 Aborted (core dumped) ${DEBUGGER} /opt/paddle/bin/paddle_trainer ${@:2}
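Since the process aborts with a core dump, a native backtrace can help narrow down the failing cuDNN call (a sketch; assumes core files are enabled and written as ./core, which varies by system):

# Enable core files in the current shell, reproduce the crash, then inspect the core
ulimit -c unlimited
./train.sh
gdb /opt/paddle/bin/paddle_trainer core -ex bt -ex quit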