image_classification cifar-10 train error #95

Closed · apollos opened this issue Sep 19, 2016 · 5 comments · Labels: Bug

apollos commented Sep 19, 2016

Following the guide, I ran train.sh in Paddle/demo/image_classification; after about one minute I get a cuDNN error.

My env:
OS: Ubuntu 14.04.4
CUDA: 7.5
cuDNN: libcudnn.so.5.0.5
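For reference, a mismatch between the cudnn.h that Paddle was compiled against and the libcudnn.so the loader picks up at runtime can cause exactly this class of cuDNN errors. A minimal standalone check (generic cuDNN API, nothing Paddle-specific; assumes headers and library are under /usr/local/cuda):

```cpp
// check_cudnn.cc -- compare compile-time and runtime cuDNN versions.
// Build (paths are an assumption, adjust for your install):
//   g++ check_cudnn.cc -I/usr/local/cuda/include \
//       -L/usr/local/cuda/lib64 -lcudnn -o check_cudnn
#include <cudnn.h>
#include <cstdio>

int main() {
  // Version baked in from the cudnn.h used at compile time
  // (e.g. 5005 for cuDNN 5.0.5).
  printf("compile-time cuDNN: %d\n", CUDNN_VERSION);
  // Version reported by the libcudnn.so actually loaded at runtime.
  printf("runtime cuDNN:      %zu\n", cudnnGetVersion());
  return 0;
}
```

If the two numbers disagree, the binary is running against a different cuDNN than it was built for.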

yu@baymax:~/workspace/Paddle/demo/image_classification$ ./train.sh
I0919 23:32:31.849189 32690 Util.cpp:138] commandline: /opt/paddle/bin/paddle_trainer --config=vgg_16_cifar.py --dot_period=10 --log_period=100 --test_all_data_in_one_period=1 --use_gpu=1 --trainer_count=1 --num_passes=200 --save_dir=./cifar_vgg_model
I0919 23:32:32.094377 32690 Util.cpp:113] Calling runInitFunctions
I0919 23:32:32.094532 32690 Util.cpp:126] Call runInitFunctions done.
[INFO 2016-09-19 23:32:32,122 layers.py:1499] channels=3 size=3072
[INFO 2016-09-19 23:32:32,123 layers.py:1499] output size for conv_0 is 32
[INFO 2016-09-19 23:32:32,123 layers.py:1499] channels=64 size=65536
[INFO 2016-09-19 23:32:32,124 layers.py:1499] output size for conv_1 is 32
[INFO 2016-09-19 23:32:32,125 layers.py:1560] output size for pool_0 is 16_16
[INFO 2016-09-19 23:32:32,125 layers.py:1499] channels=64 size=16384
[INFO 2016-09-19 23:32:32,125 layers.py:1499] output size for conv_2 is 16
[INFO 2016-09-19 23:32:32,126 layers.py:1499] channels=128 size=32768
[INFO 2016-09-19 23:32:32,126 layers.py:1499] output size for conv_3 is 16
[INFO 2016-09-19 23:32:32,127 layers.py:1560] output size for pool_1 is 8_8
[INFO 2016-09-19 23:32:32,127 layers.py:1499] channels=128 size=8192
[INFO 2016-09-19 23:32:32,128 layers.py:1499] output size for conv_4 is 8
[INFO 2016-09-19 23:32:32,128 layers.py:1499] channels=256 size=16384
[INFO 2016-09-19 23:32:32,129 layers.py:1499] output size for conv_5 is 8
[INFO 2016-09-19 23:32:32,129 layers.py:1499] channels=256 size=16384
[INFO 2016-09-19 23:32:32,130 layers.py:1499] output size for conv_6 is 8
[INFO 2016-09-19 23:32:32,130 layers.py:1560] output size for pool_2 is 4_4
[INFO 2016-09-19 23:32:32,131 layers.py:1499] channels=256 size=4096
[INFO 2016-09-19 23:32:32,131 layers.py:1499] output size for conv_7 is 4
[INFO 2016-09-19 23:32:32,132 layers.py:1499] channels=512 size=8192
[INFO 2016-09-19 23:32:32,132 layers.py:1499] output size for conv_8 is 4
[INFO 2016-09-19 23:32:32,133 layers.py:1499] channels=512 size=8192
[INFO 2016-09-19 23:32:32,133 layers.py:1499] output size for conv_9 is 4
[INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_3 is 2_2
[INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_4 is 1_1
[INFO 2016-09-19 23:32:32,136 networks.py:1122] The input order is [image, label]
[INFO 2016-09-19 23:32:32,136 networks.py:1129] The output order is [cost_0]
I0919 23:32:32.141587 32690 Trainer.cpp:169] trainer mode: Normal
I0919 23:32:32.147541 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData
[INFO 2016-09-19 23:32:32,168 image_provider.py:52] Image size: 32
[INFO 2016-09-19 23:32:32,168 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta
[INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished
I0919 23:32:32.169145 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData
[INFO 2016-09-19 23:32:32,169 image_provider.py:52] Image size: 32
[INFO 2016-09-19 23:32:32,169 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta
[INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished
I0919 23:32:32.169431 32690 GradientMachine.cpp:134] Initing parameters..
I0919 23:32:32.531777 32690 GradientMachine.cpp:141] Init parameters done.
.........
I0919 23:32:53.036734 32690 TrainerInternal.cpp:162] Batch=100 samples=12800 AvgCost=2.38877 CurrentCost=2.38877 Eval: classification_error_evaluator=0.834219 CurrentEval: classification_error_evaluator=0.834219
.........
I0919 23:33:03.698333 32690 TrainerInternal.cpp:162] Batch=200 samples=25600 AvgCost=2.17996 CurrentCost=1.97115 Eval: classification_error_evaluator=0.786719 CurrentEval: classification_error_evaluator=0.739219
.........
I0919 23:33:14.348435 32690 TrainerInternal.cpp:162] Batch=300 samples=38400 AvgCost=2.02001 CurrentCost=1.7001 Eval: classification_error_evaluator=0.7425 CurrentEval: classification_error_evaluator=0.654062
.........I0919 23:33:23.978550 32690 TrainerInternal.cpp:179] Pass=0 Batch=391 samples=50048 AvgCost=1.90913 Eval: classification_error_evaluator=0.70658
F0919 23:33:26.893599 32690 hl_cuda_cudnn.cc:779] Check failed: CUDNN_STATUS_SUCCESS == cudnnStat (0 vs. 5) Cudnn Error: CUDNN_STATUS_INVALID_VALUE
*** Check failure stack trace: ***
@ 0x7efcce9e2daa (unknown)
@ 0x7efcce9e2ce4 (unknown)
@ 0x7efcce9e26e6 (unknown)
@ 0x7efcce9e5687 (unknown)
@ 0x8affc4 hl_convolution_forward()
@ 0x64bb3c paddle::CudnnConvLayer::forward()
@ 0x5a2130 paddle::NeuralNetwork::forward()
@ 0x6bb43f paddle::Tester::testOneBatch()
@ 0x6bbd52 paddle::Tester::testOnePeriod()
@ 0x6a052c paddle::Trainer::trainOnePass()
@ 0x6a3927 paddle::Trainer::train()
@ 0x53bfd3 main
@ 0x7efccdbeef45 (unknown)
@ 0x547795 (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 81: 32690 Aborted (core dumped) ${DEBUGGER} /opt/paddle/bin/paddle_trainer ${@:2}
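The fatal line is a glog CHECK around a cuDNN call in hl_convolution_forward(): the "(0 vs. 5)" compares CUDNN_STATUS_SUCCESS (0) against the returned status CUDNN_STATUS_INVALID_VALUE (5), and the stack trace shows the check fires in paddle::Tester::testOneBatch(), i.e. during the test period after pass 0 rather than during training itself. A minimal sketch of that check pattern (a simplified stand-in, not Paddle's actual macro):

```cpp
// Simplified stand-in for the status check in hl_cuda_cudnn.cc:
// every cuDNN call returns a cudnnStatus_t, and the trainer aborts
// when it is not CUDNN_STATUS_SUCCESS.
#include <cudnn.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDNN(expr)                                               \
  do {                                                                  \
    cudnnStatus_t status = (expr);                                      \
    if (status != CUDNN_STATUS_SUCCESS) {                               \
      /* CUDNN_STATUS_SUCCESS == 0, CUDNN_STATUS_INVALID_VALUE == 5, */ \
      /* hence the "(0 vs. 5)" in the failure message above.         */ \
      fprintf(stderr, "Cudnn Error: %s\n", cudnnGetErrorString(status));\
      abort(); /* produces the "Aborted (core dumped)" seen above */    \
    }                                                                   \
  } while (0)

int main() {
  cudnnHandle_t handle;
  CHECK_CUDNN(cudnnCreate(&handle));
  CHECK_CUDNN(cudnnDestroy(handle));
  return 0;
}
```

CUDNN_STATUS_INVALID_VALUE means cuDNN rejected one of the arguments it was given (for example a descriptor dimension or workspace size), as opposed to failing internally.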

gangliao (Contributor) commented

Hi, could you post your GPU model number? For instance, GTX TITAN X?

gangliao added the Bug label Sep 21, 2016
apollos (Author) commented Sep 21, 2016

GTX 970

apollos (Author) commented Sep 21, 2016

I updated cuDNN to libcudnn.so.5.1.3 but still get the same issue.
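One thing worth ruling out after swapping libraries (a general suggestion, not something verified in this thread): the process may still resolve the old libcudnn.so.5 if both copies remain on the linker search path. A small hypothetical probe of which library the dynamic linker actually loads:

```cpp
// which_cudnn.cc -- ask the dynamic linker for libcudnn.so.5 and
// report the version it resolves to. Build: g++ which_cudnn.cc -ldl
#include <dlfcn.h>
#include <cstdio>

int main() {
  void* lib = dlopen("libcudnn.so.5", RTLD_NOW);
  if (!lib) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  // Look up cudnnGetVersion() in whatever library was resolved.
  typedef size_t (*GetVersionFn)();
  GetVersionFn get_version =
      reinterpret_cast<GetVersionFn>(dlsym(lib, "cudnnGetVersion"));
  if (get_version) printf("loaded cuDNN version: %zu\n", get_version());
  dlclose(lib);
  return 0;
}
```

If this still reports 5.0.5 after the upgrade, the new library is not the one being loaded (stale ldconfig cache or LD_LIBRARY_PATH are common culprits).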

gangliao (Contributor) commented

Several users have reported the same problem, but we still cannot reproduce it. We will try to reproduce and fix it as soon as possible.

qingqing01 (Contributor) commented

@apollos The bug is fixed; see #107. Thanks for your attention.
