image_classification cifar-10 train error #95
Hi, can you paste your GPU model number? For instance, GTX TITAN X?
GTX 970.
I updated cuDNN to libcudnn.so.5.1.3 but still get the same issue.
Several users have already reported the same problem, but we still cannot reproduce it. We will try to reproduce and fix it as soon as possible.
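For anyone else hitting this: a quick way to confirm which cuDNN library the trainer actually loads (a minimal check; the binary path is taken from the log below, adjust it to your install):

# Show which libcudnn the dynamic linker resolves for the trainer binary
ldd /opt/paddle/bin/paddle_trainer | grep cudnn
# List every libcudnn in the linker cache; a stale copy can shadow a newly installed 5.1.x
ldconfig -p | grep libcudnn
# Refresh the linker cache after replacing the library
sudo ldconfig

If ldd still shows libcudnn.so.5.0.5 after the update, the new library is not the one actually being loaded.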
Following the guide, I run train.sh in Paddle/demo/image_classification; after around 1 minute, I get a cuDNN error.
My env:
OS: Ubuntu 14.04.4
CUDA: 7.5
cuDNN: libcudnn.so.5.0.5
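As a sanity check on the environment above, the header and library versions can be compared directly (a sketch; paths assume the default CUDA install location):

# Version macros from the cuDNN header (CUDNN_MAJOR/MINOR/PATCHLEVEL in v5 headers)
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h
# Installed library files and their sonames
ls -l /usr/local/cuda/lib64/libcudnn*

A mismatch between the header the binary was built against and the runtime .so can lead to errors such as the CUDNN_STATUS_INVALID_VALUE in the log below.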
yu@baymax:~/workspace/Paddle/demo/image_classification$ ./train.sh
I0919 23:32:31.849189 32690 Util.cpp:138] commandline: /opt/paddle/bin/paddle_trainer --config=vgg_16_cifar.py --dot_period=10 --log_period=100 --test_all_data_in_one_period=1 --use_gpu=1 --trainer_count=1 --num_passes=200 --save_dir=./cifar_vgg_model
I0919 23:32:32.094377 32690 Util.cpp:113] Calling runInitFunctions
I0919 23:32:32.094532 32690 Util.cpp:126] Call runInitFunctions done.
[INFO 2016-09-19 23:32:32,122 layers.py:1499] channels=3 size=3072
[INFO 2016-09-19 23:32:32,123 layers.py:1499] output size for conv_0 is 32
[INFO 2016-09-19 23:32:32,123 layers.py:1499] channels=64 size=65536
[INFO 2016-09-19 23:32:32,124 layers.py:1499] output size for conv_1 is 32
[INFO 2016-09-19 23:32:32,125 layers.py:1560] output size for pool_0 is 16*16
[INFO 2016-09-19 23:32:32,125 layers.py:1499] channels=64 size=16384
[INFO 2016-09-19 23:32:32,125 layers.py:1499] output size for conv_2 is 16
[INFO 2016-09-19 23:32:32,126 layers.py:1499] channels=128 size=32768
[INFO 2016-09-19 23:32:32,126 layers.py:1499] output size for conv_3 is 16
[INFO 2016-09-19 23:32:32,127 layers.py:1560] output size for pool_1 is 8*8
[INFO 2016-09-19 23:32:32,127 layers.py:1499] channels=128 size=8192
[INFO 2016-09-19 23:32:32,128 layers.py:1499] output size for conv_4 is 8
[INFO 2016-09-19 23:32:32,128 layers.py:1499] channels=256 size=16384
[INFO 2016-09-19 23:32:32,129 layers.py:1499] output size for conv_5 is 8
[INFO 2016-09-19 23:32:32,129 layers.py:1499] channels=256 size=16384
[INFO 2016-09-19 23:32:32,130 layers.py:1499] output size for conv_6 is 8
[INFO 2016-09-19 23:32:32,130 layers.py:1560] output size for pool_2 is 4*4
[INFO 2016-09-19 23:32:32,131 layers.py:1499] channels=256 size=4096
[INFO 2016-09-19 23:32:32,131 layers.py:1499] output size for conv_7 is 4
[INFO 2016-09-19 23:32:32,132 layers.py:1499] channels=512 size=8192
[INFO 2016-09-19 23:32:32,132 layers.py:1499] output size for conv_8 is 4
[INFO 2016-09-19 23:32:32,133 layers.py:1499] channels=512 size=8192
[INFO 2016-09-19 23:32:32,133 layers.py:1499] output size for conv_9 is 4
[INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_3 is 2*2
[INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_4 is 1*1
[INFO 2016-09-19 23:32:32,136 networks.py:1122] The input order is [image, label]
[INFO 2016-09-19 23:32:32,136 networks.py:1129] The output order is [cost_0]
I0919 23:32:32.141587 32690 Trainer.cpp:169] trainer mode: Normal
I0919 23:32:32.147541 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData
[INFO 2016-09-19 23:32:32,168 image_provider.py:52] Image size: 32
[INFO 2016-09-19 23:32:32,168 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta
[INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished
I0919 23:32:32.169145 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData
[INFO 2016-09-19 23:32:32,169 image_provider.py:52] Image size: 32
[INFO 2016-09-19 23:32:32,169 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta
[INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished
I0919 23:32:32.169431 32690 GradientMachine.cpp:134] Initing parameters..
I0919 23:32:32.531777 32690 GradientMachine.cpp:141] Init parameters done.
.........
I0919 23:32:53.036734 32690 TrainerInternal.cpp:162] Batch=100 samples=12800 AvgCost=2.38877 CurrentCost=2.38877 Eval: classification_error_evaluator=0.834219 CurrentEval: classification_error_evaluator=0.834219
.........
I0919 23:33:03.698333 32690 TrainerInternal.cpp:162] Batch=200 samples=25600 AvgCost=2.17996 CurrentCost=1.97115 Eval: classification_error_evaluator=0.786719 CurrentEval: classification_error_evaluator=0.739219
.........
I0919 23:33:14.348435 32690 TrainerInternal.cpp:162] Batch=300 samples=38400 AvgCost=2.02001 CurrentCost=1.7001 Eval: classification_error_evaluator=0.7425 CurrentEval: classification_error_evaluator=0.654062
.........I0919 23:33:23.978550 32690 TrainerInternal.cpp:179] Pass=0 Batch=391 samples=50048 AvgCost=1.90913 Eval: classification_error_evaluator=0.70658
F0919 23:33:26.893599 32690 hl_cuda_cudnn.cc:779] Check failed: CUDNN_STATUS_SUCCESS == cudnnStat (0 vs. 5) Cudnn Error: CUDNN_STATUS_INVALID_VALUE
*** Check failure stack trace: ***
@ 0x7efcce9e2daa (unknown)
@ 0x7efcce9e2ce4 (unknown)
@ 0x7efcce9e26e6 (unknown)
@ 0x7efcce9e5687 (unknown)
@ 0x8affc4 hl_convolution_forward()
@ 0x64bb3c paddle::CudnnConvLayer::forward()
@ 0x5a2130 paddle::NeuralNetwork::forward()
@ 0x6bb43f paddle::Tester::testOneBatch()
@ 0x6bbd52 paddle::Tester::testOnePeriod()
@ 0x6a052c paddle::Trainer::trainOnePass()
@ 0x6a3927 paddle::Trainer::train()
@ 0x53bfd3 main
@ 0x7efccdbeef45 (unknown)
@ 0x547795 (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 81: 32690 Aborted (core dumped) ${DEBUGGER} /opt/paddle/bin/paddle_trainer ${@:2}
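Since the process aborts with a core dump, a native backtrace can help narrow down the failing cuDNN call (a sketch; assumes core files are enabled and written as ./core, which varies by system):

# Enable core files in the current shell, reproduce the crash, then inspect the core
ulimit -c unlimited
./train.sh
gdb /opt/paddle/bin/paddle_trainer core -ex bt -ex quit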