
Cuda Error: an illegal memory access was encountered #182

Closed
jamestang0219 opened this issue Oct 10, 2016 · 13 comments

@jamestang0219

jamestang0219 commented Oct 10, 2016

While training an LSTM model, this error occurs. Here is the relevant part of the training log:

```
I1010 05:43:00.041766 119630 TrainerInternal.cpp:204] ___fc_layer_0__.w0   avg_abs_val=0.175171    max_val=1.02055     avg_abs_grad=0.0117175   max_grad=1.19871
I1010 05:43:00.042105 119630 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.127698    max_val=0.430612    avg_abs_grad=0.0627174   max_grad=1.82515
I1010 05:43:00.042553 119630 TrainerInternal.cpp:204] ___lstmemory_0__.w0  avg_abs_val=0.133314    max_val=0.794175    avg_abs_grad=0.0206954   max_grad=2.81496
I1010 05:43:00.042837 119630 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.115955    max_val=0.508302    avg_abs_grad=0.101172    max_grad=14.5017
I1010 05:43:00.043148 119630 TrainerInternal.cpp:204] ___fc_layer_1__.w0   avg_abs_val=0.274876    max_val=0.900992    avg_abs_grad=0.382739    max_grad=2.3795
I1010 05:43:00.043421 119630 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.217184    max_val=0.217184    avg_abs_grad=0.373559    max_grad=0.37356

I1010 05:43:00.043450 119630 TrainerInternal.cpp:162]  Batch=6400 samples=819200 AvgCost=0.294956 CurrentCost=0.31619 Eval: classification_error_evaluator=0.128513  CurrentEval: classification_error_evaluator=0.1375
...................
I1010 05:43:06.412021 119630 TrainerInternal.cpp:162]  Batch=6420 samples=821760 AvgCost=0.294991 CurrentCost=0.30623 Eval: classification_error_evaluator=0.128552  CurrentEval: classification_error_evaluator=0.141016
...................
I1010 05:43:13.165990 119630 TrainerInternal.cpp:162]  Batch=6440 samples=824320 AvgCost=0.29512 CurrentCost=0.336758 Eval: classification_error_evaluator=0.128653  CurrentEval: classification_error_evaluator=0.160938
...................
I1010 05:43:19.573108 119630 TrainerInternal.cpp:162]  Batch=6460 samples=826880 AvgCost=0.295171 CurrentCost=0.311461 Eval: classification_error_evaluator=0.128692  CurrentEval: classification_error_evaluator=0.141406
...................F1010 05:43:26.722699 119642 hl_cuda_device.cc:646] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f1c2eccadaa  (unknown)
    @     0x7f1c2eccace4  (unknown)
    @     0x7f1c2ecca6e6  (unknown)
    @     0x7f1c2eccd687  (unknown)
    @           0x8ae00b  hl_stream_synchronize()
    @           0x8ca3b0  hl_max_sequence_backward()
    @           0x6c860e  paddle::GpuMatrix::maxSequenceBackward()
    @           0x5f71db  paddle::MaxLayer::backward()
    @           0x67228e  paddle::NeuralNetwork::backward()
    @           0x65324c  paddle::TrainerThread::backward()
    @           0x65337d  paddle::TrainerThread::computeThread()
    @     0x7f1c2e847a60  (unknown)
    @     0x7f1c2f883184  start_thread
    @     0x7f1c2dfaf37d  (unknown)
    @              (nil)  (unknown)
/usr/local/bin/paddle: line 46: 119630 Aborted                 (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}
```

I also checked nvidia-smi before the error occurred:

```
Mon Oct 10 05:12:26 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.93     Driver Version: 352.93         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   52C    P0    66W / 125W |   1497MiB /  4095MiB |     41%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           Off  | 0000:00:04.0     Off |                  N/A |
| N/A   56C    P0    55W / 125W |   1257MiB /  4095MiB |     47%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           Off  | 0000:00:05.0     Off |                  N/A |
| N/A   54C    P0    61W / 125W |   1079MiB /  4095MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           Off  | 0000:00:06.0     Off |                  N/A |
| N/A   58C    P0    60W / 125W |   1049MiB /  4095MiB |     70%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1484MiB |
|    1    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1243MiB |
|    2    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1066MiB |
|    3    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1036MiB |
+-----------------------------------------------------------------------------+
```

The model initializes successfully, but the error occurs once Paddle starts training on samples.

I've tried this several times and get the same error each time.

@backyes
Contributor

backyes commented Oct 10, 2016

Please show us more details, such as the command line, Paddle version, and model config file.

@jamestang0219
Author

@backyes
paddle version:

```
=> paddle version
PaddlePaddle 0.8.0b, compiled with
    with_avx: ON
    with_gpu: ON
    with_double: OFF
    with_python: ON
    with_rdma: OFF
    with_glog: ON
    with_gflags: ON
    with_metric_learning:
    with_timer: OFF
    with_predict_sdk:
```

cmd line:

```bash
model=lstm.py
paddle train \
    --config=$model \
    --save_dir=./2classoutput \
    --trainer_count=4 \
    --log_period=100 \
    --num_passes=16 \
    --use_gpu=true \
    --show_parameter_stats_period=1000 \
    --test_all_data_in_one_period=1 \
    2>&1 | tee 'lstm_train_2class.log'
```

@hedaoyuan
Contributor

Hi @jamestang0219,
Try CPU mode with `--use_gpu=false` and see whether the same error occurs.
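
For example, the command posted above with only the GPU flag switched off (the log file name is changed here just to keep the runs apart):

```bash
paddle train \
    --config=lstm.py \
    --save_dir=./2classoutput \
    --trainer_count=4 \
    --log_period=100 \
    --num_passes=16 \
    --use_gpu=false \
    --show_parameter_stats_period=1000 \
    --test_all_data_in_one_period=1 \
    2>&1 | tee 'lstm_train_2class_cpu.log'
```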

@jamestang0219
Author

@hedaoyuan Without the GPU it works fine, but it is too slow. I want to use 4 GPUs to speed up training.

@hedaoyuan
Contributor

There may be a bug in the hl_max_sequence_backward API, but I'm not sure what kind of input data would cause the illegal memory access. @jamestang0219, can you try `--trainer_count=1 --use_gpu=true` and see whether the same problem occurs?
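
For example, the same command reduced to a single GPU trainer (only `--trainer_count` changes; the log file name is just an example):

```bash
paddle train \
    --config=lstm.py \
    --save_dir=./2classoutput \
    --trainer_count=1 \
    --log_period=100 \
    --num_passes=16 \
    --use_gpu=true \
    --show_parameter_stats_period=1000 \
    --test_all_data_in_one_period=1 \
    2>&1 | tee 'lstm_train_2class_1gpu.log'
```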

@jamestang0219
Author

@hedaoyuan The same error occurred. But when I change the batch size, the problem never appears. I don't think the batch size itself should cause a CUDA error; maybe there are some bugs in the hl_max_sequence_backward API.

@hedaoyuan
Contributor

@jamestang0219
This commit adds some checking code to determine whether there is a problem with the input values.
Can you try with this commit?

@jamestang0219
Author

@hedaoyuan
Hello, I'm wondering about that: if there were a problem with the input values, the error should appear every time I train on the same data, but it does not appear every time. It only occurs with a large data set, for example more than 500,000 sentences. If I split the large data set into 2 or more pieces and train on only one piece, the error never occurs.
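
For reference, the split was done roughly like this (a sketch only; the corpus is assumed to be one sentence per line, and the file name and chunk size are placeholders):

```bash
# Break the corpus into chunks of 250,000 lines each:
# produces train_2class_part_aa, train_2class_part_ab, ...
split -l 250000 train_2class.txt train_2class_part_

# Sanity-check the chunk sizes.
wc -l train_2class_part_*
```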

@hedaoyuan
Contributor

There may be a problem with the input values, but it would not necessarily trigger a memory error every time.

@jamestang0219
Author

@hedaoyuan
the input values are mapped to embedding vectors; do you mean the error occurs during that step?

@hedaoyuan
Contributor

@jamestang0219 Maybe.
The hl_max_sequence_backward interface is relatively simple; it only runs into out-of-bounds memory access when there is a problem with the input values.
The root cause may be an input sequence whose length is zero.
I don't know how to check your data, but if the program trips either of the two assertions (`assert(tmpId >= 0); assert(tmpId < inputHeight);`), that will narrow down the problem.
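
If the training data is a plain text file with one sentence per line (an assumption; adjust for your actual data provider format), zero-length sequences usually show up as empty lines and can be located with grep:

```bash
# Count lines with no non-whitespace characters
# (candidate zero-length sequences); the file name is a placeholder.
grep -c -E '^[[:space:]]*$' train_2class.txt

# Print the offending line numbers for inspection.
grep -n -E '^[[:space:]]*$' train_2class.txt
```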

@jamestang0219
Author

@hedaoyuan
Thank you, I will try training my data set again with your branch, maybe tomorrow. Once I have the result, I will reply in this issue.

@reyoung
Collaborator

reyoung commented Nov 21, 2016

No response for a long time; please reopen if there are still problems here.

@reyoung reyoung closed this as completed Nov 21, 2016