
1. Does trainer_count=N mean N GPUs are used? 2. Why does training with 6 GPUs throw a memory error, while training with 1-4 GPUs works fine? #307

Closed
buptzzl opened this issue Nov 2, 2016 · 5 comments

buptzzl commented Nov 2, 2016

1. Does trainer_count=N mean N GPUs are used? 2. Why does training with 6 GPUs throw a memory error, while training with 1-4 GPUs works fine?

gangliao (Contributor) commented Nov 2, 2016

@buptzzl @hedaoyuan

See the bottom of this documentation page:
http://www.paddlepaddle.org/doc/ui/cmd_argument/use_case.html?highlight=trainer_count
How many GPU cards does your machine have? Also, please paste the detailed error message. Thanks.

hedaoyuan (Contributor) commented Nov 2, 2016

@buptzzl

  1. trainer_count=N means training runs with N worker threads. If your system has at least N GPUs, N GPUs will be used for training; if it has fewer than N GPUs, an error is raised by default (and setting N larger than the number of GPUs is generally not recommended). To run paddle with N greater than the GPU count, add the `--allow_only_one_model_on_one_gpu=false` flag on the command line (see the sketch after this list).
  2. As for the memory error with 6 GPUs: what exactly was the memory error that was reported?
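
A minimal sketch of such an invocation (the config file name trainer_config.py is a placeholder, not from this issue; the flags are the ones discussed above and in the linked docs):

  # Use 4 training threads, one model copy per GPU (sketch only).
  paddle train \
    --config=trainer_config.py \
    --use_gpu=1 \
    --trainer_count=4

  # If trainer_count must exceed the number of physical GPUs, also add:
  #   --allow_only_one_model_on_one_gpu=false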

buptzzl (Author) commented Nov 2, 2016

@gangliao The machine has 16 GPUs and 24 CPUs.
@hedaoyuan With a relatively small number of samples (about 180k), the following error is reported:
I1102 16:53:13.679253 23091 TrainerInternal.cpp:179] Pass=3 Batch=911 samples=182171 AvgCost=0.23115 Eval: classification_error_evaluator=0.014179 classification_error_evaluator=0.014179 auc_evaluator_0=0.573843-
F1102 16:53:14.301820 24853 hl_cuda_device.cc:646] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7ff725cc2a3d google::LogMessage::Fail()
@ 0x7ff725cc6ed7 google::LogMessage::SendToLog()
@ 0x7ff725cc4d39 google::LogMessage::Flush()
@ 0x7ff725cc503d google::LogMessageFatal::~LogMessageFatal()
@ 0x98585f hl_stream_synchronize()
@ 0x991ae4 hl_matrix_csr_mul_dense()
@ 0x796dac paddle::GpuMatrix::mul()
@ 0x7a8a64 paddle::GpuMatrix::mul()
@ 0x600f81 paddle::FullyConnectedLayer::forward()
@ 0x6f42b6 paddle::NeuralNetwork::forward()
@ 0x6ea073 paddle::TrainerThread::forward()
@ 0x6eb538 paddle::TrainerThread::computeThread()
@ 0x7ff72580d2a8 execute_native_thread_routine
@ 0x318b207851 (unknown)
@ 0x318aee767d (unknown)

The corresponding configuration in train.sh:
--trainer_count=6
--log_period=200
--num_passes=100
--use_gpu=1
--show_parameter_stats_period=100
--test_all_data_in_one_period=1 \
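
For reference (not from this issue), one way to confirm how many GPU cards the trainer can actually see, and to pin a six-GPU run to specific devices, is a sketch along these lines:

  # List the CUDA devices on the machine (standard NVIDIA tool, not PaddlePaddle-specific).
  nvidia-smi -L

  # Optionally restrict the run to the first 6 cards so that trainer_count=6
  # maps onto exactly those devices:
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 sh train.sh

Whether this changes the crash depends on the root cause; the reply below points to a sparse-matrix bug rather than a configuration problem.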

hedaoyuan (Contributor) commented Nov 2, 2016

There is a bug fix for the sparse matrix calculation in #133. Try v0.8.0beta.1.
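
To check which build is installed before and after upgrading, the old paddle command-line wrapper can print its version (a sketch; assumes the paddle script is on PATH, and the exact output format may vary between builds):

  paddle version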

@buptzzl
Copy link
Author

buptzzl commented Nov 2, 2016

@hedaoyuan Thanks!

buptzzl closed this as completed Nov 2, 2016