
1. Does trainer_count=N mean N GPUs are used? 2. Why does training with 6 GPUs throw a memory error, while training with 1-4 GPUs works fine? #307

Closed
buptzzl opened this issue Nov 2, 2016 · 5 comments

buptzzl commented Nov 2, 2016

1. Does trainer_count=N mean N GPUs are used? 2. Why does training with 6 GPUs throw a memory error, while training with 1-4 GPUs works fine?

gangliao (Contributor) commented Nov 2, 2016

@buptzzl @hedaoyuan

See the bottom of this documentation page:
http://www.paddlepaddle.org/doc/ui/cmd_argument/use_case.html?highlight=trainer_count
How many GPU cards does your machine have? Also, please paste the detailed error message. Thanks.

hedaoyuan (Contributor) commented Nov 2, 2016

@buptzzl

  1. trainer_count=N means training runs with N worker threads. If your system has at least N GPUs, N GPUs will be used for training; if it has fewer than N GPUs, an error is raised by default (and setting N larger than the number of GPUs is generally not recommended). To run paddle with N greater than the GPU count, add the `--allow_only_one_model_on_one_gpu=false` flag on the command line (see the sketch after this list).
  2. As for the memory error with 6 GPUs: what exactly was the memory error that was reported?
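
A minimal sketch of such an invocation (the config file name trainer_config.py is a placeholder, not from this issue; the flags are the ones discussed above and in the linked docs):

  # Use 4 training threads, one model copy per GPU (sketch only).
  paddle train \
    --config=trainer_config.py \
    --use_gpu=1 \
    --trainer_count=4

  # If trainer_count must exceed the number of physical GPUs, also add:
  #   --allow_only_one_model_on_one_gpu=false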

buptzzl (Author) commented Nov 2, 2016

@gangliao The machine has 16 GPUs and 24 CPUs.
@hedaoyuan With a relatively small number of samples (about 180k), the following error is reported:
I1102 16:53:13.679253 23091 TrainerInternal.cpp:179] Pass=3 Batch=911 samples=182171 AvgCost=0.23115 Eval: classification_error_evaluator=0.014179 classification_error_evaluator=0.014179 auc_evaluator_0=0.573843-
F1102 16:53:14.301820 24853 hl_cuda_device.cc:646] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7ff725cc2a3d google::LogMessage::Fail()
@ 0x7ff725cc6ed7 google::LogMessage::SendToLog()
@ 0x7ff725cc4d39 google::LogMessage::Flush()
@ 0x7ff725cc503d google::LogMessageFatal::~LogMessageFatal()
@ 0x98585f hl_stream_synchronize()
@ 0x991ae4 hl_matrix_csr_mul_dense()
@ 0x796dac paddle::GpuMatrix::mul()
@ 0x7a8a64 paddle::GpuMatrix::mul()
@ 0x600f81 paddle::FullyConnectedLayer::forward()
@ 0x6f42b6 paddle::NeuralNetwork::forward()
@ 0x6ea073 paddle::TrainerThread::forward()
@ 0x6eb538 paddle::TrainerThread::computeThread()
@ 0x7ff72580d2a8 execute_native_thread_routine
@ 0x318b207851 (unknown)
@ 0x318aee767d (unknown)

The corresponding configuration in train.sh:
--trainer_count=6
--log_period=200
--num_passes=100
--use_gpu=1
--show_parameter_stats_period=100
--test_all_data_in_one_period=1 \
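
For reference (not from this issue), one way to confirm how many GPU cards the trainer can actually see, and to pin a six-GPU run to specific devices, is a sketch along these lines:

  # List the CUDA devices on the machine (standard NVIDIA tool, not PaddlePaddle-specific).
  nvidia-smi -L

  # Optionally restrict the run to the first 6 cards so that trainer_count=6
  # maps onto exactly those devices:
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 sh train.sh

Whether this changes the crash depends on the root cause; the reply below points to a sparse-matrix bug rather than a configuration problem.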

hedaoyuan (Contributor) commented Nov 2, 2016

There is a bug fix for the sparse matrix calculation in #133. Try v0.8.0beta.1.
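
To check which build is installed before and after upgrading, the old paddle command-line wrapper can print its version (a sketch; assumes the paddle script is on PATH, and the exact output format may vary between builds):

  paddle version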

@buptzzl
Copy link
Author

buptzzl commented Nov 2, 2016

@hedaoyuan Thanks!

buptzzl closed this as completed Nov 2, 2016