train.log reports a connection error #596
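Is there a problem with my cluster configuration? I started two Docker containers, one on each of two host machines.
@sarawon Could you provide the complete logs?
Which other logs do you need? I can paste them. The two files above are already pasted in full.
Some Docker port mappings were missing. It works now.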
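For anyone hitting the same symptom, here is a minimal sketch for checking that the parameter-server ports are reachable from each trainer host. It assumes that the pserver listens on a contiguous range starting at `--port` and spanning `--ports_num + --ports_num_for_sparse` ports. That range is a reading of the flags in the command above and is not confirmed in this thread, which also does not say which mappings were missing. The helper names `pserver_ports`, `docker_publish_flags`, and `probe` are hypothetical.

```python
import socket


def pserver_ports(base_port, ports_num, ports_num_for_sparse):
    """List the ports a pserver is ASSUMED to use: a contiguous range from base_port."""
    return list(range(base_port, base_port + ports_num + ports_num_for_sparse))


def docker_publish_flags(ports):
    """Build one `-p host_port:container_port` flag per port for `docker run`."""
    return [f"-p {port}:{port}" for port in ports]


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the flags from this issue, `pserver_ports(7164, 2, 2)` gives ports 7164 through 7167, and `docker_publish_flags` turns them into the `-p` arguments to add when starting each container. Running `probe("192.168.30.131", port)` for each of those ports from the other host, after the pservers have started, should show which ones are still unreachable.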
I am running in cluster mode, and it hangs when executing the start_trainer task:
[root@192.168.30.131:8023] Executing task 'start_trainer'
[root@192.168.30.131:8023] run: cd /root/paddle/demo/recommendation; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=192.168.30.131,192.168.30.179 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=4 --use_gpu=0 --num_passes=10 --save_dir=./output --log_period=50 --dot_period=10 --saving_period=1 --local=0 --trainer_id=0 > ./log/train.log 2>&1 < /dev/null &
[root@192.168.30.131:8023] out: stdin: is not a tty
[root@192.168.30.131:8023] out:
[root@192.168.30.179:8023] Executing task 'start_trainer'
[root@192.168.30.179:8023] run: cd /root/paddle/demo/recommendation; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=192.168.30.131,192.168.30.179 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=4 --use_gpu=0 --num_passes=10 --save_dir=./output --log_period=50 --dot_period=10 --saving_period=1 --local=0 --trainer_id=1 > ./log/train.log 2>&1 < /dev/null &
[root@192.168.30.179:8023] out: stdin: is not a tty
[root@192.168.30.179:8023] out:
Contents of train.log:
$MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}
[INFO 2016-11-24 07:17:26,152 networks.py:1466] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating]
[INFO 2016-11-24 07:17:26,152 networks.py:1472] The output order is [regression_cost_0]
F1124 07:17:26.942348 352 LightNetwork.cpp:379] Check failed: connect(sockfd, (sockaddr *)&serv_addr, sizeof(serv_addr)) >= 0 ERROR connecting to 192.168.30.131: Connection refused [111]
*** Check failure stack trace: ***
@ 0x7f1604a93daa (unknown)
@ 0x7f1604a93ce4 (unknown)
@ 0x7f1604a936e6 (unknown)
@ 0x7f1604a934fb (unknown)
@ 0x7f1604a94477 (unknown)
@ 0x69552e paddle::SocketClient::TcpClient()
@ 0x696051 paddle::SocketClient::SocketClient()
@ 0x7eaa76 std::vector<>::emplace_back<>()
@ 0x7e1be3 paddle::ParameterClient2::init()
@ 0x68e2dd paddle::RemoteParameterUpdater::init()
@ 0x678de2 paddle::Trainer::init()
@ 0x5132a9 main
@ 0x7f1603c9ff45 (unknown)
@ 0x51f2a5 (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 109: 352 Aborted (core dumped) ${DEBUGGER}
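The `Connection refused [111]` in the trace above is the Linux `ECONNREFUSED` errno: the target address answered, but nothing accepted connections on that port. The sketch below is illustrative rather than part of Paddle. It reproduces the same error class locally so that it can be told apart from a timeout, which would instead suggest dropped packets. The helper names `connect_error` and `free_local_port` are hypothetical.

```python
import errno
import socket


def connect_error(host, port, timeout=2.0):
    """Try a TCP connect; return None on success, otherwise the errno of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except socket.timeout:
        # A timeout carries no errno of its own, so report ETIMEDOUT explicitly.
        return errno.ETIMEDOUT
    except OSError as exc:
        return exc.errno


def free_local_port():
    """Bind to an ephemeral local port and release it, so nothing is listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]
```

Calling `connect_error("127.0.0.1", free_local_port())` is expected to return `errno.ECONNREFUSED`, the same condition the trainer hit when connecting to 192.168.30.131.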
Contents of server.log:
$MYDIR/../opt/paddle/bin/paddle_pserver_main ${@:2}
F1124 07:19:03.638399 418 SocketChannel.cpp:180] Check failed: len == sizeof(header) : Success [0]
*** Check failure stack trace: ***
@ 0x7f9fb4dbfdaa (unknown)
@ 0x7f9fb4dbfce4 (unknown)
@ 0x7f9fb4dbf6e6 (unknown)
@ 0x7f9fb4dbf4fb (unknown)
@ 0x7f9fb4dc0477 (unknown)
@ 0x667eb8 paddle::SocketChannel::readMessage()
@ 0x6657dc paddle::SocketWorker::run()
@ 0x7f9fb493ca60 (unknown)
@ 0x7f9fb5bd2184 start_thread
@ 0x7f9fb40a437d (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 109: 321 Aborted (core dumped) ${DEBUGGER}