train.log reports a connection error #596

Closed
sarawon opened this issue Nov 24, 2016 · 4 comments

sarawon commented Nov 24, 2016

I am running in cluster mode, and the run hangs when it executes the start_trainer task:
[root@192.168.30.131:8023] Executing task 'start_trainer'
[root@192.168.30.131:8023] run: cd /root/paddle/demo/recommendation; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=192.168.30.131,192.168.30.179 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=4 --use_gpu=0 --num_passes=10 --save_dir=./output --log_period=50 --dot_period=10 --saving_period=1 --local=0 --trainer_id=0 > ./log/train.log 2>&1 < /dev/null &
[root@192.168.30.131:8023] out: stdin: is not a tty
[root@192.168.30.131:8023] out:

[root@192.168.30.179:8023] Executing task 'start_trainer'
[root@192.168.30.179:8023] run: cd /root/paddle/demo/recommendation; GLOG_logtostderr=0 GLOG_log_dir="./log" nohup paddle train --num_gradient_servers=2 --nics=eth0 --port=7164 --ports_num=2 --comment=paddle_process_by_paddle --pservers=192.168.30.131,192.168.30.179 --ports_num_for_sparse=2 --config=./trainer_config.py --trainer_count=4 --use_gpu=0 --num_passes=10 --save_dir=./output --log_period=50 --dot_period=10 --saving_period=1 --local=0 --trainer_id=1 > ./log/train.log 2>&1 < /dev/null &
[root@192.168.30.179:8023] out: stdin: is not a tty
[root@192.168.30.179:8023] out:

Contents of train.log:
[INFO 2016-11-24 07:17:26,152 networks.py:1466] The input order is [movie_id, title, genres, user_id, gender, age, occupation, rating]
[INFO 2016-11-24 07:17:26,152 networks.py:1472] The output order is [regression_cost_0]
F1124 07:17:26.942348 352 LightNetwork.cpp:379] Check failed: connect(sockfd, (sockaddr *)&serv_addr, sizeof(serv_addr)) >= 0 ERROR connecting to 192.168.30.131: Connection refused [111]
*** Check failure stack trace: ***
@ 0x7f1604a93daa (unknown)
@ 0x7f1604a93ce4 (unknown)
@ 0x7f1604a936e6 (unknown)
@ 0x7f1604a934fb (unknown)
@ 0x7f1604a94477 (unknown)
@ 0x69552e paddle::SocketClient::TcpClient()
@ 0x696051 paddle::SocketClient::SocketClient()
@ 0x7eaa76 std::vector<>::emplace_back<>()
@ 0x7e1be3 paddle::ParameterClient2::init()
@ 0x68e2dd paddle::RemoteParameterUpdater::init()
@ 0x678de2 paddle::Trainer::init()
@ 0x5132a9 main
@ 0x7f1603c9ff45 (unknown)
@ 0x51f2a5 (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 109: 352 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

Contents of server.log:
F1124 07:19:03.638399 418 SocketChannel.cpp:180] Check failed: len == sizeof(header) : Success [0]
*** Check failure stack trace: ***
@ 0x7f9fb4dbfdaa (unknown)
@ 0x7f9fb4dbfce4 (unknown)
@ 0x7f9fb4dbf6e6 (unknown)
@ 0x7f9fb4dbf4fb (unknown)
@ 0x7f9fb4dc0477 (unknown)
@ 0x667eb8 paddle::SocketChannel::readMessage()
@ 0x6657dc paddle::SocketWorker::run()
@ 0x7f9fb493ca60 (unknown)
@ 0x7f9fb5bd2184 start_thread
@ 0x7f9fb40a437d (unknown)
@ (nil) (unknown)
/usr/local/bin/paddle: line 109: 321 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_pserver_main ${@:2}

sarawon commented Nov 24, 2016

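Is there a problem with my cluster configuration? I started two docker containers, one on each of two host machines.
Port 22 of each docker container is mapped to port 8022 on its host.
Ports 7164 to 7167 are mapped to ports 7164 to 7167 on the host.
hosts in conf.py is set to root@<host IP>:8022.
I then ran run.sh under cluster_train.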
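
The port range described above lines up with the trainer flags in the first comment: `--port=7164`, `--ports_num=2` and `--ports_num_for_sparse=2` add up to four ports, 7164 through 7167. The Python sketch below spells out that arithmetic. It assumes the parameter servers use consecutive ports starting at `--port`, which is inferred from this thread rather than confirmed by it, and `required_ports` is a hypothetical helper name, not part of Paddle.

```python
import shlex


def required_ports(command):
    """Return the ports a `paddle train` command line is assumed to need.

    Assumption (inferred from this thread, not confirmed): the ports are
    consecutive, start at --port, and number
    ports_num + ports_num_for_sparse in total.
    """
    flags = {}
    for token in shlex.split(command):
        # Only `--name=value` style flags are of interest here.
        if token.startswith("--") and "=" in token:
            name, value = token[2:].split("=", 1)
            flags[name] = value
    base = int(flags["port"])
    count = int(flags["ports_num"]) + int(flags["ports_num_for_sparse"])
    return list(range(base, base + count))


if __name__ == "__main__":
    cmd = "paddle train --port=7164 --ports_num=2 --ports_num_for_sparse=2"
    # Each listed port would need its own `-p host:container` docker mapping.
    print(" ".join("-p {0}:{0}".format(p) for p in required_ports(cmd)))
```

Every port in the returned list has to be published from the container on both hosts, since each trainer connects to every address in `--pservers`.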

backyes commented Nov 24, 2016

@sarawon Could you provide the complete logs?

sarawon commented Nov 24, 2016

Which other logs do you need? I can post them. The two files above are already pasted in full.

sarawon commented Nov 24, 2016

Some docker port mappings were missing. It works now.
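
A missing port mapping of this kind shows up on the trainer side as `Connection refused [111]`. One way to spot it before starting the trainers is to try a plain TCP connection to every parameter-server port. The sketch below is only an illustration: `unreachable_ports` is a hypothetical helper, not a Paddle tool, and it has to be run while the pserver processes are up, otherwise every port is reported as unreachable.

```python
import socket


def unreachable_ports(host, ports, timeout=2.0):
    """Return the ports on `host` that do not accept a TCP connection."""
    failed = []
    for port in ports:
        try:
            # create_connection resolves the host and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers "Connection refused" (errno 111 on Linux) and timeouts.
            failed.append(port)
    return failed
```

For the cluster in this thread, a call such as `unreachable_ports("192.168.30.131", range(7164, 7168))`, repeated for `192.168.30.179`, would list any port that is not published from the container.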

sarawon closed this as completed Nov 24, 2016