
Fault tolerant distributed training, just work version, with etcd #2849

Merged (7 commits) Jul 17, 2017

Conversation

typhoonzero (Contributor) commented Jul 13, 2017

This is a "just work" version, supporting distributed training with etcd and master process.

@typhoonzero typhoonzero changed the title [WIP] Fault tolerant distributed training, work in progress Fault tolerant distributed training, just work version, with etcd Jul 14, 2017
helinwang (Contributor) left a comment:

LGTM, plus a few minor comments.

@@ -38,7 +39,8 @@ func (c *Client) getRecords() {
if err != nil {
// TODO(helin): wait before move on with next
Contributor:

Could you remove the TODO, since it's completed?
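The TODO above asked for a wait before moving on after a failed fetch. The PR's Go client handles this; as an illustration of the same pattern, here is a minimal, hypothetical Python sketch of retrying a record fetch with exponential backoff (helper names are illustrative, not part of PaddlePaddle):

```python
import time

def get_records_with_retry(fetch, max_retries=3, base_delay=0.1):
    """Retry a record fetch, sleeping with exponential backoff between
    attempts instead of spinning on errors. `fetch` is any callable that
    returns a record or raises IOError on a transient failure."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return fetch()
        except IOError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)  # wait before moving on to the next attempt
            delay *= 2
```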

Contributor (Author):

Done.

etcd_endpoint = "http://" + etcd_ip + ":2379"


def cloud_reader():
Contributor:

We can put this into paddle.reader.recordio in the future, so that paddle.reader.recordio supports both local and cloud readers (not necessary in this PR).

Contributor (Author):

Yes! Will do this in the next PR.
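The unified reader discussed above could dispatch on the path: etcd endpoints go to the cloud reader, filesystem paths to a local reader. A minimal sketch under that assumption (function names are hypothetical, not the PaddlePaddle API):

```python
def make_reader(path):
    """Return a reader creator for `path`: a cloud reader for etcd
    endpoints, a local reader for filesystem paths."""
    is_cloud = path.startswith(("http://", "etcd://"))

    def cloud_reader():
        # Would fetch record chunks from the master service via etcd;
        # the actual fetch is elided in this sketch.
        raise NotImplementedError("cloud fetch elided in this sketch")

    def local_reader():
        # Read the local file record by record (line-delimited here
        # for simplicity; real recordio framing differs).
        with open(path, "rb") as f:
            for line in f:
                yield line

    return cloud_reader if is_cloud else local_reader
```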

@@ -51,7 +51,8 @@ def __init__(self,
update_equation,
extra_layers=None,
is_local=True,
pserver_spec=None):
pserver_spec=None,
use_etcd=True):
Contributor:

Maybe we can make the default require fewer dependencies by defaulting use_etcd to False.

Contributor (Author):

use_etcd currently defaults to True.
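With use_etcd defaulting to True, a purely local run has to opt out explicitly, which is the dependency concern raised above. A trimmed, hypothetical sketch of the constructor shape in this diff (not the full PaddlePaddle implementation):

```python
class SGD(object):
    """Minimal stand-in for the trainer signature after this PR; only the
    two arguments under discussion are kept."""
    def __init__(self, pserver_spec=None, use_etcd=True):
        self.pserver_spec = pserver_spec
        self.use_etcd = use_etcd

# A local run must opt out explicitly under the current default:
trainer = SGD(use_etcd=False)
```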

@typhoonzero typhoonzero merged commit 23b8346 into PaddlePaddle:develop Jul 17, 2017
import os
import cPickle as pickle

etcd_ip = os.getenv("MASTER_IP", "127.0.0.1")
Contributor:

Sorry, I missed this last time. Shouldn't "MASTER_IP" be changed to "ETCD_IP"? The master doesn't seem related to etcd.

Contributor (Author):

Currently, when paddlecloud starts a job, etcd runs inside the master pod. We will consider changing this later, once multiple jobs share a single etcd.
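One hypothetical compromise for the naming discussion above: prefer an explicit ETCD_IP and fall back to MASTER_IP, since etcd currently runs in the master pod. A sketch (the ETCD_IP variable is an assumption, not an existing paddlecloud convention):

```python
import os

def resolve_etcd_endpoint(env=os.environ):
    """Build the etcd endpoint, preferring ETCD_IP over MASTER_IP and
    defaulting to localhost, mirroring the snippet in this diff."""
    ip = env.get("ETCD_IP", env.get("MASTER_IP", "127.0.0.1"))
    return "http://" + ip + ":2379"
```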

parameterClient_(-1),
newParameters_(nullptr),
newGradients_(nullptr),
pserverSpec_(pserverSpec),
Contributor:

This field sometimes holds the pserver addresses and sometimes the etcd address. Should it be renamed to something like remoteAddr?

@typhoonzero typhoonzero deleted the etcd_cord branch August 11, 2017 06:52