
Trainer save model in distributed training. #2512

Closed
jacquesqiao opened this issue Jun 19, 2017 · 5 comments
@jacquesqiao (Member)
No description provided.

@jacquesqiao jacquesqiao self-assigned this Jun 19, 2017
helinwang (Contributor) commented Jun 20, 2017

Hi @jacquesqiao , please refer to this design doc for more information: checkpointing

Please note that the design doc is about checkpointing, but this issue is about save model.

The main differences are:

  1. Checkpointing saves additional state, such as optimizer state.
  2. Checkpointing does not involve merging checkpoint files from different Pservers into a single file, but saving a model does require merging the per-Pserver parameter files into one. Since merging is involved, some form of synchronization is needed.
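To make the merge step concrete, here is a minimal sketch, assuming (hypothetically) that each Pserver writes its parameter shard as a pickled dict mapping parameter names to values; this is not PaddlePaddle's actual parameter format, and all names are illustrative:

```python
import pickle

def merge_pserver_shards(shard_paths, output_path):
    """Merge per-Pserver parameter shard files into a single model file.

    Each shard file is assumed to hold a pickled dict of parameter
    name -> value; the merged model is the union of all shards.
    """
    model = {}
    for path in shard_paths:
        with open(path, "rb") as f:
            shard = pickle.load(f)
        # Parameters are partitioned across Pservers, so keys are disjoint.
        model.update(shard)
    with open(output_path, "wb") as f:
        pickle.dump(model, f)
    return model
```

The key point is that the merged output only makes sense once every shard file is complete, which is why the synchronization discussed below is needed.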

We can discuss synchronization ideas here. My initial idea, for your consideration: Pserver #0 could be the one that performs the merge, and Pserver #0 would watch etcd notifications so it knows when the other Pservers have finished saving their parameter shards.
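The synchronization idea above could be sketched roughly as follows. This is a minimal illustration using an in-memory stand-in for etcd (a real implementation would write etcd keys and use watch notifications instead of a condition variable); all key names and function names here are hypothetical:

```python
import threading

class FakeEtcd:
    """In-memory stand-in for the etcd keyspace, for illustration only."""

    def __init__(self):
        self._kv = {}
        self._cond = threading.Condition()

    def put(self, key, value):
        with self._cond:
            self._kv[key] = value
            self._cond.notify_all()  # stands in for an etcd watch event

    def wait_for_keys(self, keys, timeout=10):
        """Block until every key in `keys` exists (Pserver #0's watch)."""
        with self._cond:
            ok = self._cond.wait_for(
                lambda: all(k in self._kv for k in keys), timeout)
            if not ok:
                raise TimeoutError("not all Pservers reported done")

def pserver_save(etcd, pserver_id, shard_path):
    # ... write this Pserver's parameter shard to shard_path ...
    etcd.put("/save_model/done/%d" % pserver_id, shard_path)

def pserver0_merge(etcd, num_pservers):
    keys = ["/save_model/done/%d" % i for i in range(num_pservers)]
    etcd.wait_for_keys(keys)  # notified once every Pserver has saved
    # ... merge the shard files into a single model file ...
```

Each Pserver announces completion by writing a well-known key, and Pserver #0 starts merging only after all keys exist; with real etcd the same barrier falls out of watching a key prefix.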

Feel free to split this issue to different smaller issues, so that the size of each PR can be small.

helinwang (Contributor) commented Jun 20, 2017

Also, please see this discussion for a reference on implementing ID assignment for Pservers: #1620 (comment)

Edit: Sorry this is not very relevant. Keeping it here in case you are interested.

@jacquesqiao jacquesqiao removed their assignment Jun 20, 2017
@helinwang helinwang self-assigned this Jun 24, 2017
dzhwinter (Contributor)

Got it, thanks for the clear explanation! @helinwang

@helinwang helinwang removed their assignment Jun 28, 2017
Yancey1989 (Contributor)

Following the discussion in #2638, it seems that we are leaning toward saving the model on the trainer, so maybe we can close this issue?

@gongweibao gongweibao self-assigned this Jun 29, 2017
@gongweibao gongweibao changed the title from Pserver save model. to Pserver save model.@gongweibao Jun 29, 2017
@gongweibao gongweibao removed their assignment Jun 30, 2017
@gongweibao gongweibao changed the title from Pserver save model.@gongweibao to Pserver save model. Jun 30, 2017
@helinwang helinwang changed the title Pserver save model. Trainer save model in distributed training. Jul 11, 2017
helinwang (Contributor)

> Following the discussion in #2638, it seems that we are leaning toward saving the model on the trainer, so maybe we can close this issue?

@Yancey1989 Thanks! I have changed the title to Trainer save model in distributed training.
