Trainer save model in distributed training. #2512
Comments
Hi @jacquesqiao, please refer to this design doc for more information: checkpointing. Please note that the design doc is about checkpointing, but this issue is about saving the model. The main differences are:
We can discuss the synchronization ideas. Here is my initial idea for your consideration: maybe Pserver #0 can be the one that performs the merge process. Pserver #0 would watch etcd for notifications, so that it is notified when the other Pservers have finished saving their part of the model. Feel free to split this issue into smaller issues, so that each PR can stay small.
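To make the idea above concrete, here is a minimal sketch of the synchronization pattern, not the actual Paddle implementation. It replaces etcd with a hypothetical in-memory key-value store (`FakeKV`) so that it runs on its own; the key prefix `/save_model/done/`, the function names, and the dict-based "shards" are all assumptions made for illustration.

```python
import threading


class FakeKV:
    """Hypothetical in-memory stand-in for etcd: put() plus a blocking wait."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def put(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()  # wake up any watcher

    def wait_for_count(self, prefix, count, timeout=5.0):
        """Block until `count` keys start with `prefix`, then return them."""
        def ready():
            return sum(k.startswith(prefix) for k in self._data) >= count

        with self._cond:
            if not self._cond.wait_for(ready, timeout=timeout):
                raise TimeoutError("not all pservers reported in time")
            return {k: v for k, v in self._data.items() if k.startswith(prefix)}


DONE_PREFIX = "/save_model/done/"  # assumed key layout, not from the issue


def pserver_save(kv, index, shard, saved_shards):
    """Each pserver saves its own shard, then announces completion."""
    saved_shards[index] = dict(shard)  # stand-in for writing the shard to disk
    kv.put(f"{DONE_PREFIX}{index}", "ok")


def merge_on_pserver0(kv, num_pservers, saved_shards):
    """Pserver #0 waits for every pserver to report, then merges the shards."""
    kv.wait_for_count(DONE_PREFIX, num_pservers)
    merged = {}
    for index in range(num_pservers):
        merged.update(saved_shards[index])
    return merged


def run_demo():
    kv = FakeKV()
    shards = [{"w0": 1.0}, {"w1": 2.0}, {"b": 0.5}]
    saved = {}
    threads = [
        threading.Thread(target=pserver_save, args=(kv, i, s, saved))
        for i, s in enumerate(shards)
    ]
    for t in threads:
        t.start()
    merged = merge_on_pserver0(kv, len(shards), saved)
    for t in threads:
        t.join()
    return merged
```

With a real etcd cluster, the blocking wait would presumably be replaced by a watch on the same key prefix, and the shards would be files on shared storage rather than in-memory dicts.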
Also, please see this discussion for a reference on how to implement getting IDs for Pservers: #1620 (comment). Edit: Sorry, this is not very relevant. Keeping it here in case you are interested.
Got it, thanks for the clear explanation! @helinwang
Following the discussion in #2638, it seems that we are declining to save the model on the trainer, so maybe we can close this issue?
@Yancey1989 Thanks! I have changed the title to
No description provided.