Saving models/parameters in fault tolerant training #2638
Thanks @typhoonzero for leading the discussion! Here is my summary of possible solutions. There are two ways for saving the model:
1. Save the model as one single model file.
2. Let each pserver save a snapshot of its own parameter shard.
For 2, we need to provide users a way to convert the snapshots saved in different places by different pservers into one single model file (maybe not immediately required). For 1, there are two ways:
a. The trainer saves the model.
b. The pservers save the model (this can share code with the snapshot process, but needs a pserver that converts the saved snapshots into a single model).
After some more thinking, I am currently inclined to let the trainer save the model, because it works even when a distributed filesystem is not present. Would love to know what you guys think!
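As a rough illustration of the "model merge tool" mentioned for option 2, here is a minimal sketch in Python. The on-disk format (one pickled dict of arrays per pserver shard) and all names are assumptions for illustration, not the actual Paddle snapshot format or tooling:

```python
import glob
import pickle

def merge_pserver_snapshots(snapshot_glob, output_path):
    """Merge per-pserver parameter shards into a single model file."""
    merged = {}
    for shard_path in sorted(glob.glob(snapshot_glob)):
        with open(shard_path, "rb") as f:
            shard = pickle.load(f)   # assumed layout: {param_name: array}
        merged.update(shard)         # shards hold disjoint parameter names
    with open(output_path, "wb") as f:
        pickle.dump(merged, f)

# Hypothetical usage:
# merge_pserver_snapshots("/dfs/job-1/pserver-*.snapshot", "/dfs/job-1/model.bin")
```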
I am also inclined to have the trainer save the model, for the following reasons:
FROM @typhoonzero
I think the trainer also needs an API to do the snapshot; it will save the model and call …
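A minimal sketch of what such a trainer-side snapshot API could look like; the function and the `to_tar` serialization call are placeholders, not a confirmed Paddle API:

```python
def save_snapshot(parameters, pass_id, path_pattern="./model_pass_%05d.tar"):
    """Persist the trainer's current parameters after a training pass."""
    path = path_pattern % pass_id
    with open(path, "wb") as f:
        # `to_tar` stands in for whatever serialization call the trainer
        # API ends up exposing; treat it as a placeholder.
        parameters.to_tar(f)
```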
@helinwang
Considering decoupling the pserver from the etcd cluster as you guys mentioned, saving the model in a leader trainer sounds like a good idea. There are some problems with this approach, though.
@dzhwinter Thanks for the feedback! Very valuable.
Yes, when using etcd we can elect a leader trainer. When training on MPI without etcd, the trainers could have trainer IDs, and the trainer with ID 0 could be the leader.
In my opinion, pollution is fine: as usual in deep learning, the training process is stochastic. Also, when doing ASGD training, while one trainer is downloading the model we allow the model to be modified by other trainers uploading their gradients during the download. So the training process already involves "pollution". Maybe we can allow it, unless it is proven to cause more harm than the benefit (system simplicity).
I think we can know who the leader is. When using etcd, we can elect a leader through etcd. When doing MPI, trainer 0 can be the leader.
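A minimal sketch of the two leader-selection paths described above; all names are illustrative, not existing Paddle APIs:

```python
def is_leader_mpi(trainer_id):
    # MPI without etcd: trainer IDs come from MPI ranks; rank 0 leads.
    return trainer_id == 0

def is_leader_etcd(my_trainer_id, elected_id):
    # With etcd: `elected_id` is assumed to be the ID of the trainer that
    # won an etcd-based election (e.g. first to create a well-known key).
    return my_trainer_id == elected_id
```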
@typhoonzero @Yancey1989 @dzhwinter Thanks for the comments! It seems we all think that letting the trainer save the model would be a good idea (please let me know otherwise). I have created a PR summarizing our discussion. Please review: #2655
@helinwang
Assuming model pollution is allowed, the time window during which the model can be polluted is:
If we want to guarantee that accuracy is not affected at all, we would need to double the memory (because we don't know which machine will be selected as trainer 0; even on an MPI cluster this is decided randomly by the MPI role function), hence the conclusion in the comment above.
@dzhwinter Got it! There is an assumption here that a model is worse if some of its parameters are stale and some are fresh. I don't think that necessarily holds. The model is updated by the gradient computed on a mini-batch, multiplied by a very small step size; the updated model is only better with respect to that particular mini-batch. Because the randomness is large, we cannot say whether the effect on the test set is positive or negative.
Right, whether this effect helps or hurts is hard to say. Just wanted to point this out; that's all I can think of for now.
@dzhwinter Understood. Since the first version does not support sparse updates, can we put this on the TODO list for now and weigh the trade-off later?
Related PR: #2634
In a discussion with @helinwang this morning, the previous thought was to save parameters to a distributed storage service by merging the parameters from all pservers.
In general there are two ways:
1. Save the model from the trainer: the trainer which satisfies `hash(trainer_ip) % trainer_count == 0` saves the model to the distributed storage service.
2. Save the model from the pservers: to trigger snapshot saving, each pserver saves its parameters on the distributed filesystem; this also saves the pserver status for recovering. Users can use a "model merge tool" to merge all the parts of the model and then use it.
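A minimal sketch of the trainer-selection rule quoted in option 1; `zlib.crc32` is just one concrete, deterministic choice of hash, which the discussion above does not fix:

```python
import zlib

def should_save_model(trainer_ip, trainer_count):
    # Only the trainer whose IP hashes to 0 modulo the trainer count
    # saves the model.
    return zlib.crc32(trainer_ip.encode()) % trainer_count == 0
```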
Note: when users want to stop the training and use the current output model, they can stop the job right away, because the job saves the model of every pass to the distributed storage service.