From 5157ba692d53657c96f41c0a380219fe7a7a6b5a Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Wed, 28 Jun 2017 20:25:56 +0000 Subject: [PATCH 1/3] create save model design doc --- doc/design/cluster_train/save_model.md | 100 +++++++++++++++++++++++++ 1 file changed, 100 insertions(+) create mode 100644 doc/design/cluster_train/save_model.md diff --git a/doc/design/cluster_train/save_model.md b/doc/design/cluster_train/save_model.md new file mode 100644 index 0000000000000..3a9a24fb9cef6 --- /dev/null +++ b/doc/design/cluster_train/save_model.md @@ -0,0 +1,100 @@ +# Design Doc: Save Model + +## Overview + +The model is the output of the training process. There are two +ways from which user can obtain a model: + +- Save model triggered by user code: user code asks PaddlePaddle to + save a model. +- Convert model from the snapshot: model being converted from + pservers' periodic snapshot. In this way, the user can cancel a job + at any time, and still have a relatively fresh model (we snapshot + around every 5 minutes). + +### Save Model Triggered by User Code + +Both trainers and pservers have access to the model. So the model can +be saved from a trainer or pservers. We need to decide on where the +model is saved from. + +#### Dense Model vs. Sparse Model + +There are two types of model: dense and sparse model (when the +parameter is configured to be sparse). Pservers always jointly have +the entire model at any given time. Trainers only have the entire +dense model, but only have a fraction of the sparse model at any given +time. + +#### Pservers Saving Model + +The benefit of letting pservers save model is they have the entire +model all the time. However, since pservers are on different nodes, it +requires a merging process to merge model shards into the same +model. Thus requires the pservers to write models to a distributed +filesystem, making the snapshot shards visible to the merge program. 
+ +#### Trainer Saving Model + +The benefit of letting one trainer to save the model is it does not +require a distributed filesystem. And it's reusing the same save model +logic when the trainer is training locally - except when training +sparse model, the trainer needs to download the entire sparse model +during the saving process. + +#### Conclusion + +Given trainer saving model does not require a distributed filesystem, +and is an intuitive extension to training locally, we decide to let +the trainer save the model. + + +### Convert Model from Snapshot + +TODO + + +## Timeline + +We first implement trainer save the model. Converting the latest +snapshot to a model will be a TODO for future. + + +## Trainer Save Model + +### Trainer Election + +One trainer will be elected as the one to save the model. When using +etcd, trainer ID is a randomly generated UUID, we will utilize etcd to +elect one trainer. When not using etcd, unique trainer IDs will be +given by the administrator, the trainer whose ID is "0" is elected to +save the model. + +### Model Save Path + +Each trainer will be given the directory to save the model. The +elected trainer will save the model to +`given-directory/trainerID`. Since the tainerID is unique, this would +prevent concurrent save to the same file when multiple trainers are +elected to save the model when split-brain problem happens. + +### What Happens When Model Is Saving + +It takes some time to save model, we need to define what will happen +when save model is taking place. + +When saving a dense model, the trainer uses the local model. Pservers +does not need to pause model update. + +When saving a sparse model. The trainer needs to download the entire +sparse model while saving. To get the most accurate model, the model +update needs to be paused before the download starts and resumed after +the download finishes. 
Otherwise, the trainer gets a model that is +"polluted": some part of the model is old, some part of the model is +new. + +It's unclear that the "polluted" model will be inferiod due to the +stochastic nature of deep learning, and pausing the model update will +add more complexity to the system. Since supporting sparse model is a +TODO item. We defer the evaluation of pause the model update or not +during saving model to the future. From 7c066f6e3e43cfc2b43d46f5e860a291b125b3d4 Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Fri, 30 Jun 2017 00:45:07 +0000 Subject: [PATCH 2/3] fix according to comments --- doc/design/cluster_train/save_model.md | 52 +++++++++++++++----------- 1 file changed, 31 insertions(+), 21 deletions(-) diff --git a/doc/design/cluster_train/save_model.md b/doc/design/cluster_train/save_model.md index 3a9a24fb9cef6..76ac8d8387073 100644 --- a/doc/design/cluster_train/save_model.md +++ b/doc/design/cluster_train/save_model.md @@ -7,24 +7,34 @@ ways from which user can obtain a model: - Save model triggered by user code: user code asks PaddlePaddle to save a model. -- Convert model from the snapshot: model being converted from - pservers' periodic snapshot. In this way, the user can cancel a job - at any time, and still have a relatively fresh model (we snapshot - around every 5 minutes). +- Convert model from the checkpoint: model being converted from + pservers' periodic checkpoint. In this way, the user can cancel a + job at any time, and still have a relatively fresh model (we + checkpoint around every 5 minutes). -### Save Model Triggered by User Code +### Trainer Saving Model vs. Pservers Saving Model Both trainers and pservers have access to the model. So the model can be saved from a trainer or pservers. We need to decide on where the model is saved from. -#### Dense Model vs. Sparse Model +#### Dense Update vs. 
Sparse Update + +There are two types of model update methods: dense update and sparse +update (when the parameter is configured to be sparse). + +- Dense update + + Every trainer has it's own full copy of the model. Every model + update will update the entire model. + +- Sparse update + + The training input is sparse, and the trainer does not have the + entire model. It will only download the sub-model necessary related + to the input. When updating the model, only the sub-model related to + the training input is updated. -There are two types of model: dense and sparse model (when the -parameter is configured to be sparse). Pservers always jointly have -the entire model at any given time. Trainers only have the entire -dense model, but only have a fraction of the sparse model at any given -time. #### Pservers Saving Model @@ -32,15 +42,15 @@ The benefit of letting pservers save model is they have the entire model all the time. However, since pservers are on different nodes, it requires a merging process to merge model shards into the same model. Thus requires the pservers to write models to a distributed -filesystem, making the snapshot shards visible to the merge program. +filesystem, making the checkpoint shards visible to the merge program. #### Trainer Saving Model The benefit of letting one trainer to save the model is it does not require a distributed filesystem. And it's reusing the same save model -logic when the trainer is training locally - except when training -sparse model, the trainer needs to download the entire sparse model -during the saving process. +logic when the trainer is training locally - except when doing sparse +update, the trainer needs to download the entire model during the +saving process. #### Conclusion @@ -49,7 +59,7 @@ and is an intuitive extension to training locally, we decide to let the trainer save the model. 
-### Convert Model from Snapshot +### Convert Model from Checkpoint TODO @@ -86,15 +96,15 @@ when save model is taking place. When saving a dense model, the trainer uses the local model. Pservers does not need to pause model update. -When saving a sparse model. The trainer needs to download the entire -sparse model while saving. To get the most accurate model, the model -update needs to be paused before the download starts and resumed after -the download finishes. Otherwise, the trainer gets a model that is +When doing sparse update. The trainer needs to download the entire +model while saving. To get the most accurate model, the model update +needs to be paused before the download starts and resumed after the +download finishes. Otherwise, the trainer gets a model that is "polluted": some part of the model is old, some part of the model is new. It's unclear that the "polluted" model will be inferiod due to the stochastic nature of deep learning, and pausing the model update will -add more complexity to the system. Since supporting sparse model is a +add more complexity to the system. Since supporting sparse update is a TODO item. We defer the evaluation of pause the model update or not during saving model to the future. From 62e582e8109ff08089f72e88511162fe51ae031f Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Fri, 30 Jun 2017 18:23:46 +0000 Subject: [PATCH 3/3] polish wording and grammar. --- doc/design/cluster_train/save_model.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/doc/design/cluster_train/save_model.md b/doc/design/cluster_train/save_model.md index 76ac8d8387073..b70f00176b670 100644 --- a/doc/design/cluster_train/save_model.md +++ b/doc/design/cluster_train/save_model.md @@ -15,13 +15,13 @@ ways from which user can obtain a model: ### Trainer Saving Model vs. Pservers Saving Model Both trainers and pservers have access to the model. So the model can -be saved from a trainer or pservers. 
We need to decide on where the -model is saved from. +be saved from a trainer or pservers. We need to decide where the model +is saved from. #### Dense Update vs. Sparse Update There are two types of model update methods: dense update and sparse -update (when the parameter is configured to be sparse). +update (when the model parameter is configured to be sparse). - Dense update @@ -48,15 +48,15 @@ filesystem, making the checkpoint shards visible to the merge program. The benefit of letting one trainer to save the model is it does not require a distributed filesystem. And it's reusing the same save model -logic when the trainer is training locally - except when doing sparse -update, the trainer needs to download the entire model during the -saving process. +logic when training locally - except when doing sparse update, the +trainer needs to download the entire model during the saving process. #### Conclusion Given trainer saving model does not require a distributed filesystem, -and is an intuitive extension to training locally, we decide to let -the trainer save the model. +and is an intuitive extension to trainer saving model when training +locally, we decide to let the trainer save the model when doing +distributed training. ### Convert Model from Checkpoint @@ -84,16 +84,16 @@ save the model. Each trainer will be given the directory to save the model. The elected trainer will save the model to -`given-directory/trainerID`. Since the tainerID is unique, this would -prevent concurrent save to the same file when multiple trainers are -elected to save the model when split-brain problem happens. +`given-directory/trainerID`. Since the trainer ID is unique, this +would prevent concurrent save to the same file when multiple trainers +are elected to save the model when split-brain problem happens. ### What Happens When Model Is Saving It takes some time to save model, we need to define what will happen when save model is taking place. 
-When saving a dense model, the trainer uses the local model. Pservers +When doing dense update, the trainer uses the local model. Pservers does not need to pause model update. When doing sparse update. The trainer needs to download the entire @@ -103,7 +103,7 @@ download finishes. Otherwise, the trainer gets a model that is "polluted": some part of the model is old, some part of the model is new. -It's unclear that the "polluted" model will be inferiod due to the +It's unclear that the "polluted" model will be inferior due to the stochastic nature of deep learning, and pausing the model update will add more complexity to the system. Since supporting sparse update is a TODO item. We defer the evaluation of pause the model update or not
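Note on the "Trainer Election" and "Model Save Path" sections of the design doc in this series: the non-etcd election rule and the `given-directory/trainerID` save path can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not PaddlePaddle code. The names `is_elected`, `model_save_path`, and `save_model` are hypothetical, the model is stood in for by raw bytes, and treating `given-directory/trainerID` as a single file is an assumption the doc does not settle. The etcd election itself is not shown; the `elected` flag stands in for its outcome.

```python
import os
import tempfile
import uuid


def is_elected(trainer_id: str) -> bool:
    # Non-etcd rule from the design doc: the trainer whose ID is "0" saves.
    return trainer_id == "0"


def model_save_path(save_dir: str, trainer_id: str) -> str:
    # `given-directory/trainerID`; treating it as one file is an assumption.
    return os.path.join(save_dir, trainer_id)


def save_model(save_dir: str, trainer_id: str, model_bytes: bytes, elected: bool):
    # Hypothetical helper: only a trainer that believes it is elected writes.
    if not elected:
        return None
    os.makedirs(save_dir, exist_ok=True)
    path = model_save_path(save_dir, trainer_id)
    with open(path, "wb") as f:
        f.write(model_bytes)
    return path


if __name__ == "__main__":
    save_dir = tempfile.mkdtemp()
    # Split-brain stand-in: two trainers with etcd-style random UUIDs
    # both believe they were elected.
    a, b = str(uuid.uuid4()), str(uuid.uuid4())
    path_a = save_model(save_dir, a, b"model-from-a", elected=True)
    path_b = save_model(save_dir, b, b"model-from-b", elected=True)
    print(path_a != path_b)
```

Because each trainer writes under its own ID, the two trainers in the split-brain stand-in end up with different paths and never write to the same file, which is the property the "Model Save Path" section relies on. Choosing which of the saved models to keep afterwards is outside what the doc specifies.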