add checkpoint util class and implement #10532
Conversation
@@ -0,0 +1,214 @@
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
I see there are load_op and load_combine_op and the corresponding saving ops; on the Python side you can also use fluid.io.save_persistables to save all persistable variables.
To make save_persistables equivalent to saving a checkpoint, make sure the state variables are all persistable: step counters, learning rates, learning-rate moments, etc.
So can you reuse those ops instead of writing new ones?
load_op and save_op are designed for LoDTensor variables, but a checkpoint will save variables that are not only LoDTensors, and checkpoint has some particular arguments of its own.
At present, the checkpoint load/save ops and the plain load/save ops have no clear-cut distinction.
I think it's better to reuse the current operators; maybe checking the variable type will be fine.
So which other variable types are saved in the checkpoint? "RAW" types and "feed"/"fetch" variables may not need to be saved.
@@ -187,6 +187,7 @@ void ListenAndServOp::RunSyncLoop(framework::Executor *executor,
for (auto &var : sparse_vars) {
  var->GetMutable<framework::SelectedRows>()->mutable_rows()->clear();
}
This change may not be necessary.
I am sorry about it, I will revert it later.
if checkpoint_dir and self.is_chief:
    program.global_block().create_var(
        name=SERIAL_VAR_NAME,
what is a "serial"?
serial is a serial number, like 0, 1, 2, ..., 100; each time Paddle needs to save a checkpoint, the serial auto-increments.
If everything goes well, the biggest serial number will be the one used when loading a checkpoint.
Calling it a checkpoint ID might be better for understanding.
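The serial lookup described above could be sketched roughly as follows (a sketch only; the helper name and the _SUCCESS marker convention are taken from this thread, not the PR's actual code):

```python
import os

SUCCESS_MARK_FILENAME = "_SUCCESS"  # completeness marker file


def get_latest_checkpoint_serial(checkpoint_dir):
    """Return the largest serial whose directory holds a _SUCCESS marker, or -1."""
    latest = -1
    if not os.path.isdir(checkpoint_dir):
        return latest
    for name in os.listdir(checkpoint_dir):
        if not name.isdigit():
            continue  # checkpoint directories are named by serial: 0, 1, 2, ...
        marker = os.path.join(checkpoint_dir, name, SUCCESS_MARK_FILENAME)
        if os.path.isfile(marker):
            latest = max(latest, int(name))
    return latest
```

Skipping directories without the marker means a crash mid-save never corrupts the "latest" lookup.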
save_vars = []
for var in self.origin_program.list_vars():
    if self._is_persistable(var):
can use fluid.io.save_persistables
serial_number = self._get_lastest_checkpoint_dir(checkpoint_load_dir)

s_prog.global_block().append_op(
How does the current parameter server know which parameter block to load?
# is_chief (no.0 trainer) for checkpoint
# the no.0 trainer will save all variables and its own reader offset to checkpoint
# other trainers will save their own reader offset to checkpoint
self.is_chief = trainer_id == 0
_is_chief
I will fix it.
except ValueError:
    return -1

success_path = os.path.join(checkpoint_dir, cur_dir, SUCCESS)
what is success_path used for?
We need a tag to indicate that the checkpoint content is correct, so I define a tag named _SUCCESS: when checkpoint_save_op has saved all the needed variables successfully, it writes an empty file named _SUCCESS at the end.
Because of this, when a pserver/trainer needs to load a checkpoint, it checks for _SUCCESS first.
If you only need to know when the saving ends, just wait for the executor to return; raising any error that may cause the saving to fail is OK, I think.
If some exception happens while the executor is running, the executor does not return any information, so how do we know the saving succeeded?
You can catch that exception, I think; please give it a try, this will make the code simpler.
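The _SUCCESS handshake discussed in this thread can be sketched like this (helper names are illustrative, not the PR's actual API):

```python
import os

SUCCESS_MARK_FILENAME = "_SUCCESS"


def write_success(dirname):
    # Touch an empty marker file only after every variable was saved,
    # so a partially written checkpoint never carries the marker.
    with open(os.path.join(dirname, SUCCESS_MARK_FILENAME), "w"):
        pass


def is_checkpoint_complete(dirname):
    # A loader trusts a checkpoint directory only if the marker exists.
    return os.path.isfile(os.path.join(dirname, SUCCESS_MARK_FILENAME))
```

The marker-file pattern survives process crashes, which a catch-the-exception approach alone cannot: a killed process raises nothing, but it also leaves no _SUCCESS behind.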
python/paddle/fluid/io.py
Outdated
save_secs=600,
main_program=None):
    """
    Save Variables to Checkpint Dir
Checkpint => Checkpoint
Dir => Directory
done
python/paddle/fluid/io.py
Outdated
def save_checkpoint(executor,
                    dirname,
                    keep_max=3,
keep_max => max_num_checkpoints
done
python/paddle/fluid/io.py
Outdated
def save_checkpoint(executor,
                    dirname,
                    keep_max=3,
                    save_secs=600,
save_secs => interval
done
python/paddle/fluid/io.py
Outdated
serial = _get_lastest_checkpoint_dir(dirname) + 1
cur_dir = os.path.join(dirname, str(serial))
# save_persistables(executor, cur_dir, main_program)
No commented-out code, please.
done
python/paddle/fluid/io.py
Outdated
def save_checkpoint(executor,
                    dirname,
Can we use the current working directory as the default?
done
python/paddle/fluid/io.py
Outdated
@@ -454,3 +452,149 @@ def get_parameter_value_by_name(name, executor, program=None):
program = default_main_program()
var = program.global_block().var(name)
return get_parameter_value(var, executor)

SUCCESS = "_SUCCESS"
SUCCESS = SUCCESS_MARK_FILENAME
done
python/paddle/fluid/io.py
Outdated
if not os.path.isdir(dirname):
    os.makedirs(dirname)

global BEGIN_SECS
Try not to use globals, please.
done
python/paddle/fluid/io.py
Outdated
serial = _get_lastest_checkpoint_dir(dirname) + 1
cur_dir = os.path.join(dirname, str(serial))
# save_persistables(executor, cur_dir, main_program)
save_vars(
Why call save_vars instead of save_persistables?
save_persistables cannot filter out gradient variables.
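A small predicate could do that filtering when passed to save_vars (a sketch; it assumes fluid's convention of suffixing gradient variable names with "@GRAD" and a Variable object exposing name and persistable):

```python
GRAD_SUFFIX = "@GRAD"  # assumed gradient-variable naming convention


def is_checkpoint_var(var):
    # Keep persistable variables but skip gradients, which
    # save_persistables alone cannot exclude.
    if not var.persistable:
        return False
    return not var.name.endswith(GRAD_SUFFIX)
```

Such a predicate would be handed to save_vars as its filter argument, which is exactly the flexibility save_persistables lacks.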
python/paddle/fluid/io.py
Outdated
'get_inference_program',
'save_vars', 'save_params', 'save_persistables', 'load_vars', 'load_params',
'load_persistables', 'save_inference_model', 'load_inference_model',
'get_inference_program', 'save_checkpoint', 'restore_checkpoint'
Maybe it's better to name it load_checkpoint or restore_from_checkpoint?
I will use load_checkpoint; restore_from_checkpoint is too long, I think.
done
python/paddle/fluid/io.py
Outdated
return True


def _lru_delete(dirname, keep_max=3):
keep_max => max_num_checkpoints
done
python/paddle/fluid/io.py
Outdated
def _lru_delete(dirname, keep_max=3):
    """
    retain checkpoint nums with keep_max
keep_max => max_num_checkpoints
done
python/paddle/fluid/io.py
Outdated
def _write_success(dirname):
    """
    write _SUCCESS to checkpoint dir
update _SUCCESS
_SUCCESS is the file name, not the var name.
python/paddle/fluid/io.py
Outdated
def _get_lastest_checkpoint_dir(checkpoint_dir):
    """
    get the biggest number in checkpoint_dir, which has _SUCCESS
update _SUCCESS
_SUCCESS is the file name, not the var name.
python/paddle/fluid/io.py
Outdated
Save Checkpoint will save persistable LodTensor variables from main_program in checkpoint directory,
directory named by serial number from 0 to (n -1), save_checkpoint use LRU strategy
to keep numbers of checkpoint directory, the numbers of checkpoint directory are max_num_checkpoints at most,
The interval time between two save_checkpoint must great than or equal to save_interval_secs.
The interval between two saved checkpoints must be greater than save_interval_secs.
done
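The LRU retention described in the docstring above might look roughly like this (a sketch under the assumption that checkpoint directories are named by bare serial numbers and max_num_checkpoints is at least 1):

```python
import os
import shutil


def lru_delete(dirname, max_num_checkpoints=3):
    # Sort the serial-numbered subdirectories and remove all but the
    # newest max_num_checkpoints of them (assumes max_num_checkpoints >= 1).
    serials = sorted(int(d) for d in os.listdir(dirname) if d.isdigit())
    for serial in serials[:-max_num_checkpoints]:
        shutil.rmtree(os.path.join(dirname, str(serial)))
```

Because serials only ever grow, "oldest" and "smallest serial" coincide, so a sort plus a slice is all the LRU bookkeeping needed.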
python/paddle/fluid/io.py
Outdated
def load_checkpoint(executor, dirname=None, main_program=None):
    """
    Load checkpoint from directory by executor,
directory => one directory
done
python/paddle/fluid/io.py
Outdated
def load_checkpoint(executor, dirname=None, main_program=None):
    """
    Load checkpoint from directory by executor,
    it will find lastest checkpoint file and load it auto.
latest => the most recent saved checkpoint
python/paddle/fluid/io.py
Outdated
os.path.join(dirname, str(serial)), save_interval_secs):
    return

serial = serial + 1
can use +=
done
python/paddle/fluid/io.py
Outdated
return

serial = serial + 1
cur_dir = os.path.join(dirname, str(serial))
The checkpoint directories will be saved as "1", "2", etc., which may not make sense to users. I think it's better to use names like "checkpoint_1", "checkpoint_2", and the "_SUCCESS" file can save a timestamp of when the checkpoint was saved.
done
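The naming scheme and timestamped marker suggested above could be sketched as follows (the helper names and prefix are illustrative, not the PR's final API):

```python
import os
import time

CHECKPOINT_PREFIX = "checkpoint"  # assumed directory-name prefix


def checkpoint_dir_name(serial):
    # "checkpoint_1", "checkpoint_2", ... reads better than bare numbers.
    return "%s_%d" % (CHECKPOINT_PREFIX, serial)


def write_success_with_timestamp(dirname):
    # Record when the checkpoint finished inside the _SUCCESS marker.
    with open(os.path.join(dirname, "_SUCCESS"), "w") as f:
        f.write(str(int(time.time())))
```

Storing the timestamp costs nothing and lets operators see at a glance when each checkpoint completed.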
LGTM!
LGTM again.
Adds a new feature related to #10376.