Add model average optimizer for fluid #9082

Merged
13 commits merged into PaddlePaddle:develop from average_model
Mar 22, 2018

Contributor

@wanghaoshuang wanghaoshuang commented Mar 14, 2018

fix #9172
The results of some experiments are attached in #9172.

@wanghaoshuang wanghaoshuang changed the title Add sum accumulator with window for model average Add model average optimizer for fluid Mar 18, 2018
Contributor

@qingqing01 qingqing01 left a comment

The review has not been completed yet.

"accumulating sums of parameter values with the same shape as "
"input(param).");
AddInput("in_num_accumulates",
"Input(Tensor): The accumulating times of current window with "
Contributor

Tensor<int64_t>

Contributor Author

Done.

AverageAccumulatesOpMaker(OpProto* proto, OpAttrChecker* op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("param",
"Input(Tensor or LoDTensor): The parameter to be accumulated.");
Contributor

Input(Tensor or LoDTensor) -> (Tensor or LoDTensor)

There should be no "Input" before the opening parenthesis. See:

https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/mul_op.cc#L79

The same applies to the descriptions below.

Contributor Author

Done.

AddInput("param",
"Input(Tensor or LoDTensor): The parameter to be accumulated.");
AddInput("in_sum_1",
"Input(Tensor or LoDTensor): A tensor used to store the parameter "
Contributor

At this point, all the inputs and outputs are probably plain Tensors.

Contributor Author

Done.


AddComment(R"DOC(
AverageAccumulates Operator.
Accumulate the sum of the parameter within a sliding window. The size of the sliding window is determined by 'average_window', 'max_average_window' and 'min_average_window'.
Contributor

More details are needed to show how the averaging is done.

Contributor Author

Done.

using EigenVector = framework::EigenVector<T, MajorType, IndexType>;

template <typename DeviceContext>
void getAccumulators(const framework::ExecutionContext& ctx,
Contributor

getAccumulators -> GetAccumulators

Contributor Author

Done.

int64_t& old_num_accumulates);

template <typename DeviceContext>
void setAccumulators(const framework::ExecutionContext& ctx,
Contributor

setAccumulators -> SetAccumulators

Contributor Author

Done.

public:
void Compute(const framework::ExecutionContext& ctx) const override {
// It is used to avoid loss of precision
static const int64_t kMaxNumAccumulates = 16384;
Contributor

Is there a reference paper for setting kMaxNumAccumulates to 16384?

Contributor Author

16384 appears to be an empirically chosen value; there is no reference paper for it.
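
For context on the "loss of precision" concern, here is a minimal sketch (not taken from this PR) of the general float32 effect: once an accumulator is large enough, adding a comparatively small value no longer changes it. `to_f32` is a hypothetical helper that rounds a Python float to single precision with the standard `struct` module.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


big = to_f32(2.0 ** 24)     # 16777216.0 is exactly representable in float32
bumped = to_f32(big + 1.0)  # 16777217 is not representable in float32

# The increment is rounded away, so the float32 accumulator does not move.
print(bumped == big)
```

Capping how many values go into one buffer before it is moved elsewhere keeps each buffer small relative to the values being added, which is presumably the motivation for a constant such as kMaxNumAccumulates.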

Contributor

@qingqing01 qingqing01 left a comment

Excellent work!!

"before this batch with shape [1].");

AddAttr<float>("average_window",
"The rate of average window size relative to num_updates.");
Contributor

Set 0. as the default value here.

Contributor Author

Done.

AddAttr<float>("average_window",
"The rate of average window size relative to num_updates.");
AddAttr<int64_t>("max_average_window", "Maximum size of average window.");
AddAttr<int64_t>("min_average_window", "Minimum size of average window.");
Contributor

Set 10000L as the default value for min_average_window?

Contributor Author

Done.

out_sum_2_tensor.device(place) = in_sum_2_tensor;
out_sum_3_tensor.device(place) = in_sum_3_tensor;
if (num_updates % kMaxNumAccumulates == 0) {
out_sum_2_tensor.device(place) = in_sum_2_tensor + in_sum_1_tensor;
Contributor

Add a comment before line 87:

Move the sum to a different buffer to avoid loss of precision due to too many sums.

Contributor Author

Done.

if (num_accumulates >= min_average_window &&
num_accumulates >= std::min<int64_t>(max_average_window,
num_updates * average_window)) {
out_sum_3_tensor.device(place) = in_sum_1_tensor + in_sum_2_tensor;
Contributor

Add a comment before line 94:

Now the average window is too long, discard the old sum.

Contributor Author

Done.
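
The two branches discussed in the threads above can be sketched in plain Python with scalar "tensors". This is a simplified illustration, not the operator's implementation: the `state` dict layout, the `accumulate` and `average` helper names, the buffer resets, the counter updates, and the final read-out formula are assumptions that do not appear in the quoted excerpts. Only the two `if` conditions and the moves into sum_2 and sum_3 follow the quoted code.

```python
K_MAX_NUM_ACCUMULATES = 16384  # value quoted in the review thread


def accumulate(state, param, average_window, min_average_window,
               max_average_window):
    """One accumulation step on a dict of scalars (hypothetical layout)."""
    state["num_updates"] += 1
    state["num_accumulates"] += 1
    state["sum_1"] += param

    if state["num_updates"] % K_MAX_NUM_ACCUMULATES == 0:
        # Move the sum to a different buffer to avoid loss of precision
        # due to too many sums.
        state["sum_2"] += state["sum_1"]
        state["sum_1"] = 0.0  # assumption: sum_1 is cleared after the move

    threshold = min(max_average_window,
                    int(state["num_updates"] * average_window))
    if (state["num_accumulates"] >= min_average_window
            and state["num_accumulates"] >= threshold):
        # Now the average window is too long, discard the old sum.
        state["sum_3"] = state["sum_1"] + state["sum_2"]
        state["sum_1"] = 0.0  # assumption
        state["sum_2"] = 0.0  # assumption
        state["old_num_accumulates"] = state["num_accumulates"]  # assumption
        state["num_accumulates"] = 0  # assumption
    return state


def average(state):
    """Assumed read-out: all sums divided by the number of samples held."""
    total = state["sum_1"] + state["sum_2"] + state["sum_3"]
    count = state["num_accumulates"] + state["old_num_accumulates"]
    return total / count


state = dict(sum_1=0.0, sum_2=0.0, sum_3=0.0,
             num_updates=0, num_accumulates=0, old_num_accumulates=0)
for p in [1.0, 2.0, 3.0]:
    accumulate(state, p, average_window=0.5,
               min_average_window=2, max_average_window=4)
print(average(state))  # 2.0
```

With these toy settings the window restarts after the second step, so the third read-out covers the previous window (1.0 and 2.0) plus the new sample 3.0.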

self._append_average_accumulate_op(param)

def _add_average_apply_op(self, block, param_grad):
param = block.clone_variable(param_grad[0])
Contributor

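Why use clone here? Judging from the clone implementation, the Variable's name and stored content (Tensor) are the same, so why is clone needed? Could the original Variable be used directly?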

Contributor Author

@wanghaoshuang wanghaoshuang Mar 19, 2018

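When an op runs InferShape, it looks up its input variables in the current block. The clone_variable function is therefore needed to clone a copy of the variable desc into the current block and to set variable.block to the current block. Otherwise, InferShape fails with an "Input not found" error.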
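
A toy model of this lookup rule, for illustration only. `VarDesc`, `Block`, and `infer_shape` below are hypothetical stand-ins and not the framework's classes; the sketch only shows why a description with the same name has to be registered in the block where the op runs.

```python
class VarDesc:
    """Hypothetical variable description: a name, a shape, an owning block."""

    def __init__(self, name, shape, block):
        self.name, self.shape, self.block = name, shape, block


class Block:
    """Hypothetical block that resolves op inputs only from its own table."""

    def __init__(self, idx):
        self.idx = idx
        self.vars = {}

    def create_var(self, name, shape):
        var = VarDesc(name, shape, self)
        self.vars[name] = var
        return var

    def clone_variable(self, var):
        # Same name and shape, but registered in (and owned by) this block.
        return self.create_var(name=var.name, shape=var.shape)

    def infer_shape(self, input_name):
        if input_name not in self.vars:
            raise KeyError("Input not found: " + input_name)
        return self.vars[input_name].shape


main = Block(0)
w = main.create_var("fc.w", (4, 2))

apply_block = Block(1)
w_in_apply = apply_block.clone_variable(w)
print(apply_block.infer_shape("fc.w"))
```

Because the clone keeps the original name, it refers to the same runtime variable; only the description is duplicated.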

"""
assert isinstance(var, Variable)
return self.create_var(
name=var.name,
Contributor

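My understanding is that a "cloned" var and the input var would occupy two separate pieces of memory. Here the vars have the same name, so this looks more like "sharing" the same var.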


AddAttr<float>("average_window",
"The rate of average window size relative to num_updates.");
AddAttr<int64_t>("max_average_window", "Maximum size of average window.");
Contributor

Please change the comment here so that it tells users to set this manually to the total number of mini-batches in one pass/epoch.

Contributor Author

Done.

model_average.apply()
for data in test_reader():
exe.run(inference_program...)
model_average.restore(exe)
Contributor

The with model_average.apply() syntax could be used to hide the model_average.restore call.

Contributor Author

Done. Thx.
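
A minimal sketch of the pattern suggested above, built on the standard-library `contextlib.contextmanager`. `ToyModelAverage` and its dict-based parameters are hypothetical and are not the class added in this PR; the sketch only shows how `apply()` can guarantee that `restore()` runs when the `with` block exits.

```python
import contextlib


class ToyModelAverage:
    """Hypothetical stand-in: swaps averaged values in, then restores."""

    def __init__(self, params, averaged):
        self.params = params      # live parameter values, updated in place
        self.averaged = averaged  # averaged values to evaluate with
        self._backup = None

    def restore(self):
        self.params.update(self._backup)
        self._backup = None

    @contextlib.contextmanager
    def apply(self):
        self._backup = dict(self.params)
        self.params.update(self.averaged)
        try:
            yield
        finally:
            # Runs even if evaluation raises, so training weights come back.
            self.restore()


params = {"w": 1.0}
model_average = ToyModelAverage(params, {"w": 0.25})

with model_average.apply():
    inside = params["w"]  # averaged value is visible during evaluation
after = params["w"]       # trained value is back after the block
print(inside, after)
```

The `try`/`finally` is the design point: the caller cannot forget the restore step, and an exception during evaluation does not leave averaged weights in place.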

"shape [1].");
AddInput("in_num_updates",
"Input(Tensor): The total number of batches used by training "
"before this batch with shape [1].");
Contributor

@qingqing01 qingqing01 Mar 19, 2018

in_num_accumulates
in_old_num_accumulates
in_num_updates

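When these three scalars are initialized with fill_constant, the force_cpu attribute can be used to keep them on the CPU at all times, so that no copy is needed during GPU computation.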

Contributor Author

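If an op is implemented by inheriting from OperatorWithKernel, then before execution the framework checks whether all inputs are on the expected device and transfers them there if they are not.
However, the automatic transfer provided by OperatorWithKernel does not support inputs and outputs that share memory.
Not inheriting from OperatorWithKernel would probably require a fair amount of rework, which can be left to a follow-up PR.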

Contributor

Understood, let's keep it as it is for now. I think a better approach would be to support variables such as Variable<int/float> as op inputs.

Contributor

@qingqing01 qingqing01 left a comment

Since the model average feature uses a lot of memory, a do_average_in_cpu option needs to be supported in the next PR.

Contributor

@qingqing01 qingqing01 left a comment

Please create an issue for these two problems before merging this PR.

params_grads: A list of parameter-grad variable pairs.
average_window_rate: The rate of average window.
min_average_window: The minimum size of average window.
max_average_window: The maximum size of average window.
Contributor

The user documentation needs refining: it should tell users how to set average_window_rate, min_average_window, max_average_window, and so on.
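
As a starting point for such documentation, the window condition quoted earlier in this review (num_accumulates >= min_average_window and num_accumulates >= min(max_average_window, num_updates * average_window)) can be restated as a single threshold. The helper name `window_threshold` is hypothetical, and the values 0.25 and 20000 are illustrative only; 10000 is the min_average_window default proposed above.

```python
def window_threshold(num_updates, average_window_rate,
                     min_average_window, max_average_window):
    """Accumulation count at which the window restarts, at this step."""
    scaled = int(num_updates * average_window_rate)
    return max(min_average_window, min(max_average_window, scaled))


# Early in training the minimum dominates, later the rate, then the maximum.
early = window_threshold(1000, 0.25, 10000, 20000)
middle = window_threshold(60000, 0.25, 10000, 20000)
late = window_threshold(1000000, 0.25, 10000, 20000)
print(early, middle, late)
```

Read this way, min_average_window and max_average_window bound the window length, while average_window_rate lets it grow in proportion to the number of updates between those bounds.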

@wanghaoshuang wanghaoshuang merged commit b594251 into PaddlePaddle:develop Mar 22, 2018
@wanghaoshuang wanghaoshuang deleted the average_model branch May 20, 2022 03:59
Development

Successfully merging this pull request may close these issues.

Add model average optimizer for fluid
2 participants