Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix several bugs for enabling Paddle to train with CINN. #36739

Merged
merged 8 commits into from
Oct 28, 2021

Conversation

wzzju
Copy link
Contributor

@wzzju wzzju commented Oct 26, 2021

PR types

Bug fixes

PR changes

Others

Describe

#36739 (本PR) 和 #36698 均用于解决Paddle训练接入CINN全流程打通过程中遇到的问题。所有问题列举如下:

  • build_cinn_pass中找到并创建的子图中可能存在ControlVar,这些ControlVar的VarDesc为NULL,需要对这种corner case进行特殊处理,方案详见 Fix the null ptr bug in build_cinn_pass. #36698
  • build_cinn_pass放在ParallelExecutorPassBuilder的最后加入,会与后面的显存优化Pass功能进行冲突,为了避免与其他Pass功能冲突,将build_cinn_pass提前至第一个Pass运行。
  • build_cinn_pass在创建CinnLaunchOp后,需要为其设置op_role属性值,否则在其之后运行的Pass会报错找不到改属性。
  • 修正build_cinn_pass的子图创建逻辑,处理输出节点亦可作为子图内部输入的情况。
  • 简化build_cinn_pass节点之间连接关系,以及无用节点的删除逻辑。
  • CinnGraphSymbolization符号化过程使用的拓扑序不能用框架已有的代码(根据Node ID大小排序),因此为其专门重写了一个子图拓扑排序功能。
  • 在创建Cinn Scope时需要完成Paddle Parameter名称到Cinn变量名称的映射,否则CinnLaunchOp执行时会出现找不到权重Tensor的问题。
  • CinnLaunchOp执行时也需要Paddle输入到Cinn变量的映射,因此需要完善FeedOpMapper的功能,详见 fix feed opmapper not AddVarModelToProgram CINN#515
  • 更新test_parallel_executor_run_cinn.py的内容,使用add + relu且固定数值的模型运算逻辑(方便精度对齐),并添加调试所需的信息。

@paddle-bot-old
Copy link

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

set_cinn_flag(False)


if __name__ == '__main__':
unittest.main()
do_test(tempfile.mkdtemp(prefix="dots_"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种debug信息还用保留么?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -136,6 +153,18 @@ std::unique_ptr<Graph> CreateNewSubGraph(const GraphNodeSet& cluster,
}
old_var2new_var[var] = sub_node;
}
for (auto* var : cluster_inputs) {
if (var->Var()) {
auto* sub_node = subgraph->CreateVarNode(var->Var());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这儿为啥不要var->Var() == nullptrCreateEmptyNode啊?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为子图不需要这种var_desc为null的节点,CINN编译的时候也用不到。而这种空var只是为了加上依赖关系,在修改后的大图上已经有提现。


def test_run_with_cinn(self):
do_test(self.tmpdir)
set_cinn_flag(False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这儿set_cinn_flag(False)应该删了,不然后面的测试都不会运行了

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update.

exe.run(startup_program)

build_strategy = paddle.static.BuildStrategy()
build_strategy.debug_graphviz_path = os.path.join(dot_save_dir, "viz")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这些deubg信息没必要保留吧?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

调试信息还是保留吧,这个类会处理这些保存的文件,退出时即删除。

@@ -130,12 +147,25 @@ std::unique_ptr<Graph> CreateNewSubGraph(const GraphNodeSet& cluster,
for (auto* var : cluster_internals) {
Node* sub_node;
if (var->Var() == nullptr) {
// TODO(wzzju): If this case occurs, there maybe bugs when using CINN.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个bug还在么?如果还在是不是最好弄个PADDLE_ENFORCE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@wzzju wzzju changed the title Add graph debug infos. Fix several debugs for enabling Paddle with CINN. Oct 28, 2021
@wzzju wzzju changed the title Fix several debugs for enabling Paddle with CINN. Fix several bugs for enabling Paddle to train with CINN. Oct 28, 2021
Copy link
Contributor

@thisjiang thisjiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wzzju wzzju merged commit c93331c into PaddlePaddle:develop Oct 28, 2021
ghost pushed a commit to piotrekobi/Paddle that referenced this pull request Nov 3, 2021
…e#36739)

* Update the content of `test_parallel_executor_run_cinn.py`.

* Fix some bugs in the topological sort and `CreateNewSubGraph`.

* Update the CINN commit id used by Paddle.

* Update the unit test to `add+relu`.

* Update according to reviewers' suggestion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants