
MP overlap for 1f1b #57446

Merged: 14 commits merged into PaddlePaddle:develop on Sep 19, 2023

Conversation

@From00 (Contributor) commented on Sep 18, 2023

PR types

Performance optimization

PR changes

Others

Description

PCard-71568

This PR implements the following two optimizations in static-graph semi-auto parallelism, aiming to improve end-to-end performance of large models by hiding MP (model-parallel) communication with multiple streams.

[Overlap of the backward stage with another micro-batch's forward under 1F1B]
In its current state the gain is limited: less than one third of the backward allreduce communication can be hidden, and the end-to-end gain measured at small scale is only about 1%. Unlocking more of the optimization space depends on resolving the following issues (a rough schedule sketch follows these lists):

  • Major feature work
    (1) Develop a more accurate cost_model that precisely estimates each operator's cost, to support fine-grained multi-stream scheduling of operators
    (2) Develop a cost_model-based adaptive operator-splitting mechanism that splits forward operators whose cost far exceeds the backward communication into multiple finer-grained operators, so that hoisting a large operator does not stretch the backward computation and enlarge the pipeline bubble
    (3) Design and implement a stream-priority assignment scheme to reduce multi-stream scheduling and synchronization overhead on the device side

  • Minor adaptations and optimizations
    (1) Run PP recv communication on a separate stream so that the forward recv does not block compute operators on the computation stream
    (2) Eliminate redundant c_identity ops, so that c_identity copies whose cost exceeds the MP communication do not disturb the scheduling
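
To make the intended interleaving concrete, here is a minimal, purely illustrative sketch of a stable-phase schedule in which each micro-batch's backward is split into chunks that alternate with slices of the next micro-batch's forward, so the MP allreduce issued by the backward chunks can proceed on a communication stream while the forward computes. Job names, the chunk count, and the plain-tuple representation are assumptions for illustration, not the pass's actual API; warmup and cooldown phases are omitted.

```python
# Illustrative sketch only: plain tuples stand in for executor Job objects,
# and the "backward_b{c}" / "forward_f{c}" names are invented for this example.
def stable_phase_jobs(num_micro_batches, chunks=2):
    jobs = []
    for mb in range(num_micro_batches - 1):
        for c in range(chunks):
            # Backward chunk of micro-batch mb: its MP allreduce can be issued
            # on a communication stream...
            jobs.append((f"backward_b{c}", mb))
            # ...and overlapped with a slice of the next micro-batch's forward.
            jobs.append((f"forward_f{c}", mb + 1))
    # The last backward has no later forward left to overlap with.
    jobs.append(("backward", num_micro_batches - 1))
    return jobs

print(stable_phase_jobs(4))
# [('backward_b0', 0), ('forward_f0', 1), ('backward_b1', 0), ('forward_f1', 1), ...]
```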

[Overlap of MP communication with matmul_grad computation in the backward stage]
Compared with the fine-grained scheduling and overlap of backward with forward, overlapping MP communication with matmul_grad computation is only a small incidental optimization, aligned with the dynamic-graph implementation: #55662
It is already supported via column_parallel_linear_backward_overlapping; on GPT-3 6.7B with MP2-PP4 it yields roughly a 1% gain, and gains on larger-scale tasks remain to be measured.
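
As a rough illustration of the idea behind column_parallel_linear_backward_overlapping (a sketch in dynamic-graph style, not the static-graph pass itself; shapes, the process group, and the use of the async sync_op=False collective API are assumptions): in the backward of a column-parallel linear, the allreduce needed for the input gradient is independent of the weight-gradient matmul, so it can be launched first and overlapped with that matmul.

```python
# Dynamic-graph-style sketch of the overlap; the actual pass rewrites the
# static program. mp_group and tensor shapes are illustrative.
import paddle
import paddle.distributed as dist

def column_parallel_linear_backward(x, w, dy, mp_group):
    # dx = dy @ W^T must be summed (allreduced) across the MP group;
    # dw = x^T @ dy is purely local to each rank.
    dx = paddle.matmul(dy, w, transpose_y=True)
    task = dist.all_reduce(dx, group=mp_group, sync_op=False)  # start comm first
    dw = paddle.matmul(x, dy, transpose_x=True)  # overlaps with the allreduce
    task.wait()  # dx is complete only after the communication finishes
    return dx, dw
```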

Neither optimization is in its final form. Because the related code changes are extensive, this PR is merged first to avoid mutual dependencies and conflicts with other static semi-auto-parallel optimization work; further iteration and tuning will follow.

@paddle-bot (bot) commented on Sep 18, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JZ-LIANG (Contributor) left a comment:

LGTM overall

@@ -354,6 +355,17 @@ def _apply_post_optimization(
        )
        params_grads = self._pass_context.get_attr("params_grads")

        mp_async_allreduce_in_backward = os.getenv(
            "FLAGS_mp_async_allreduce_in_backward"
Review comment (Contributor):

Could use a config entry as the switch, e.g.:
config["use_sharding"] = self._strategy.sharding.enable
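
A small sketch of what that could look like (the mp_async_allreduce_in_backward strategy field used below is hypothetical; only use_sharding and strategy.sharding.enable come from the comment above):

```python
# Sketch of the reviewer's suggestion: drive pass switches from the
# distributed strategy object instead of FLAGS_* environment variables.
def build_pass_config(strategy):
    config = {}
    config["use_sharding"] = strategy.sharding.enable
    # Hypothetical strategy field standing in for FLAGS_mp_async_allreduce_in_backward.
    config["mp_async_allreduce_in_backward"] = getattr(
        strategy, "mp_async_allreduce_in_backward", False
    )
    return config
```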

@@ -34,6 +36,14 @@
]


# NOTE: Here a stream is just represented by a name; it is up to the
# executor to create the actual streams for the given names.
class AutoParallelStreamType(Enum):
Review comment (Contributor): the data-parallel allreduce stream (in data_parallel_pass) may be outside the control of this enum type.

@heavyrain-lzy (Contributor) left a comment:

LGTM for pass_utils.py

@Caozhou1995 (Contributor) left a comment:

LGTM for cost model and cluster

forward_job = core.Job("forward")
forward_job.set_micro_batch_id(forward_micro_batch_id)
job_list.append(forward_job)
for job_type in self.jobs_in_stable_phase:
Review comment (Contributor):
1F1B-Overlap-Pass and 1F1B-Pass can be decoupled, because their schedules are different.

@zhaoyinglia (Contributor) left a comment:

LGTM for pipeline_scheduler_pass

@zhiqiu (Contributor) left a comment:

LGTM overall

@From00 merged commit 7264bb7 into PaddlePaddle:develop on Sep 19, 2023
Frida-a pushed a commit to Frida-a/Paddle that referenced this pull request Oct 14, 2023
* B-F overlap

* Add column_parallel_linear_backward_overlapping

* Add cost model

* Insert reshape for ColumnParallelLinearBackwardOverlappingPass

* Add cross-program event dependency

* Refine split program in _backward_forward_overlap

* Add empirical op cost

* Add NOTE

* Remove some redundant codes

* Remove some redundant codes

* Fix UTs
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023