MP overlap for 1f1b #57446
Conversation
… backward-forward-overlap-for-1f1b
Your PR was submitted successfully. Thank you for your contribution to this open-source project!
LGTM overall
@@ -354,6 +355,17 @@ def _apply_post_optimization(
        )
        params_grads = self._pass_context.get_attr("params_grads")

        mp_async_allreduce_in_backward = os.getenv(
            "FLAGS_mp_async_allreduce_in_backward"
You could use a config entry as the switch instead, e.g.:
config["use_sharding"] = self._strategy.sharding.enable
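A minimal sketch of that suggestion, assuming a strategy field named mp_async_allreduce_in_backward existed (the PR currently reads the FLAGS_mp_async_allreduce_in_backward environment variable instead):

# Hypothetical sketch: drive the switch from the dist strategy rather than
# an environment variable. "mp_async_allreduce_in_backward" as a strategy
# attribute is an assumed name, not an existing one.
config = {}
config["use_sharding"] = self._strategy.sharding.enable
config["mp_async_allreduce_in_backward"] = getattr(
    self._strategy, "mp_async_allreduce_in_backward", False
)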
@@ -34,6 +36,14 @@
]


# NOTE: Here a stream is just a name;
# it is up to the executor to create the actual streams given that name.
class AutoParallelStreamType(Enum):
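Since the diff is truncated here, a sketch of what such a stream-type enum might contain; the member names and string values below are assumptions, not necessarily what this PR adds:

from enum import Enum


class AutoParallelStreamType(Enum):
    # Each value is only a stream name; the executor decides how to map a
    # name onto a concrete device stream when building the plan.
    CALC_STREAM = "default"         # default computation stream (assumed)
    MP_STREAM = "auto_parallel_mp"  # dedicated MP communication stream (assumed)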
The data-parallel allreduce stream (created in data_parallel_pass) may not be controlled by this type.
LGTM for pass_utils.py
LGTM for cost model and cluster
forward_job = core.Job("forward")
forward_job.set_micro_batch_id(forward_micro_batch_id)
job_list.append(forward_job)
for job_type in self.jobs_in_stable_phase:
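A rough sketch of the surrounding logic, assuming jobs_in_stable_phase holds job-type names such as "backward" and "forward" (the loop bound and counters below are illustrative, not the exact PR code):

# Illustrative sketch: in the stable 1F1B phase, each step appends one
# backward job and one forward job for the in-flight micro-batches.
job_list = []
forward_micro_batch_id = 0
backward_micro_batch_id = 0
for _ in range(num_steps_in_stable_phase):  # assumed variable name
    for job_type in self.jobs_in_stable_phase:  # e.g. ["backward", "forward"]
        job = core.Job(job_type)
        if job_type.startswith("backward"):
            job.set_micro_batch_id(backward_micro_batch_id)
        else:
            job.set_micro_batch_id(forward_micro_batch_id)
        job_list.append(job)
    backward_micro_batch_id += 1
    forward_micro_batch_id += 1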
The 1F1B-Overlap pass and the 1F1B pass can be decoupled, since their schedules are different.
LGTM for pipeline_scheduler_pass
LGTM overall
* B-F overlap
* Add column_parallel_linear_backward_overlapping
* Add cost model
* Insert reshape for ColumnParallelLinearBackwardOverlappingPass
* Add cross-program event dependency
* Refine split program in _backward_forward_overlap
* Add empirical op cost
* Add NOTE
* Remove some redundant codes
* Remove some redundant codes
* Fix UTs
PR types
Performance optimization
PR changes
Others
Description
PCard-71568
This PR implements the following two optimizations in the static-graph semi-automatic parallel mode, aiming to improve end-to-end performance of large models by hiding MP communication with multiple streams.
[Overlapping the backward phase with the forward of another micro-batch under 1F1B]
The gains in the current state are modest: less than 1/3 of the backward allreduce communication can be hidden, and small-scale end-to-end tests show only about a 1% gain. Unlocking more headroom depends on resolving the following issues:
Major feature work
(1) Develop a more accurate cost model that precisely estimates the time of each op, to support fine-grained multi-stream op scheduling (a rough sketch follows this list).
(2) Develop a cost-model-based adaptive op-splitting mechanism that splits forward ops whose cost far exceeds the backward communication into multiple finer-grained ops, so that scheduling an op early does not lengthen the backward computation and enlarge the pipeline bubble.
(3) Design and implement a stream-priority assignment scheme to reduce multi-stream scheduling and synchronization overhead on the device side.
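As a rough illustration of points (1) and (2), a hypothetical sketch of an empirical per-op cost table driving the split decision; all op types, numbers, and function names below are made-up assumptions, not the PR's actual cost model:

# Hypothetical sketch: an empirical per-op cost table (microseconds) used to
# decide when a forward op should be split for finer-grained overlap.
EMPIRICAL_OP_COST_US = {
    "matmul_v2": 120.0,
    "c_allreduce_sum": 80.0,
    "elementwise_add": 10.0,
}


def estimate_op_cost(op_type, default_us=20.0):
    """Return the empirical cost estimate for an op type, in microseconds."""
    return EMPIRICAL_OP_COST_US.get(op_type, default_us)


def should_split_forward_op(op_type, backward_comm_cost_us, ratio=2.0):
    """Split a forward op when it runs much longer than the backward
    communication it is supposed to hide, to avoid stretching the backward
    computation and enlarging the pipeline bubble."""
    return estimate_op_cost(op_type) > ratio * backward_comm_cost_us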
Smaller adaptations
(1) Run PP recv communication on its own stream, so that a forward recv does not block compute ops on the computation stream (see the sketch after this list).
(2) Eliminate redundant c_identity ops, so that a c_identity copy whose cost exceeds the MP communication cannot disrupt the scheduling.
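A hedged sketch of how such ops could be routed onto a non-default stream, assuming the executor honors a per-op execution_stream dist attribute (an assumption here) together with the AutoParallelStreamType names from pass_utils.py:

# Hedged sketch: move forward recv and backward MP allreduce ops off the
# default computation stream. The "execution_stream" dist attribute is an
# assumed interface, not necessarily the exact one this PR uses.
for op in main_program.global_block().ops:
    if op.type in ("recv_v2", "c_allreduce_sum"):
        op.dist_attr.execution_stream = AutoParallelStreamType.MP_STREAM.value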
[Overlapping MP communication with matmul_grad computation in the backward phase]
Compared with the fine-grained backward-forward scheduling and overlap above, overlapping MP communication with matmul_grad computation is only a small incidental optimization that aligns with the dynamic-graph implementation: #55662
This is already supported via column_parallel_linear_backward_overlapping (a conceptual sketch follows below), with about a 1% gain for GPT-3 6.7B under MP2-PP4; gains on larger-scale jobs remain to be measured. Neither optimization is in its final form: since the related code changes are extensive, this PR is merged first to avoid mutual dependencies and conflicts with other static-graph optimization work, with further iteration and tuning to follow.
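To illustrate the idea behind column_parallel_linear_backward_overlapping: in a column-parallel linear layer, grad_x is a partial sum that needs an MP allreduce, while grad_w needs no communication, so the allreduce can run asynchronously underneath the grad_w matmul. A conceptual dynamic-graph-style sketch, assuming recent Paddle collective APIs (the function and argument names are illustrative; the pass itself rewrites the static-graph program instead):

import paddle
import paddle.distributed as dist


def column_parallel_linear_backward(x, weight, grad_out, mp_group):
    # grad_x = grad_out @ W^T is a partial sum and must be allreduced across
    # the MP group; sync_op=False launches it asynchronously and returns a
    # task handle.
    grad_x = paddle.matmul(grad_out, weight, transpose_y=True)
    task = dist.all_reduce(grad_x, group=mp_group, sync_op=False)
    # grad_w = X^T @ grad_out needs no communication, so this matmul overlaps
    # with the in-flight allreduce on the communication stream.
    grad_w = paddle.matmul(x, grad_out, transpose_x=True)
    task.wait()
    return grad_x, grad_w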