[Auto Parallel] Support semi-auto trainer and fit Llama2 training #7885
Conversation
Thanks for your contribution!
The changes here are fairly large, so I'll request changes first.
        )

        return optimizer

    def _wrap_dist_loader(self, train_dataloader):
Is it not used in dynamic mode?
Done, it's now used in both dynamic and static modes.
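For readers following the thread, here is a hedged sketch of what such a wrapper can look like on top of Paddle's semi-auto `shard_dataloader` API; the `self._get_meshes()` helper name is an assumption (built roughly like the `meshes.append(_get_mesh(pp_idx))` snippet quoted below), not necessarily the PR's exact code:

```python
# Hedged sketch, not the PR's exact implementation: wrap the train dataloader
# with Paddle's semi-auto shard_dataloader so each rank reads its "dp" slice.
import paddle.distributed as dist


def _wrap_dist_loader(self, train_dataloader):
    # self._get_meshes() is assumed to return one ProcessMesh per pipeline
    # stage (e.g. built by appending _get_mesh(pp_idx) for each stage).
    return dist.shard_dataloader(
        train_dataloader,
        meshes=self._get_meshes(),
        shard_dims="dp",  # split batches along the data-parallel mesh dimension
    )
```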
paddlenlp/trainer/auto_trainer.py
Outdated
            meshes.append(_get_mesh(pp_idx))
        return meshes

    def _wrap_dist_loader(self, train_dataloader):
What's the difference with the _wrap_dist_loader in run_pretrain_3D_auto.py?
paddlenlp/trainer/auto_trainer.py
Outdated
shard_dims="dp", | ||
) | ||
|
||
def _wrap_for_static(self, model, train_dataloader): |
It seems to be unused?
It's called in Trainer (paddlenlp/trainer/trainer.py) to wrap the model into a DistModel in static mode.
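For context, a rough sketch of that wrapping path, assuming Paddle's `paddle.distributed.to_static` entry point; the attribute names and the return shape here are illustrative, not the PR's exact code:

```python
# Hedged sketch of the static-mode wrapping path (self.criterion and
# self.optimizer are assumed attributes on the trainer).
import paddle.distributed as dist


def _wrap_for_static(self, model, train_dataloader):
    dist_loader = self._wrap_dist_loader(train_dataloader)
    # dist.to_static lowers the dynamic-graph layer plus loader, loss and
    # optimizer into a DistModel that executes the static semi-auto program.
    dist_model = dist.to_static(model, dist_loader, self.criterion, self.optimizer)
    return dist_model, dist_loader
```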
@@ -939,15 +877,16 @@ def forward(
         if position_ids is None:
             position_ids = paddle.arange(seq_length, dtype="int64").expand((batch_size, seq_length))
             # NOTE(zhaoyingli): infer spmd does not support [seq_len] --> [batch, seq_len] in data_parallel
-            position_ids = dist.shard_tensor(position_ids, get_mesh(), [dist.Shard(0), dist.Replicate()])
+            position_ids = dist.shard_tensor(position_ids, get_mesh(), [dist.Replicate(), dist.Replicate()])
Why change it to replicated?
Because in static mode, infer spmd does not yet support the "[seq_len] --> [batch, seq_len]" case.
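To spell out the two placements being discussed, here is a hedged, self-contained sketch; the mesh shape, rank ids, and tensor sizes are made up for illustration, and it assumes a 4-rank run under `python -m paddle.distributed.launch`:

```python
# Illustrative sketch of the placement change, assuming a 2-D ["dp", "mp"] mesh.
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])
position_ids = paddle.arange(1024, dtype="int64").expand((4, 1024))

# Old placement: shard the batch dim over "dp". This relies on infer-spmd
# propagating the [seq_len] -> [batch, seq_len] expansion, which static mode
# does not support yet.
# position_ids = dist.shard_tensor(position_ids, mesh, [dist.Shard(0), dist.Replicate()])

# New placement: fully replicated on both mesh dims, so every rank holds the
# whole position_ids tensor and no spmd inference for the expansion is needed.
position_ids = dist.shard_tensor(position_ids, mesh, [dist.Replicate(), dist.Replicate()])
```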
paddlenlp/trainer/trainer.py
Outdated
        if self.args.use_auto_parallel and self.args.run_static_semi_auto:
            model = self._wrap_for_static(model, train_dataloader)

        self.model = model
Suggested change:

        if self.args.use_auto_parallel and self.args.run_static_semi_auto:
            model = self._wrap_for_static(model, train_dataloader)
        self.model = model
Done
Codecov Report

Attention:

Additional details and impacted files

@@            Coverage Diff             @@
##            develop    #7885    +/-   ##
===========================================
- Coverage     56.80%   56.57%   -0.23%
===========================================
  Files           588      589       +1
  Lines         89536    89900     +364
===========================================
+ Hits          50858    50865       +7
- Misses        38678    39035     +357

☔ View full report in Codecov by Sentry.
    def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, **kwargs):
        if self.control.should_log:

            logs: Dict[str, float] = {}

            # all_gather + mean() to get average loss over all processes
-           tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
+           tr_loss_scalar = self._get_item_from_loss(self._nested_gather(tr_loss).mean())
PaddleNLP/paddlenlp/trainer/trainer.py, lines 1199 to 1209 in fe6b45d:

    def _get_item_from_loss(self, loss):
        assert isinstance(loss, paddle.Tensor) and loss._is_initialized()
        return loss.item()

    def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, **kwargs):
        if self.control.should_log:
            logs: Dict[str, float] = {}

            # all_gather + mean() to get average loss over all processes
            tr_loss_scalar = self._get_item_from_loss(self._nested_gather(tr_loss).mean())
I see you reuse the _maybe_log_save_evaluate function here, and it is already wrapped in a guard outside. Why add the assert isinstance(loss, paddle.Tensor) and loss._is_initialized() check here?
This can be removed here; it's enough to override the semi-auto check logic in auto_trainer.
Could you please remove it?
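To make the suggested direction concrete, a hedged sketch of how the override in auto_trainer.py could look; the guard condition and the 0.0 fallback are illustrative placeholders, not the PR's final code:

```python
# Hedged sketch: keep the base Trainer free of the assert and handle the
# semi-auto case in the AutoTrainer subclass instead.
import paddle
from paddlenlp.trainer import Trainer


class AutoTrainer(Trainer):
    def _get_item_from_loss(self, loss):
        # In semi-auto runs the loss tensor may not be initialized on every
        # rank (e.g. non-last pipeline stages), so guard before calling .item().
        if isinstance(loss, paddle.Tensor) and loss._is_initialized():
            return loss.item()
        return 0.0  # illustrative fallback for ranks without a materialized loss
```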
paddlenlp/trainer/training_args.py
Outdated
@@ -747,6 +748,8 @@ class TrainingArguments:
         default=False,
         metadata={"help": "reshard pp even if pp degree in the model and pp degree in script match"},
     )
+    parallel_mode: str = field(default="hybrid", metadata={"help": ""})
Could you write a more detailed help comment here? Note that it is only used for auto parallel or semi-auto parallel, and list the available options.
Done
paddlenlp/trainer/training_args.py
Outdated
@@ -747,6 +748,8 @@ class TrainingArguments:
         default=False,
         metadata={"help": "reshard pp even if pp degree in the model and pp degree in script match"},
     )
+    parallel_mode: str = field(default="hybrid", metadata={"help": ""})
+    run_static_semi_auto: bool = field(default=True, metadata={"help": ""})
And what exactly does this parameter mean? Is it possible to merge the two options?
This parameter distinguishes dynamic semi-auto execution from static semi-auto execution. The default is True, meaning the end-to-end training flow runs in static semi-auto mode; if manually set to False, training runs in dynamic semi-auto mode, which makes it easier for users to debug modules such as the network sharding annotations.
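Based on that explanation, a hedged sketch of what more descriptive field definitions might look like; the help wording is illustrative, not the PR's final text, and the dataclass is stripped down to just these two fields:

```python
# Hedged sketch of documented fields for the two flags under discussion.
from dataclasses import dataclass, field


@dataclass
class TrainingArguments:
    parallel_mode: str = field(
        default="hybrid",
        metadata={
            "help": "Which parallel path to use: 'hybrid' for the manual hybrid-parallel "
            "trainer, 'auto' for the (semi-)auto-parallel trainer."
        },
    )
    run_static_semi_auto: bool = field(
        default=True,
        metadata={
            "help": "Only used with semi-auto parallel. True runs the end-to-end "
            "static-graph flow; False stays in dynamic graph, which is easier for "
            "debugging sharding annotations."
        },
    )
```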
if kwargs.get("args", None) is not None and kwargs["args"].run_static_semi_auto: | ||
if kwargs.get("criterion", None) is None: | ||
|
||
def loss_func(loss, outputs): |
Do we need to define a criterion? In current PaddleNLP models the loss is basically computed inside the model, so we don't define it separately.
The static semi-auto architecture needs a dummy criterion to run; even one that simply returns the loss is fine.
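Concretely, that pass-through criterion can be as small as the following sketch; the trainer call shown in the comment is hypothetical usage, not the PR's exact wiring:

```python
# Minimal sketch of the "dummy" criterion described above: the static semi-auto
# engine needs a loss callable even when the model already computes the loss,
# so simply pass the model's own loss through unchanged.
def loss_func(loss, outputs):
    return loss


# Hypothetical usage (argument names are assumptions):
# trainer = AutoTrainer(model=model, criterion=loss_func, args=training_args, ...)
```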
        total_batch_size_per_acc_step = self.args.per_device_train_batch_size * self.args.dataset_world_size
        total_batch_size = total_batch_size_per_acc_step * self.args.gradient_accumulation_steps
        batch_size = total_batch_size if self.args.run_static_semi_auto else total_batch_size_per_acc_step
I don't quite follow. Are you saying gradient accumulation is controlled inside run_static_semi_auto, so the batch size is larger? Could turning run_static_semi_auto on or off lead to data inconsistency?
Updated: the batch sampler now uniformly receives the global batch size. Apart from the static semi-auto PP strategy, gradient accumulation in all other scenarios reads out the global batch, splits it along the batch dim, and then runs a for loop over the splits to complete the accumulation.
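To spell out that pattern, here is a dynamic-graph sketch with hypothetical names (not the PR's code): one global batch is split along the batch dim into micro-batches, and gradients accumulate across the loop before a single optimizer step.

```python
# Hedged sketch of gradient accumulation over a split global batch.
import paddle


def run_acc_step(model, criterion, optimizer, inputs, labels, acc_steps):
    micro_inputs = paddle.split(inputs, num_or_sections=acc_steps, axis=0)
    micro_labels = paddle.split(labels, num_or_sections=acc_steps, axis=0)
    for x, y in zip(micro_inputs, micro_labels):
        loss = criterion(model(x), y) / acc_steps  # scale so gradients average out
        loss.backward()                            # gradients accumulate in place
    optimizer.step()
    optimizer.clear_grad()
```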
paddlenlp/trainer/training_args.py
Outdated
            )
        },
    )
    run_static_auto: bool = field(default=True, metadata={"help": "whether to run static graph in auto parallel mode"})
How about:
"hybrid"
"auto"
"auto_static"
"auto_semi"
"auto_semi_static"
LGTM
PR types
Bug fixes
PR changes
Others
Description
[Auto Parallel] Support semi-auto trainer and fit Llama2 training