[Auto Parallel] Support semi-auto trainer and fit Llama2 training #7885
Conversation
Thanks for your contribution!
The changes here are fairly large, so I'll request changes first.
        )

        return optimizer

    def _wrap_dist_loader(self, train_dataloader):
Is it not used in dynamic mode?
Done, it's now used in both dynamic and static modes.
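For readers following the thread, here is a hedged sketch of what such a wrapper can look like on top of Paddle's semi-auto `shard_dataloader` API; the `self._get_meshes()` helper name is an assumption (built roughly like the `meshes.append(_get_mesh(pp_idx))` snippet quoted below), not necessarily the PR's exact code:

```python
# Hedged sketch, not the PR's exact implementation: wrap the train dataloader
# with Paddle's semi-auto shard_dataloader so each rank reads its "dp" slice.
import paddle.distributed as dist


def _wrap_dist_loader(self, train_dataloader):
    # self._get_meshes() is assumed to return one ProcessMesh per pipeline
    # stage (e.g. built by appending _get_mesh(pp_idx) for each stage).
    return dist.shard_dataloader(
        train_dataloader,
        meshes=self._get_meshes(),
        shard_dims="dp",  # split batches along the data-parallel mesh dimension
    )
```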
paddlenlp/trainer/auto_trainer.py
Outdated
            meshes.append(_get_mesh(pp_idx))
        return meshes

    def _wrap_dist_loader(self, train_dataloader):
What's the difference with the _wrap_dist_loader in run_pretrain_3D_auto.py?
paddlenlp/trainer/auto_trainer.py
Outdated
shard_dims="dp", | ||
) | ||
|
||
def _wrap_for_static(self, model, train_dataloader): |
It seems to be unused?
It's called in Trainer (paddlenlp/trainer/trainer.py) to wrap the model into a DistModel in static mode.
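For context, a rough sketch of that wrapping path, assuming Paddle's `paddle.distributed.to_static` entry point; the attribute names and the return shape here are illustrative, not the PR's exact code:

```python
# Hedged sketch of the static-mode wrapping path (self.criterion and
# self.optimizer are assumed attributes on the trainer).
import paddle.distributed as dist


def _wrap_for_static(self, model, train_dataloader):
    dist_loader = self._wrap_dist_loader(train_dataloader)
    # dist.to_static lowers the dynamic-graph layer plus loader, loss and
    # optimizer into a DistModel that executes the static semi-auto program.
    dist_model = dist.to_static(model, dist_loader, self.criterion, self.optimizer)
    return dist_model, dist_loader
```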
@@ -939,15 +877,16 @@ def forward(
         if position_ids is None:
             position_ids = paddle.arange(seq_length, dtype="int64").expand((batch_size, seq_length))
             # NOTE(zhaoyingli): infer spmd does not support [seq_len] --> [batch, seq_len] in data_parallel
-            position_ids = dist.shard_tensor(position_ids, get_mesh(), [dist.Shard(0), dist.Replicate()])
+            position_ids = dist.shard_tensor(position_ids, get_mesh(), [dist.Replicate(), dist.Replicate()])
Why change it to replicated?
Because in static mode, infer spmd does not yet support the "[seq_len] --> [batch, seq_len]" case.
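To spell out the two placements being discussed, here is a hedged, self-contained sketch; the mesh shape, rank ids, and tensor sizes are made up for illustration, and it assumes a 4-rank run under `python -m paddle.distributed.launch`:

```python
# Illustrative sketch of the placement change, assuming a 2-D ["dp", "mp"] mesh.
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])
position_ids = paddle.arange(1024, dtype="int64").expand((4, 1024))

# Old placement: shard the batch dim over "dp". This relies on infer-spmd
# propagating the [seq_len] -> [batch, seq_len] expansion, which static mode
# does not support yet.
# position_ids = dist.shard_tensor(position_ids, mesh, [dist.Shard(0), dist.Replicate()])

# New placement: fully replicated on both mesh dims, so every rank holds the
# whole position_ids tensor and no spmd inference for the expansion is needed.
position_ids = dist.shard_tensor(position_ids, mesh, [dist.Replicate(), dist.Replicate()])
```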
paddlenlp/trainer/trainer.py
Outdated
        if self.args.use_auto_parallel and self.args.run_static_semi_auto:
            model = self._wrap_for_static(model, train_dataloader)

        self.model = model
Suggested change:

        if self.args.use_auto_parallel and self.args.run_static_semi_auto:
            model = self._wrap_for_static(model, train_dataloader)
        self.model = model
Done
Codecov Report

Attention:

Additional details and impacted files

@@            Coverage Diff             @@
##            develop    #7885    +/-   ##
===========================================
- Coverage     56.80%   56.57%   -0.23%
===========================================
  Files           588      589       +1
  Lines         89536    89900     +364
===========================================
+ Hits          50858    50865       +7
- Misses        38678    39035     +357

☔ View full report in Codecov by Sentry.
    def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, **kwargs):
        if self.control.should_log:

            logs: Dict[str, float] = {}

            # all_gather + mean() to get average loss over all processes
-           tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
+           tr_loss_scalar = self._get_item_from_loss(self._nested_gather(tr_loss).mean())
PaddleNLP/paddlenlp/trainer/trainer.py, lines 1199 to 1209 in fe6b45d:

    def _get_item_from_loss(self, loss):
        assert isinstance(loss, paddle.Tensor) and loss._is_initialized()
        return loss.item()

    def _maybe_log_save_evaluate(self, tr_loss, model, epoch, ignore_keys_for_eval, **kwargs):
        if self.control.should_log:
            logs: Dict[str, float] = {}

            # all_gather + mean() to get average loss over all processes
            tr_loss_scalar = self._get_item_from_loss(self._nested_gather(tr_loss).mean())
I see you reuse the _maybe_log_save_evaluate function here, and it is already wrapped in a guard outside. Why add the assert isinstance(loss, paddle.Tensor) and loss._is_initialized() check here?
This can be removed here; it's enough to override the semi-auto check logic in auto_trainer.
Could you please remove it?
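To make the suggested direction concrete, a hedged sketch of how the override in auto_trainer.py could look; the guard condition and the 0.0 fallback are illustrative placeholders, not the PR's final code:

```python
# Hedged sketch: keep the base Trainer free of the assert and handle the
# semi-auto case in the AutoTrainer subclass instead.
import paddle
from paddlenlp.trainer import Trainer


class AutoTrainer(Trainer):
    def _get_item_from_loss(self, loss):
        # In semi-auto runs the loss tensor may not be initialized on every
        # rank (e.g. non-last pipeline stages), so guard before calling .item().
        if isinstance(loss, paddle.Tensor) and loss._is_initialized():
            return loss.item()
        return 0.0  # illustrative fallback for ranks without a materialized loss
```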
paddlenlp/trainer/training_args.py
Outdated
@@ -747,6 +748,8 @@ class TrainingArguments:
         default=False,
         metadata={"help": "reshard pp even if pp degree in the model and pp degree in script match"},
     )
+    parallel_mode: str = field(default="hybrid", metadata={"help": ""})
Could you write a more detailed help comment here? Note that it is only used for auto parallel or semi-auto parallel, and list the available options.
Done
paddlenlp/trainer/training_args.py
Outdated
@@ -747,6 +748,8 @@ class TrainingArguments:
         default=False,
         metadata={"help": "reshard pp even if pp degree in the model and pp degree in script match"},
     )
+    parallel_mode: str = field(default="hybrid", metadata={"help": ""})
+    run_static_semi_auto: bool = field(default=True, metadata={"help": ""})
And what exactly does this parameter mean? Is it possible to merge the two options?
This parameter distinguishes dynamic semi-auto execution from static semi-auto execution. The default is True, meaning the end-to-end training flow runs in static semi-auto mode; if manually set to False, training runs in dynamic semi-auto mode, which makes it easier for users to debug modules such as the network sharding annotations.
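Based on that explanation, a hedged sketch of what more descriptive field definitions might look like; the help wording is illustrative, not the PR's final text, and the dataclass is stripped down to just these two fields:

```python
# Hedged sketch of documented fields for the two flags under discussion.
from dataclasses import dataclass, field


@dataclass
class TrainingArguments:
    parallel_mode: str = field(
        default="hybrid",
        metadata={
            "help": "Which parallel path to use: 'hybrid' for the manual hybrid-parallel "
            "trainer, 'auto' for the (semi-)auto-parallel trainer."
        },
    )
    run_static_semi_auto: bool = field(
        default=True,
        metadata={
            "help": "Only used with semi-auto parallel. True runs the end-to-end "
            "static-graph flow; False stays in dynamic graph, which is easier for "
            "debugging sharding annotations."
        },
    )
```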
if kwargs.get("args", None) is not None and kwargs["args"].run_static_semi_auto: | ||
if kwargs.get("criterion", None) is None: | ||
|
||
def loss_func(loss, outputs): |
Do we need to define a criterion? In current PaddleNLP models the loss is basically computed inside the model, so we don't define it separately.
The static semi-auto architecture needs a dummy criterion to run; even one that simply returns the loss is fine.
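Concretely, that pass-through criterion can be as small as the following sketch; the trainer call shown in the comment is hypothetical usage, not the PR's exact wiring:

```python
# Minimal sketch of the "dummy" criterion described above: the static semi-auto
# engine needs a loss callable even when the model already computes the loss,
# so simply pass the model's own loss through unchanged.
def loss_func(loss, outputs):
    return loss


# Hypothetical usage (argument names are assumptions):
# trainer = AutoTrainer(model=model, criterion=loss_func, args=training_args, ...)
```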
        total_batch_size_per_acc_step = self.args.per_device_train_batch_size * self.args.dataset_world_size
        total_batch_size = total_batch_size_per_acc_step * self.args.gradient_accumulation_steps
        batch_size = total_batch_size if self.args.run_static_semi_auto else total_batch_size_per_acc_step
I don't quite follow. Are you saying gradient accumulation is controlled inside run_static_semi_auto, so the batch size is larger? Could turning run_static_semi_auto on or off lead to data inconsistency?
Updated: the batch sampler now uniformly receives the global batch size. Apart from the static semi-auto PP strategy, gradient accumulation in all other scenarios reads out the global batch, splits it along the batch dim, and then runs a for loop over the splits to complete the accumulation.
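To spell out that pattern, here is a dynamic-graph sketch with hypothetical names (not the PR's code): one global batch is split along the batch dim into micro-batches, and gradients accumulate across the loop before a single optimizer step.

```python
# Hedged sketch of gradient accumulation over a split global batch.
import paddle


def run_acc_step(model, criterion, optimizer, inputs, labels, acc_steps):
    micro_inputs = paddle.split(inputs, num_or_sections=acc_steps, axis=0)
    micro_labels = paddle.split(labels, num_or_sections=acc_steps, axis=0)
    for x, y in zip(micro_inputs, micro_labels):
        loss = criterion(model(x), y) / acc_steps  # scale so gradients average out
        loss.backward()                            # gradients accumulate in place
    optimizer.step()
    optimizer.clear_grad()
```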
paddlenlp/trainer/training_args.py
Outdated
            )
        },
    )
    run_static_auto: bool = field(default=True, metadata={"help": "whether to run static graph in auto parallel mode"})
How about:
"hybrid"
"auto"
"auto_static"
"auto_semi"
"auto_semi_static"
LGTM
PR types
Bug fixes
PR changes
Others
Description
[Auto Parallel] Support semi-auto trainer and fit Llama2 training