Fix the bug when using 0-D tensor in MoE model #5538
Conversation
Thanks for your contribution!
Codecov Report

@@            Coverage Diff             @@
##           develop    #5538      +/-   ##
===========================================
- Coverage    61.94%   59.43%    -2.52%
===========================================
  Files          491      482        -9
  Lines        69118    68103     -1015
===========================================
- Hits         42817    40475     -2342
- Misses       26301    27628     +1327
-  fuse_attn_qkv: False
   fused_linear: False
+  fuse_attn_qkv: True
+  scale_qk_by_layer_num: True
Confirm which of these options are actually configurable under auto parallel before deciding whether to add them.
sequence_parallel was unused, so it has been removed; the corresponding parameter in the model definition has been removed as well.
@@ -2,8 +2,8 @@ _base_: ./pretrain_gpt_base.yaml

Global:
  global_batch_size:
  local_batch_size: 8
  micro_batch_size: 8
  local_batch_size: 4
Where possible, configure this from the bash script via -o Global.local_batch_size (for example, appending -o Global.local_batch_size=4 to the launch command) instead of modifying the yaml.
The yaml has been reverted; the run scripts under projects/gpt now set this configuration instead.
@@ -2,8 +2,8 @@ _base_: ./pretrain_gpt_base.yaml

Global:
  global_batch_size:
  local_batch_size: 8
  micro_batch_size: 8
  local_batch_size: 4
Same as above.
max_seq_len: 1024

sampler:
Auto parallel needs neither the sampler config nor the loader config.
Removed sampler and loader.
Eval:
  collate_fn: gpt_collate_fn
  sample_split: 2
The two lines above must not be deleted.
Restored.
-        if self.use_recompute and self.recompute_granularity == "core_attn":
-            out, weights = auto.recompute(self.core_attn)(q, k, v, attn_mask=attn_mask)
+        if self.use_recompute and self.recompute_granularity == "core_attn" and self.do_recompute:
+            out, weights = recompute(self.core_attn, q, k, v, attn_mask)
The auto parallel recompute interface is different from the dynamic-graph one. This must not be changed here.
Changed it back to the auto parallel interface.
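For context, a minimal runnable sketch of the two calling conventions (the single-layer setup is illustrative; the two call forms mirror the diff above, and the auto parallel import path is an assumption):

import paddle
from paddle.distributed.fleet.utils import recompute

layer = paddle.nn.Linear(4, 4)
x = paddle.randn([2, 4])
x.stop_gradient = False

# Dynamic-graph interface: pass the callable and its inputs in one call.
out = recompute(layer, x)

# Auto parallel interface: auto.recompute wraps the callable and returns a
# new callable that is then invoked with the inputs. It only runs inside an
# auto parallel program, so it is shown as a comment here:
#   from paddle.distributed.fleet import auto
#   out = auto.recompute(layer)(x)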
@@ -1004,7 +1137,6 @@ def _post_process_(outputs, input_ids, cur_len, origin_len, scores, unfinished_f
     # make the shape of attention_mask = (-1, -1, -1, -1) in dy2static.
     model_kwargs["attention_mask"] = paddle.reshape(attn_mask, paddle.shape(attn_mask))
     model_kwargs["cache"] = outputs[1] if isinstance(outputs, tuple) else None
-    max_length = paddle.to_tensor(max_length)
This line must not be deleted; dynamic-to-static conversion needs it.
Restored.
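A minimal sketch of why the line matters (the loop and names are illustrative, not the repo's generate()): under dynamic-to-static conversion a plain Python int is folded into the program as a compile-time constant, while a tensor stays a runtime variable.

import paddle

max_length = 64
# Wrapping the Python int in a tensor keeps the loop bound a graph variable
# after dy2static, instead of a constant baked in at trace time.
max_length = paddle.to_tensor(max_length)

cur_len = paddle.to_tensor(0)
while cur_len < max_length:
    cur_len += 1
print(cur_len)  # 64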
         # early finish should be True in generation scenes,
         # If users want to test the inference speed, you can just set it False.
-        if self.early_finish and not paddle.any(unfinished_flag):
+        if not paddle.any(unfinished_flag):
             break
This must not be changed either.
Reverted.
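A minimal runnable sketch of the check that was kept (unfinished_flag is illustrative: one boolean per sequence in the batch):

import paddle

early_finish = True
# One flag per sequence; False means that sequence has already finished.
unfinished_flag = paddle.to_tensor([False, False, False])

# With early_finish=True, generation stops once every sequence is done.
# Setting early_finish=False keeps the loop running, which the comment in
# the diff suggests is useful for benchmarking inference speed.
if early_finish and not paddle.any(unfinished_flag):
    print("all sequences finished, stop generating")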
        loss_mask = loss_mask.reshape([-1])
        masked_lm_loss = paddle.sum(masked_lm_loss.reshape([-1]) * loss_mask)
        loss = masked_lm_loss / loss_mask.sum()
        return loss


class GPTForSequenceClassification(nn.Layer):
Auto parallel does not support this task yet; it can be removed.
Deleted.
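As an aside, a minimal runnable sketch of the masked-mean loss shown in the context lines above (the tensor values are illustrative):

import paddle

masked_lm_loss = paddle.to_tensor([[0.5, 1.5], [2.0, 4.0]])
loss_mask = paddle.to_tensor([[1.0, 1.0], [1.0, 0.0]])  # last token ignored

# Flatten, zero out masked positions, and average over the kept ones.
loss_mask = loss_mask.reshape([-1])
masked_lm_loss = paddle.sum(masked_lm_loss.reshape([-1]) * loss_mask)
loss = masked_lm_loss / loss_mask.sum()
print(loss)  # (0.5 + 1.5 + 2.0) / 3 = 1.3333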
@@ -620,7 +723,7 @@ class GPTPretrainingCriterionAuto(nn.Layer):
     Criterion for GPT. It calculates the final loss.
     """

-    def __init__(self, mesh):
+    def __init__(self, mesh, topo=None):
topo is no longer needed.
Deleted.
Force-pushed from 1c79fd7 to 2ed0e71
LGTM
LGTM
LGTM
LGTM
LGTM
PR types
Others
PR changes
Others
Description
Fix the bug when using 0-D tensor in MoE model
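For readers outside the thread, a minimal sketch of the 0-D versus 1-D distinction behind the fix (the exact MoE call sites are in the diff; this example is illustrative and assumes a Paddle version in which scalars are true 0-D tensors):

import paddle

x0 = paddle.to_tensor(2.0)    # 0-D tensor: shape []
x1 = paddle.full([1], 2.0)    # 1-D tensor: shape [1]
print(x0.shape, x1.shape)     # [] vs [1]

# Code written against shape [1] breaks on 0-D inputs, e.g. indexing:
print(x1[0])                  # fine
# print(x0[0])                # fails: a 0-D tensor has no axis to index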