
Fix the bug when using 0-D tensor in MoE model #5538

Merged
1 commit merged into PaddlePaddle:develop on May 9, 2023

Conversation

@pkuzyc (Contributor) commented Apr 5, 2023

PR types

Others

PR changes

Others

Description

Fix the bug when using 0-D tensor in MoE model

@paddle-bot bot commented Apr 5, 2023

Thanks for your contribution!

@pkuzyc pkuzyc changed the title fix the bug in lr_scheduler init and fix the diff of GPT model in aut… [AutoParallel] fix the diff in lr_scheduler and GPT model Apr 5, 2023
@codecov bot commented Apr 6, 2023

Codecov Report

Merging #5538 (aa4cedc) into develop (b7246e1) will decrease coverage by 2.52%.
The diff coverage is n/a.

❗ Current head aa4cedc differs from pull request most recent head 67389b6. Consider uploading reports for the commit 67389b6 to get more accurate results

@@             Coverage Diff             @@
##           develop    #5538      +/-   ##
===========================================
- Coverage    61.94%   59.43%   -2.52%     
===========================================
  Files          491      482       -9     
  Lines        69118    68103    -1015     
===========================================
- Hits         42817    40475    -2342     
- Misses       26301    27628    +1327     

see 123 files with indirect coverage changes

fuse_attn_qkv: False
fused_linear: False
fuse_attn_qkv: True
scale_qk_by_layer_num: True
Collaborator:

Confirm which of these options are actually configurable in auto parallel, then decide whether to add them.

Contributor Author:

sequence_parallel is not used, so it has been removed; the corresponding parameter in the model-building code has also been removed.

@@ -2,8 +2,8 @@ _base_: ./pretrain_gpt_base.yaml

Global:
global_batch_size:
local_batch_size: 8
micro_batch_size: 8
local_batch_size: 4
Collaborator:

Whenever possible, configure this in the bash script via -o Global.local_batch_size instead of modifying the yaml.

Contributor Author:

The yaml has been reverted; the run scripts under projects/gpt were modified to change the configuration instead.
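For illustration only, a toy sketch of how a dotted override such as -o Global.local_batch_size=4 can be layered on top of the yaml defaults at launch time; the apply_override helper below is hypothetical and is not the actual PaddleFleetX option parser:

```python
def apply_override(config, override):
    """Apply a dotted 'Section.key=value' override to a nested config dict."""
    path, value = override.split("=", 1)
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Keep integers numeric so batch sizes stay usable as ints.
    node[keys[-1]] = int(value) if value.isdigit() else value
    return config


# Defaults as they appear in the yaml above.
config = {"Global": {"global_batch_size": None,
                     "local_batch_size": 8,
                     "micro_batch_size": 8}}
apply_override(config, "Global.local_batch_size=4")
print(config["Global"]["local_batch_size"])  # 4
```

Applying the override at launch time keeps the checked-in yaml defaults untouched, which is what the review above asks for.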

@@ -2,8 +2,8 @@ _base_: ./pretrain_gpt_base.yaml

Global:
global_batch_size:
local_batch_size: 8
micro_batch_size: 8
local_batch_size: 4
Collaborator:

Same as above.

max_seq_len: 1024

sampler:
Collaborator:

Auto parallel does not need the sampler config, nor the loader config.

Contributor Author:

Removed sampler and loader.

Eval:
collate_fn: gpt_collate_fn
sample_split: 2
Collaborator:

The two lines above must not be deleted.

Contributor Author:

Added them back.

if self.use_recompute and self.recompute_granularity == "core_attn":
out, weights = auto.recompute(self.core_attn)(q, k, v, attn_mask=attn_mask)
if self.use_recompute and self.recompute_granularity == "core_attn" and self.do_recompute:
out, weights = recompute(self.core_attn, q, k, v, attn_mask)
Collaborator:

The auto parallel recompute interface is different from the dynamic-graph one. This must not be changed here.

Contributor Author:

Changed it to the auto parallel interface.
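For reference, a minimal sketch contrasting the two call styles in the diff above, assuming a Paddle 2.4-era install with a GPU; core_attn and the random tensors are toy stand-ins, and auto.recompute only takes real effect when the model runs through the auto parallel engine:

```python
import paddle
import paddle.nn.functional as F
from paddle.distributed.fleet import auto             # auto parallel interface
from paddle.distributed.fleet.utils import recompute  # dynamic-graph interface


def core_attn(q, k, v, attn_mask=None):
    # Toy attention: scores -> softmax -> weighted sum of values.
    weights = F.softmax(paddle.matmul(q, k, transpose_y=True))
    return paddle.matmul(weights, v), weights


q = k = v = paddle.randn([2, 4, 8])
q.stop_gradient = False

# Auto parallel style (kept in this PR): wrap the callable first, then call it
# with the original arguments so the enclosed ops are marked for recompute.
out, weights = auto.recompute(core_attn)(q, k, v, attn_mask=None)

# Dynamic-graph style (reverted here): pass the callable and its arguments to
# recompute in a single call.
out, weights = recompute(core_attn, q, k, v, None)
```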

@@ -1004,7 +1137,6 @@ def _post_process_(outputs, input_ids, cur_len, origin_len, scores, unfinished_f
# make the shape of attention_mask = (-1, -1, -1, -1) in dy2static.
model_kwargs["attention_mask"] = paddle.reshape(attn_mask, paddle.shape(attn_mask))
model_kwargs["cache"] = outputs[1] if isinstance(outputs, tuple) else None
max_length = paddle.to_tensor(max_length)
Collaborator:

This line must not be deleted; dynamic-to-static conversion needs it.

Contributor Author:

Added it back.

# early finish should be True in generation scenes,
# If users want to test the inference speed, you can just set it False.
if self.early_finish and not paddle.any(unfinished_flag):
if not paddle.any(unfinished_flag):
break
Collaborator:

This must not be changed here either.

Contributor Author:

Reverted it.
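A condensed sketch of the generation-loop pattern restored in the two threads above: max_length is wrapped in a tensor for dynamic-to-static conversion, and the loop breaks early once every sequence has finished. The decoding state below is a toy stand-in for the real model outputs:

```python
import paddle

# Toy decoding state: 4 sequences, none finished yet.
unfinished_flag = paddle.ones([4, 1], dtype="bool")
cur_len = paddle.to_tensor(0, dtype="int64")
early_finish = True

max_length = 8
# Keep the bound as a tensor so the loop condition stays a tensor op after
# dynamic-to-static conversion (the line restored above).
max_length = paddle.to_tensor(max_length, dtype="int64")

while cur_len < max_length:
    cur_len += 1
    # Pretend every sequence emits its end token on this step.
    unfinished_flag = paddle.zeros_like(unfinished_flag)

    # Early finish should be True in generation scenes; set it to False only
    # when benchmarking inference speed (the condition restored above).
    if early_finish and not paddle.any(unfinished_flag):
        break

print(int(cur_len))  # 1: the loop exits as soon as all sequences finish
```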

loss_mask = loss_mask.reshape([-1])
masked_lm_loss = paddle.sum(masked_lm_loss.reshape([-1]) * loss_mask)
loss = masked_lm_loss / loss_mask.sum()
return loss


class GPTForSequenceClassification(nn.Layer):
Collaborator:

Auto parallel does not support this task yet; it can be deleted.

Contributor Author:

Deleted.

@@ -620,7 +723,7 @@ class GPTPretrainingCriterionAuto(nn.Layer):
Criterion for GPT. It calculates the final loss.
"""

def __init__(self, mesh):
def __init__(self, mesh, topo=None):
Collaborator:

topo is no longer needed.

Contributor Author:

Deleted.

@pkuzyc pkuzyc force-pushed the develop branch 2 times, most recently from 1c79fd7 to 2ed0e71 on April 7, 2023 09:52
@pkuzyc pkuzyc changed the title [AutoParallel] fix the diff in lr_scheduler and GPT model Fix the bug when using 0-D tensor in MoE model May 9, 2023
@ZHUI (Collaborator) left a comment

LGTM

@ZHUI ZHUI merged commit 80cc859 into PaddlePaddle:develop May 9, 2023