
[Unified Checkpoint] Add split param and refactor code #9240

Merged
merged 21 commits into PaddlePaddle:develop from add_split_param
Oct 28, 2024

Conversation

DesmonDay
Contributor

@DesmonDay DesmonDay commented Oct 10, 2024

PR types

New features

PR changes

Others

Description

  1. Support sharding stage1 v2 (split_param) for unified checkpoint; a hedged configuration sketch follows below.
  2. Refactor unified checkpoint (uc) code.
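
Judging from the condition this PR checks elsewhere (ShardingOption.SHARD_OP in args.sharding plus "split_param" in args.sharding_parallel_config), a hedged sketch of the training arguments that exercise the new path might look like this; field names other than the ones quoted from the diff are assumptions:

# Hedged sketch: enable sharding stage1 v2 ("split_param") with unified
# checkpoint. Exact field spellings beyond those quoted in the diff are
# assumptions, not taken from this PR.
from paddlenlp.trainer import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    sharding="stage1",                        # maps to ShardingOption.SHARD_OP
    sharding_parallel_degree=8,               # > 1, so optimizer state is split
    sharding_parallel_config="split_param",   # the new split-param mode
    unified_checkpoint=True,                  # save/load the unified format
)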


paddle-bot bot commented Oct 10, 2024

Thanks for your contribution!

@CLAassistant

CLAassistant commented Oct 10, 2024

CLA assistant check
All committers have signed the CLA.


codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 11.13549% with 1620 lines in your changes missing coverage. Please review.

Project coverage is 52.84%. Comparing base (81ffc78) to head (dbd13df).
Report is 3 commits behind head on develop.

Files with missing lines                               | Patch % | Lines missing
paddlenlp/trainer/unified_checkpoint/utils.py          | 12.07%  | 364 ⚠️
...p/trainer/unified_checkpoint/unified_checkpoint.py  | 11.34%  | 297 ⚠️
...ddlenlp/trainer/unified_checkpoint/load_dynamic.py  |  9.44%  | 259 ⚠️
...r/unified_checkpoint/sharding_split_param_utils.py  |  7.97%  | 173 ⚠️
...nlp/trainer/unified_checkpoint/check_completion.py  |  9.37%  | 145 ⚠️
...dlenlp/trainer/unified_checkpoint/async_handler.py  | 11.32%  | 141 ⚠️
paddlenlp/trainer/unified_checkpoint/load_local.py     | 12.12%  | 116 ⚠️
...rainer/unified_checkpoint/load_save_single_card.py  | 15.32%  | 116 ⚠️
paddlenlp/utils/nested.py                              | 14.28%  |   6 ⚠️
paddlenlp/trainer/training_args.py                     |  0.00%  |   3 ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9240      +/-   ##
===========================================
+ Coverage    52.78%   52.84%   +0.06%     
===========================================
  Files          661      669       +8     
  Lines       106945   107240     +295     
===========================================
+ Hits         56450    56671     +221     
- Misses       50495    50569      +74     


@ZHUI ZHUI requested review from ZHUI and DrownFish19 October 11, 2024 06:42
@@ -909,7 +983,160 @@ def unified_checkpoint_into_shards(
return state_dict, shard_file, sharded_index


def load_unified_optimizer_split_param(args, model, optimizer, resume_from_checkpoint):
Collaborator

Isn't most of this function's logic similar to load_unified_optimizer_locally?

Contributor Author

This is still in early development; it will be revised later.

@DesmonDay DesmonDay changed the title [Unified Checkpoint] Add split param [WIP][Unified Checkpoint] Add split param Oct 12, 2024
@DesmonDay DesmonDay force-pushed the add_split_param branch 2 times, most recently from 3abfe71 to 9bce15b on October 14, 2024 09:13
@DesmonDay DesmonDay changed the title [WIP][Unified Checkpoint] Add split param [WIP][Unified Checkpoint] Add split param, refactor code Oct 14, 2024
@DesmonDay DesmonDay changed the title [WIP][Unified Checkpoint] Add split param, refactor code [WIP][Unified Checkpoint] Add split param and refactor code Oct 14, 2024
@@ -0,0 +1,493 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
Collaborator

There are quite a few files now; let's just create a dedicated directory for them: paddlenlp/trainer/unified_checkpoint/

Contributor Author

done
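
For reference, the resulting package layout (inferred from the Codecov file list above, plus an __init__.py implied by the import discussed later in this review):

paddlenlp/trainer/unified_checkpoint/
    __init__.py
    unified_checkpoint.py
    utils.py
    load_dynamic.py
    load_local.py
    load_save_single_card.py
    sharding_split_param_utils.py
    check_completion.py
    async_handler.py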

load_single_card_optimizer,
save_single_card_checkpoint,
save_single_card_optimizer,
)
Collaborator

The local-load functions are also in this file, right? Should they be split out as well?

@@ -406,30 +402,21 @@ def load_unified_checkpoint(self, model, optimizer, resume_from_checkpoint: str)
None
"""
if paddle.distributed.get_world_size() <= 1:
load_single_card_checkpoint(self.args, model, resume_from_checkpoint)
Collaborator

Why is args no longer needed?

Contributor Author

Single-card loading doesn't need to read args; the parameter was redundant.

return

if self.args.dataset_rank == 0 or self.args.use_expert_parallel:
load_unified_checkpoint_locally(self.args, model, resume_from_checkpoint, safe_serialization=True)
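
Per the thread above, the single-card branch presumably simplifies to the following (the arg-free signature is inferred from the discussion, not quoted from the diff):

# Hedged sketch: single-card load no longer takes args.
if paddle.distributed.get_world_size() <= 1:
    load_single_card_checkpoint(model, resume_from_checkpoint)
    return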

def save_non_merge_optimizer(self, model, optimizer, output_dir, signal_dir):
def save_non_merge_optimizer(self, model, optim_state_dict, master_weights, output_dir, signal_dir):
Collaborator

Has the place where master_weights is extracted changed?

args.sharding_parallel_degree > 1
and ShardingOption.SHARD_OP in args.sharding
and "split_param" in args.sharding_parallel_config
):
Collaborator

Let's define this check as a helper function; it appears many times.

Contributor Author

done
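
A minimal sketch of the helper this likely became (the function name and import path are assumptions; the condition is verbatim from the diff):

# Hedged sketch: wrap the repeated sharding stage1 v2 check in one place.
from paddlenlp.trainer.trainer_utils import ShardingOption  # import path assumed

def is_sharding_split_param_mode(args):
    return (
        args.sharding_parallel_degree > 1
        and ShardingOption.SHARD_OP in args.sharding
        and "split_param" in args.sharding_parallel_config
    )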


def distributed_send_recv_splited_param(
state_dict, partial_tensor_list, param_shape_info, send_table, recv_table, is_master_weights=False
):
Collaborator

Does this function merge the split parameters back together?

Contributor Author

Yes. I'll rename the function to merge_splited_param.
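
Conceptually, the function reassembles the parameter slices that sharding stage1 v2 scattered across ranks. The real implementation drives point-to-point send/recv using the send/recv tables in its signature; a much-simplified sketch of just the merge idea, assuming equal-length (padded) slices, is:

# Hedged sketch of the merge idea only, not the actual send/recv-table
# implementation. Assumes every rank holds an equal-length 1-D slice of a
# flattened tensor, as all_gather requires.
import paddle
import paddle.distributed as dist

def merge_splited_param_sketch(local_slice, full_shape, group=None):
    pieces = []
    dist.all_gather(pieces, local_slice, group=group)  # collect every rank's slice
    merged = paddle.concat(pieces, axis=0)             # stitch back in rank order
    return merged.reshape(full_shape)                  # restore the original shape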

return optim_state_dict, master_weights


def load_unified_optimizer_split_param(model, optimizer, resume_from_checkpoint):
Collaborator

The code that handles TP merging is still over in the original unified_checkpoint main entry, right?

Contributor Author

Yes.

get_optimizer_shard_files,
mapping_optimizer_tp_actions,
)

Collaborator

Add __all__?

)


def save_file_sync(state_dict, path):
Collaborator

Can functions like these be placed somewhere shared?

Contributor Author
@DesmonDay DesmonDay Oct 25, 2024

Since async saving isn't supported for single card, I wrote a small dedicated function just for the single-card case. We can look at merging them later; I won't handle it in this PR.
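
A plausible shape for that single-card helper (a sketch under assumptions, not the actual code; it writes synchronously since single card has no async path):

# Hedged sketch: synchronous single-card safetensors save, no signal files.
from safetensors.numpy import save_file

def save_file_sync(state_dict, path):
    # assumes tensors are numpy-convertible on this device
    state_dict = {k: v.numpy() for k, v in state_dict.items()}
    save_file(state_dict, path, metadata={"format": "np"})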

# save generation config
if model_to_save.can_generate():
model_to_save.generation_config.save_pretrained(output_dir)

Collaborator

Consider wrapping these config saves into a shared helper function too, so that a change is less likely to miss one of them. There are quite a few branches now.

Contributor Author

done
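
A sketch of the shared helper this suggests (the name and exact body are assumptions; the generation-config lines are quoted from the diff above):

# Hedged sketch: one helper so no save branch forgets a config file.
def save_model_configs(model_to_save, output_dir):
    model_to_save.config.save_pretrained(output_dir)
    if model_to_save.can_generate():
        model_to_save.generation_config.save_pretrained(output_dir)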

# See the License for the specific language governing permissions and
# limitations under the License.

from .unified_checkpoint import UnifiedCheckpointHandler
Collaborator

Add __all__ here?

Contributor Author

done
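
So the package __init__.py presumably ends up along these lines (a sketch; only the import line is quoted from the diff):

# paddlenlp/trainer/unified_checkpoint/__init__.py
from .unified_checkpoint import UnifiedCheckpointHandler

__all__ = ["UnifiedCheckpointHandler"]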

@DesmonDay DesmonDay changed the title [WIP][Unified Checkpoint] Add split param and refactor code [Unified Checkpoint] Add split param and refactor code Oct 25, 2024
Collaborator
@wawltor wawltor left a comment

LGTM

@wawltor wawltor merged commit c9d5673 into PaddlePaddle:develop Oct 28, 2024
8 of 12 checks passed
winter-wang pushed a commit to winter-wang/PaddleNLP that referenced this pull request Oct 28, 2024
…9240)

* [Unified checkpoint] update optimizer async save signal

* update paddlepaddle

* split param

* add save for split param

* fix save split_param

* add load uc split_param

* update uc files

* update uc files

* update split_param loading

* mkdir unified_checkpoint directory

* rename file

* update async handler

* update files

---------

Co-authored-by: gongenlei <gongenlei@baidu.com>
DesmonDay added a commit to DesmonDay/PaddleNLP that referenced this pull request Oct 28, 2024