[Unified checkpoint] Support sharding_comm_overlap #9391

Merged

DesmonDay
Contributor

PR types

Function optimization

PR changes

Others

Description

Support sharding_comm_overlap

DesmonDay and others added 20 commits October 16, 2024 15:29
* update async save signal

* fix async save hang
…addlePaddle#9034)

* [Unified Checkpoint] speed up loading checkpoint by multi-thread

* [Unified Checkpoint] speed up load by multi-thread
…9240)

* [Unified checkpoint] update optimizer async save signal

* update paddlepaddle

* split param

* add save for split param

* fix save split_param

* add load uc split_param

* update uc files

* update split_param loading

* mkdir unified_checkpoint directory

* rename file

* update async handler

* update files

---------

Co-authored-by: gongenlei <gongenlei@baidu.com>
…save_info.json location (PaddlePaddle#9321)

* [Unified checkpoint] update optimizer async save signal

* update async_save_info.json file place
* fix empty state_dict

* update sharding split_param

paddle-bot bot commented Nov 7, 2024

Thanks for your contribution!

@DesmonDay DesmonDay force-pushed the origin_release_3.0-beta2 branch from 01e491d to 5e67a54 Compare November 7, 2024 12:57

codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 19.23077% with 21 lines in your changes missing coverage. Please review.

Project coverage is 52.73%. Comparing base (d02c406) to head (e2c5765).
Report is 1 commit behind head on release/3.0-beta2.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddlenlp/trainer/trainer.py | 0.00% | 5 Missing ⚠️ |
| ...r/unified_checkpoint/sharding_split_param_utils.py | 37.50% | 5 Missing ⚠️ |
| paddlenlp/trainer/trainer_utils.py | 20.00% | 4 Missing ⚠️ |
| paddlenlp/trainer/training_args.py | 0.00% | 2 Missing ⚠️ |
| ...nlp/trainer/unified_checkpoint/check_completion.py | 0.00% | 2 Missing ⚠️ |
| paddlenlp/trainer/unified_checkpoint/utils.py | 33.33% | 2 Missing ⚠️ |
| paddlenlp/trainer/unified_checkpoint/load_local.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@                  Coverage Diff                  @@
##           release/3.0-beta2    #9391      +/-   ##
=====================================================
- Coverage              52.74%   52.73%   -0.01%     
=====================================================
  Files                    666      666              
  Lines                 107023   107032       +9     
=====================================================
+ Hits                   56446    56447       +1     
- Misses                 50577    50585       +8     


@lugimzzz lugimzzz force-pushed the origin_release_3.0-beta2 branch from 5e67a54 to 37f4bc5 Compare November 8, 2024 05:45
@lugimzzz lugimzzz force-pushed the origin_release_3.0-beta2 branch from 37f4bc5 to 5a03c3b Compare November 8, 2024 05:46
@DesmonDay DesmonDay merged commit 5f4dd96 into PaddlePaddle:release/3.0-beta2 Nov 14, 2024
3 of 5 checks passed
4 participants