fix ram efficient fsdp init #25686

pacman100 · 2023-08-23T14:27:40Z

What does this PR do?

Currently, when using Trainer, if the model is loaded before the TrainingArguments object is created, the torch distributed process group is not yet initialized. As a result, when FSDP is enabled via the accelerate config, the is_fsdp_enabled_and_dist_rank_0 function always returns False, so the model ends up initialized with random weights on all ranks. This results in NaN losses. It was quite a journey to uncover this bug. This PR fixes it.
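To illustrate the failure mode, here is a minimal pure-Python sketch of the decision logic, not the actual transformers code. The helper names (`is_fsdp_rank_0`, `ranks_loading_real_weights`) are hypothetical, and the use of the `ACCELERATE_USE_FSDP` environment variable as the FSDP switch is an assumption made for the example. It models the process group state as a plain boolean so that it runs without torch.

```python
import os


def is_fsdp_rank_0(dist_initialized: bool, rank: int, env=None) -> bool:
    """Hypothetical stand-in for a check like is_fsdp_enabled_and_dist_rank_0.

    Returns True only when FSDP is requested, the process group is
    initialized, and the caller is rank 0.
    """
    env = os.environ if env is None else env
    # Assumption: FSDP is requested through this environment variable.
    fsdp_requested = env.get("ACCELERATE_USE_FSDP", "false").lower() in ("1", "true", "yes")
    return fsdp_requested and dist_initialized and rank == 0


def ranks_loading_real_weights(world_size: int, dist_initialized: bool, env) -> list:
    """List the ranks that would load the pretrained weights."""
    return [r for r in range(world_size) if is_fsdp_rank_0(dist_initialized, r, env)]


env = {"ACCELERATE_USE_FSDP": "true"}

# Model loaded before the process group exists: no rank loads real weights.
before_init = ranks_loading_real_weights(4, dist_initialized=False, env=env)

# Model loaded after the process group exists: rank 0 loads real weights.
after_init = ranks_loading_real_weights(4, dist_initialized=True, env=env)

print(before_init, after_init)
```

In this sketch, an uninitialized process group means no rank is selected to load the pretrained weights, which mirrors the random-weights symptom described above.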

HuggingFaceDocBuilderDev · 2023-08-23T14:51:07Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

Works for me, thanks!

fix ram efficient fsdp init

8e269ff

pacman100 requested a review from sgugger August 23, 2023 14:27

sgugger approved these changes Aug 23, 2023

pacman100 merged commit b85b880 into main Aug 24, 2023

pacman100 deleted the smangrul/fsdp-fix branch August 24, 2023 06:00

marr75 mentioned this pull request Sep 7, 2023

Regression: Need to guard torch.distributed.is_initialized with torch.distributed.is_available #26039

Closed

4 tasks

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023

fix ram efficient fsdp init (huggingface#25686)

04da8ed

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023

fix ram efficient fsdp init (huggingface#25686)

e3b205f

EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023

fix ram efficient fsdp init (huggingface#25686)

0b61331
