destroy process group in `end_training` #3012

SunMarc · 2024-08-14T11:28:30Z

What does this PR do?

With the latest version of torch, running on multiple GPUs triggers a warning asking us to call destroy_process_group(). This PR fixes that by calling it in the end_training method. Note that this function needs to be called on all processes.

For trackers, we already execute the methods only on the main process. See here. So it should be safe to remove the on_main_process decorator from the end_training method.
However, I see that WandBTracker sets main_process_only = False. Is there a specific reason, @muellerzr?
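
To illustrate why the decorator has to go, here is a minimal, torch-free sketch. `ToyAccelerator`, the toy `on_main_process`, and `ranks_cleaned_up` are hypothetical stand-ins and not accelerate's actual code; the placeholder method only marks where torch.distributed.destroy_process_group() would be called.

```python
from functools import wraps


def on_main_process(method):
    """Toy stand-in: run the method only when self.is_main_process is True."""

    @wraps(method)
    def wrapper(self, *args, **kwargs):
        if self.is_main_process:
            return method(self, *args, **kwargs)
        return None  # every other rank silently skips the call

    return wrapper


class ToyAccelerator:
    """Hypothetical, simplified accelerator; one instance per simulated rank."""

    def __init__(self, rank, destroyed):
        self.rank = rank
        self.is_main_process = rank == 0
        self.destroyed = destroyed  # shared list of ranks that cleaned up

    def _destroy_process_group(self):
        # Placeholder for torch.distributed.destroy_process_group().
        self.destroyed.append(self.rank)

    @on_main_process
    def end_training_decorated(self):
        # Behaviour before this PR: only rank 0 reaches the cleanup.
        self._destroy_process_group()

    def end_training_undecorated(self):
        # Behaviour after this PR: every rank reaches the cleanup.
        self._destroy_process_group()


def ranks_cleaned_up(method_name, world_size=4):
    """Call the given method once per simulated rank and report who cleaned up."""
    destroyed = []
    for rank in range(world_size):
        getattr(ToyAccelerator(rank, destroyed), method_name)()
    return destroyed
```

With the decorator in place, only rank 0 would destroy its process group and the other ranks would still trigger the warning; without it, every rank cleans up.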

HuggingFaceDocBuilderDev · 2024-08-14T11:33:52Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan

With the removal of @on_main_process, would tracker.finish() not be called on all processes where previously that was only called on the main process?

muellerzr

Nice!

muellerzr · 2024-08-15T11:57:59Z

All of the trackers already come with @on_main_process decorators on their functions (where needed/applicable), so this is fine as it is and is the right approach: https://github.com/huggingface/accelerate/blob/main/src/accelerate/tracking.py#L571-L576
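
A rough sketch of that pattern, assuming a tracker-level decorator that consults a `main_process_only` attribute. `ToyTracker`, `finished_ranks`, and the toy `on_main_process` below are hypothetical and simplified; they are not the code in tracking.py.

```python
from functools import wraps


def on_main_process(method):
    """Toy tracker decorator: skip the call on non-main ranks
    when the tracker declares main_process_only = True."""

    @wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, "main_process_only", False) and not self.is_main_process:
            return None
        return method(self, *args, **kwargs)

    return wrapper


class ToyTracker:
    """Hypothetical tracker; one instance per simulated rank."""

    def __init__(self, rank, main_process_only=True):
        self.rank = rank
        self.is_main_process = rank == 0
        self.main_process_only = main_process_only
        self.finished = False

    @on_main_process
    def finish(self):
        self.finished = True


def end_training(trackers):
    # Undecorated, as in this PR: it runs on every rank and relies on
    # each tracker's own decorator to filter the call.
    for tracker in trackers:
        tracker.finish()


def finished_ranks(world_size=3, main_process_only=True):
    """Simulate end_training on each rank and list the ranks whose tracker finished."""
    finished = []
    for rank in range(world_size):
        tracker = ToyTracker(rank, main_process_only)
        end_training([tracker])
        if tracker.finished:
            finished.append(rank)
    return finished
```

In this sketch a tracker with `main_process_only = True` finishes only on rank 0 even though `end_training` runs everywhere, while one with `main_process_only = False` finishes on every rank.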

muellerzr · 2024-08-15T12:00:07Z

src/accelerate/accelerator.py

@@ -2678,11 +2678,10 @@ def log(self, values: dict, step: int | None = None, log_kwargs: dict | None = {
        for tracker in self.trackers:
            tracker.log(values, step=step, **log_kwargs.get(tracker.name, {}))

-    @on_main_process
    def end_training(self):


Note that eventually we'll get to a point where users will want to pass in their own process groups and initialize them, e.g. this is needed for hybrid shard with FSDP. So we may need to account for that here at some point. We'll see how it ends up looking.

destroy process group

db2a773

SunMarc requested a review from muellerzr August 14, 2024 11:28

rephrase

472fe6f

SunMarc added 2 commits August 14, 2024 16:12

style

a0a4d8b

fix on_main_process

2856b97

BenjaminBossan reviewed Aug 15, 2024

muellerzr approved these changes Aug 15, 2024

muellerzr reviewed Aug 15, 2024

muellerzr merged commit 589fddd into main Aug 15, 2024
28 checks passed

muellerzr deleted the destroy_process_group branch August 15, 2024 12:31
