
Multigpu: Fix gradient accumulation and learning rate aggregation #3583

Merged
3 commits merged into flairNLP:master on Dec 19, 2024

Conversation

jeffpicard
Contributor

@jeffpicard jeffpicard commented Dec 17, 2024

Hi! This PR fixes three bugs with multi-GPU training:

  • When using gradient accumulation (mini_batch_chunk_size), this PR disables gradient syncing for all forward/backward passes except the last one before the optimizer step. The intermediate syncs are unnecessary and add communication overhead that hurts efficiency. In my experience, this effect was significant enough to go from multi_gpu giving no speedup at all to an 80%-of-linear speedup (mini_batch_chunk_size ~= 4, sentence_length ~= 500, model=xlm-roberta-base, gpu_memory ~= 24g). This is similar to what's done here; see the first sketch after this list.
  • Sums, rather than averages, gradients from multiple processes. Flair tracks losses/gradients as the sum over the whole batch, but Torch DDP averages gradients when aggregating across processes. This made the effective learning rate 1 / n_gpus smaller when using multi_gpu=True; ideally it should stay the same so the two knobs can be tuned independently, and this PR makes that the case.
  • Fixes a deadlock in the checkpoint plugin. With multi_gpu=True and save_model_each_k_epochs > 0, training freezes: the plugin calls barrier, which waits for the other processes to catch up, but the plugin only runs on the main process, so the other processes never reach the barrier (see the second sketch after this list).
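
For illustration, here is a minimal sketch of how the first two fixes typically look in a DDP training step. The names (`training_step`, `chunks`, the loss call) are placeholders for this sketch, not Flair's actual trainer internals; only `no_sync()`, `get_world_size()`, and DDP's default gradient averaging are standard PyTorch behavior.

```python
import contextlib

import torch.distributed as dist


def training_step(ddp_model, optimizer, chunks):
    """One optimizer step over several accumulated mini-batch chunks."""
    optimizer.zero_grad()
    for i, chunk in enumerate(chunks):
        last_chunk = i == len(chunks) - 1
        # Sync gradients only on the final backward pass before the step;
        # the intermediate all-reduces are wasted communication.
        sync_ctx = contextlib.nullcontext() if last_chunk else ddp_model.no_sync()
        with sync_ctx:
            loss = ddp_model(chunk)  # placeholder: assume the model returns a loss
            loss.backward()
    # DDP averages gradients across processes; scaling by the world size
    # recovers the sum, so the effective learning rate does not shrink
    # with the number of GPUs.
    world_size = dist.get_world_size()
    for param in ddp_model.parameters():
        if param.grad is not None:
            param.grad.mul_(world_size)
    optimizer.step()
```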

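And a sketch of the deadlock described in the third bullet: dist.barrier() must be reached by every rank, so it has to sit outside any main-process guard. The function and argument names here are illustrative, not the plugin's real API.

```python
import torch.distributed as dist


def save_checkpoint(model, path, is_main_process):
    # Only the main process writes the checkpoint to disk...
    if is_main_process:
        model.save(path)
    # ...but every rank must reach the barrier; if it sits inside the
    # is_main_process guard, the main process waits forever for peers
    # that never call it (the reported freeze).
    if dist.is_initialized():
        dist.barrier()
```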
@jeffpicard jeffpicard changed the title Fix gradient accumulation and learning rate aggregation Multigpu: Fix gradient accumulation and learning rate aggregation Dec 18, 2024
@alanakbik
Collaborator

@jeffpicard thanks for improving this!

@alanakbik alanakbik merged commit 0becfed into flairNLP:master Dec 19, 2024
1 check passed