
Multigpu: Fix gradient accumulation and learning rate aggregation #3583

Merged
3 commits merged into flairNLP:master on Dec 19, 2024

Conversation

jeffpicard
Contributor

@jeffpicard jeffpicard commented Dec 17, 2024

Hi! This PR fixes three bugs with multi-GPU training:

  • When using gradient accumulation (mini_batch_chunk_size), this PR disables gradient syncing for all forward/backward passes except the last one before the optimizer step. The intermediate syncs are unnecessary and add communication overhead that hurts efficiency. In my experience, this effect was significant enough to go from multi_gpu giving no speedup at all to an 80%-of-linear speedup (mini_batch_chunk_size ~= 4, sentence_length ~= 500, model=xlm-roberta-base, gpu_memory ~= 24g). This is similar to what's done here; see the first sketch after this list.
  • Sums, rather than averages, gradients from multiple processes. Flair tracks losses/gradients as the sum over the whole batch, but Torch DDP averages gradients when aggregating across processes. This made the effective learning rate 1 / n_gpus smaller when using multi_gpu=True; ideally it should stay the same so the two knobs can be tuned independently, and this PR makes that the case.
  • Fixes a deadlock in the checkpoint plugin. With multi_gpu=True and save_model_each_k_epochs > 0, training freezes: the plugin calls barrier, which waits for the other processes to catch up, but the plugin only runs on the main process, so the other processes never reach the barrier (see the second sketch after this list).
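
For illustration, here is a minimal sketch of how the first two fixes typically look in a DDP training step. The names (`training_step`, `chunks`, the loss call) are placeholders for this sketch, not Flair's actual trainer internals; only `no_sync()`, `get_world_size()`, and DDP's default gradient averaging are standard PyTorch behavior.

```python
import contextlib

import torch.distributed as dist


def training_step(ddp_model, optimizer, chunks):
    """One optimizer step over several accumulated mini-batch chunks."""
    optimizer.zero_grad()
    for i, chunk in enumerate(chunks):
        last_chunk = i == len(chunks) - 1
        # Sync gradients only on the final backward pass before the step;
        # the intermediate all-reduces are wasted communication.
        sync_ctx = contextlib.nullcontext() if last_chunk else ddp_model.no_sync()
        with sync_ctx:
            loss = ddp_model(chunk)  # placeholder: assume the model returns a loss
            loss.backward()
    # DDP averages gradients across processes; scaling by the world size
    # recovers the sum, so the effective learning rate does not shrink
    # with the number of GPUs.
    world_size = dist.get_world_size()
    for param in ddp_model.parameters():
        if param.grad is not None:
            param.grad.mul_(world_size)
    optimizer.step()
```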

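And a sketch of the deadlock described in the third bullet: dist.barrier() must be reached by every rank, so it has to sit outside any main-process guard. The function and argument names here are illustrative, not the plugin's real API.

```python
import torch.distributed as dist


def save_checkpoint(model, path, is_main_process):
    # Only the main process writes the checkpoint to disk...
    if is_main_process:
        model.save(path)
    # ...but every rank must reach the barrier; if it sits inside the
    # is_main_process guard, the main process waits forever for peers
    # that never call it (the reported freeze).
    if dist.is_initialized():
        dist.barrier()
```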
@jeffpicard jeffpicard changed the title Fix gradient accumulation and learning rate aggregation Multigpu: Fix gradient accumulation and learning rate aggregation Dec 18, 2024
@alanakbik
Collaborator

@jeffpicard thanks for improving this!

@alanakbik alanakbik merged commit 0becfed into flairNLP:master Dec 19, 2024
1 check passed