[fix]: let FSDP handle model with multiple forward pass and checkpoint #621
Conversation
model.block1 = auto_wrap_bn(model.block1, single_rank_pg=False)
model.block2 = auto_wrap_bn(model.block2, single_rank_pg=False)
if with_checkpoint:
    model.block2 = checkpoint_wrapper(model.block2, maintain_forward_counter=True)
@QuentinDuval, once this PR is merged, you will need to use maintain_forward_counter=True in vissl whenever multiple forward passes and checkpointing are both used.
LGTM. Note that I don't have context on why pre_backward_hook_has_run used to be a list.
if param.grad.requires_grad:
-    raise RuntimeError("FullyShardedDataParallel only works with gradients that don't require grad")
+    raise RuntimeError("FullyShardedDataParallel only works with gradients that don't require gradients")
Is this a thing? Which gradients need gradients!?
Great question! I think it came from @myleott originally. I would love to know more too. :-)
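For context: a tensor stored in .grad can itself require grad when backward is run with create_graph=True (double backward, gradient penalties, etc.). A minimal PyTorch illustration, unrelated to this PR's code:

import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 3).sum()

# create_graph=True keeps the gradient computation in the autograd graph,
# so the resulting gradient tensor itself requires grad.
(grad_x,) = torch.autograd.grad(y, x, create_graph=True)
print(grad_x.requires_grad)  # True -- this is the case the RuntimeError guards against

# Such gradients exist to support second-order derivatives (grad of grad):
grad_x.sum().backward()
print(x.grad)  # equals 6 * x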
Thanks for the review @sshleifer! The list was the usual trick for giving a nested function write access to a flag from the enclosing scope. Say the flag were a plain boolean: an assignment to it inside the hook would just create a new local variable rather than update the outer one. With a one-element list, the assignment is instead a read of the outer list followed by an in-place item write, so the outer flag actually changes (see the sketch below).
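A standalone sketch of that closure behavior (hypothetical names, not the actual FSDP code):

def register_with_plain_bool():
    has_run = False

    def hook():
        has_run = True  # creates a new local variable; the outer flag is untouched

    hook()
    return has_run  # still False


def register_with_list():
    has_run = [False]

    def hook():
        has_run[0] = True  # reads the outer list, then mutates it in place

    hook()
    return has_run[0]  # True


print(register_with_plain_bool())  # False
print(register_with_list())        # True

nonlocal achieves the same thing in Python 3, but neither approach helps once the flag has to be shared across multiple calls to the registering function, which is why this PR moves it onto self.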
if not torch.is_grad_enabled():
    return outputs  # don't register hooks if grad isn't enabled

pre_backward_hook_has_run = [False]
Originally, the variable was local to the _register_pre_backward_hooks function, so it only deduplicated callbacks registered within that single call. When the function is called multiple times (once per forward pass), each call gets its own local flag, which cannot prevent duplicate callbacks across calls. Moving the flag to self solves this, since all calls then share a single flag.
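A simplified sketch of the change, with illustrative names and bodies (the real FSDP hook registration does considerably more):

import torch


class FSDPLikeModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Shared across every forward pass; reset when the backward pass finishes.
        self._pre_backward_hook_has_run = False

    def _register_pre_backward_hooks(self, outputs):
        # Before this PR, a local `pre_backward_hook_has_run = [False]` lived here,
        # so each call (one per forward pass) tracked its own flag independently.

        def _pre_backward_hook(*unused):
            if self._pre_backward_hook_has_run:
                return  # an output of *any* earlier forward pass already triggered it
            self._pre_backward_hook_has_run = True
            # ... all-gather full parameters, prepare gradient reduction, etc.

        for t in outputs:
            if t.requires_grad:
                t.register_hook(_pre_backward_hook)
        return outputs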
I have merged this for now @myleott. Definitely happy to address more comments separately.
This PR adds support for running a module multiple times within a single forward pass while also applying activation checkpointing to that module.
Right now, the tested cases are:
The next commit will test and enable FSDP(ckpt(), ..., ckpt()) type of cases.
The fix adds an option to checkpoint_wrapper so that it keeps a forward counter on the checkpointed modules; FSDP checks this counter to determine whether the backward callback fired from the module is the last one in the bigger, outer backward pass. The PR also adds a new API to override the pre/post gradient divide factors; it is currently used in the test to ensure a numerical match with DDP.
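A minimal sketch of the usage pattern this enables, modeled on the test snippet above. The import paths vary across fairscale versions, a torch.distributed process group must already be initialized, and the surrounding model is illustrative:

import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.checkpoint import checkpoint_wrapper  # path may differ by version


class TwoPassModel(nn.Module):
    def __init__(self, with_checkpoint: bool = True):
        super().__init__()
        block = nn.Linear(8, 8)
        if with_checkpoint:
            # maintain_forward_counter=True makes checkpoint_wrapper count forward
            # passes, so FSDP can tell which backward callback is the final one
            # in the outer backward pass.
            block = checkpoint_wrapper(block, maintain_forward_counter=True)
        self.block = block

    def forward(self, x):
        # The same checkpointed module runs twice in a single forward pass.
        return self.block(self.block(x))


# Assumes torch.distributed.init_process_group(...) has already been called.
model = FSDP(TwoPassModel())
out = model(torch.randn(2, 8).requires_grad_())
out.sum().backward()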
Before submitting
What does this PR do?
Fixes part of #617.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃