[FSDP] relax checking root condition #620
Conversation
cc @min-xu-ai, @SeanNaren
Nice. Thanks! Once CI passes I can merge if you don't see the merge button.
if not self._has_params:
    assert m._queue_wait_for_post_backward_closure is None
    m._queue_wait_for_post_backward_closure = self._queue_wait_for_post_backward
assert m._is_root is None or m._is_root == False
Suggested change:
- assert m._is_root is None or m._is_root == False
+ # We relax the assert for non-root instance. A lightning unit test triggers this otherwise.
+ assert m._is_root is None or m._is_root == False
Looks like CI passes.
if not self._has_params:
    assert m._queue_wait_for_post_backward_closure is None
    m._queue_wait_for_post_backward_closure = self._queue_wait_for_post_backward
# We relax the assert for non-root instance. A lightning unit test triggers this otherwise.
Not sure we need to mention Lightning here inside of fairscale. Eventually this comment will also be unclear about what the assert was relaxed from or why it was relaxed.
I don't want to leave it without a comment, since it might be hard to figure out in the future why this was relaxed. Maybe you can suggest something clearer?
@ananthsub makes a good point, apologies for being lazy on this one, here is a pure pytorch test we can use to simulate this, and remove the PL line:

import os
import unittest
from unittest import mock
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel
import torch.nn.functional as F


@mock.patch.dict(os.environ, {"MASTER_ADDR": "localhost", "MASTER_PORT": "1337"}, clear=True)
@unittest.skipIf(not torch.cuda.is_available(), "Test Requires CUDA")
def test_wrapping_module():
    """
    This test simulates wrapping the module after training to run inference.
    This is required in cases where later in a session, the model is wrapped again in FSDP but
    contains nested FSDP wrappers within the module.
    """
    device = torch.device("cuda")
    torch.cuda.set_device(0)
    torch.distributed.init_process_group(backend="nccl", rank=0, world_size=1)
    module = nn.Sequential(
        FullyShardedDataParallel(nn.Linear(5, 5)),
    )
    model = FullyShardedDataParallel(module).to(device)
    input = torch.rand((1, 5), dtype=torch.float).to(device)
    output = model(input)
    loss = F.mse_loss(input, output)
    loss.backward()
    model = FullyShardedDataParallel(module).to(device)
    second_output = model(input)
    assert torch.allclose(output, second_output)
    torch.distributed.destroy_process_group()

We can add this as a unit test to ensure this behaviour works!
@SeanNaren, this is lovely. I can add a test file once my bug 617 work is done. @shuyingsunshine21, you can add a new test file as well, but please use the other FSDP tests as examples. We can't use Sean's code above as-is, since we don't want a hard-coded TCP port, which may cause port conflicts when multiple people run the tests on the same machine. Also, a new test file needs to be added to one of the test list text files under the tests dir. It is totally fine to leave this to me if you can wait a bit.
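To make the port concern concrete, here is a minimal sketch of how a test could ask the OS for a free port instead of hard-coding one. The find_free_port helper is purely illustrative (it is not a fairscale API), and fairscale's own dist_init test utility may already handle this:

import socket
from contextlib import closing


def find_free_port() -> int:
    """Ask the OS for an unused TCP port so concurrent test runs do not collide."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(("localhost", 0))  # port 0 tells the kernel to pick any free port
        return s.getsockname()[1]


# Usage sketch: replace the hard-coded "1337" in the test above with a dynamic port, e.g.
# os.environ["MASTER_PORT"] = str(find_free_port())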
The CI test failure might be related to #624.
Yeah, sorry about that. It will be merged within the next hour.
tests/ci_test_list_3.txt (outdated)
@@ -5,6 +5,7 @@ tests/nn/data_parallel/test_fsdp_no_sync.py
tests/nn/data_parallel/test_fsdp_summon_full_params.py
tests/nn/data_parallel/test_fsdp_input.py
tests/nn/data_parallel/test_fsdp_multiple_forward.py
+tests/nn/data_parallel/test_fsdp_multiple_wrapping.py
Can you put the file in list_1.txt, since it is the shortest right now?
I was curious about where to put it and what the difference is (so I put it in a similar place as the rest of the FSDP tests).
No problem.
Please merge with master. I think Ben has fixed it already. My PR is going in soon too.
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.data_parallel import TrainingState
- from fairscale.utils.testing import dist_init, teardown, torch_version
+ from fairscale.utils.testing import dist_init, teardown, torch_version, skip_if_no_cuda
from torch.nn import Linear, Module, Sequential
Is this moved by black/isort? If not, CI will fail again. Our CI is pretty strict; it will take a bit of time to get used to, but it is really good once you do. :-)
Thanks, after running black I forgot to run isort.
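For reference, a rough sketch of how isort would group the imports from the hunk above; the exact ordering depends on fairscale's isort configuration, which is assumed here to treat torch as third-party and fairscale as first-party:

from torch.nn import Linear, Module, Sequential

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.data_parallel import TrainingState
from fairscale.utils.testing import dist_init, skip_if_no_cuda, teardown, torch_version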
Weird that it did not trigger CI, though.
Yeah, I have seen it today too. Perhaps a CI bug. I ended up making and pushing a new commit to trigger it.
All passed :)
Nice! Thanks again!
Before submitting
What does this PR do?
When integrating with Lightning, we found that the model remains nested in an FSDP wrapper after training, and when we then call
trainer.test(model)
it fails the assertion that the root is not yet set: in this case, the non-root instances have already been set. We relax this assertion in this PR. (Link to discussion: Lightning-AI/pytorch-lightning#6152)
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃