Test optimizer to device #20062

corwinjoy · 2024-07-09T03:12:24Z

What does this PR do?

Pursuant to #19955, this PR adds an extended test for _optimizer_to_device that explicitly tests moving the optimizer across devices.

Fixes #19955

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20062.org.readthedocs.build/en/20062/

janeyx99 · 2024-07-16T05:03:25Z

tests/tests_fabric/utilities/test_optimizer.py

+
+    # Try from_dict
+    # These all pretend that we have an appropriate prototype, I don't think we can actually do this since
+    # all we may have is a CPU pickle


The PyTorch test that checks load_state_dict can read a CPU checkpoint into an appropriate GPU optimizer is here: https://github.com/pytorch/pytorch/blob/main/test/test_optim.py#L1545-L1574

The code above is also how I expect checkpointing to happen, without the need of an explicit move to device.

I have updated this test to be more explicit about what is going on. Please take a look and see if it makes sense, since the test you linked doesn't look as thorough as far as I can tell.

Yeah, the case we test is moving from CPU to GPU, and I see you test more combinations.
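For readers following this thread, here is a minimal sketch of the checkpoint flow being described: an optimizer state dict saved on CPU is loaded into an optimizer whose parameters already live on the target device, with no explicit move of the optimizer state. It is not the test from this PR or from PyTorch. The helper name load_cpu_checkpoint_into and the toy Linear model are assumptions made for illustration, and the sketch falls back to CPU when no GPU is available.

```python
import copy

import torch


def load_cpu_checkpoint_into(device: str) -> torch.optim.Optimizer:
    """Hypothetical helper: build a CPU checkpoint, then load it on `device`."""
    torch.manual_seed(0)

    # Take one optimizer step on CPU so that Adam has state to save.
    cpu_model = torch.nn.Linear(4, 2)
    cpu_opt = torch.optim.Adam(cpu_model.parameters(), lr=1e-3)
    cpu_model(torch.randn(8, 4)).sum().backward()
    cpu_opt.step()
    checkpoint = copy.deepcopy(cpu_opt.state_dict())  # stands in for a CPU pickle

    # Build the optimizer over parameters that are already on the target device,
    # then load the CPU state dict into it.
    target_model = torch.nn.Linear(4, 2).to(device)
    target_opt = torch.optim.Adam(target_model.parameters(), lr=1e-3)
    target_opt.load_state_dict(checkpoint)
    return target_opt


device = "cuda" if torch.cuda.is_available() else "cpu"
optimizer = load_cpu_checkpoint_into(device)
for param, state in optimizer.state.items():
    # The moment estimates end up on the same device as their parameter.
    assert state["exp_avg"].device == param.device
    assert state["exp_avg_sq"].device == param.device
```

The sketch only checks the exp_avg and exp_avg_sq buffers. Where the step entry ends up can depend on optimizer options such as fused and capturable, so it is left unchecked here.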

janeyx99 · 2024-08-12T18:55:44Z

tests/tests_fabric/utilities/test_optimizer.py

+        # Use from_dict with cpu prototype, fused = True
+        opt_gpu_dict = optimizer_on_device[gpu_device + "_fused_True"].state_dict()
+        cpu_prototype = copy.deepcopy(optimizer_on_device["cpu"])
+        cpu_prototype.load_state_dict(opt_gpu_dict)  # This should give an error / refuse to allow fused = True


FYI, for older versions of torch this should indeed not be allowed and would break. Since torch 2.4, however, there are fused CPU implementations of Adam(W), SGD, and Adagrad, so fused=True on CPU is valid for these optimizers.
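One way a test could account for this version difference is to probe at runtime whether a fused Adam can be built and stepped on CPU parameters, rather than assuming an error. The sketch below is an illustration only. The function name cpu_fused_adam_supported is hypothetical, and the set of exceptions caught is an assumption about how older torch releases reject the option.

```python
import torch


def cpu_fused_adam_supported() -> bool:
    """Hypothetical probe: can Adam(fused=True) be built and stepped on CPU?"""
    param = torch.nn.Parameter(torch.zeros(3))
    try:
        optimizer = torch.optim.Adam([param], lr=1e-3, fused=True)
        param.grad = torch.ones_like(param)
        optimizer.step()
    except (RuntimeError, TypeError, ValueError):
        # Assumed failure modes on releases without a fused CPU implementation.
        return False
    return True


print(f"torch {torch.__version__}: fused CPU Adam supported = {cpu_fused_adam_supported()}")
```

A test could use the result to decide whether loading a fused=True state dict into a CPU optimizer is expected to raise or to succeed.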

corwinjoy added 2 commits July 8, 2024 18:55

Add tests showing GPU to CPU copies

17bded8

Merge branch 'Lightning-AI:master' into test_optimizer_to_device

cdd1556

github-actions bot added the fabric lightning.fabric.Fabric label Jul 9, 2024

corwinjoy mentioned this pull request Jul 9, 2024

Adam optimizer is slower after loading model from checkpoint #19955

Closed

Add explicit test for loading checkpoint and running on new device.

b700bc1

github-actions bot added the pl Generic label for PyTorch Lightning package label Jul 16, 2024

janeyx99 reviewed Jul 16, 2024


Add further checkpoint tests and replace _optimizer_to_device with pass

34c5d97

corwinjoy mentioned this pull request Aug 5, 2024

Remove the optimizer_to_device logic if possible #20165

Open

janeyx99 reviewed Aug 12, 2024

