Fix dynamo issue #6527

Open · wants to merge 4 commits into base: master
Conversation

@oraluben (Contributor) commented Sep 12, 2024

Dynamo uses FakeTensor to trace tensor ops. In some cases, this mechanism breaks compilation with DeepSpeed.

An example can be found at https://gist.github.com/oraluben/9b8240c2fe482eb4382453d6c97a5f76; to reproduce the issues, install deepspeed==0.14.4 instead of my fork.

Without this PR, Llama cannot be compiled.

Detailed explanation:

  1. ZeROOrderedDict
    Dynamo uses deepcopy to copy tensors, which calls object.__reduce__. When copying a ZeROOrderedDict, the default implementation does not copy its _parent_module, which leads to a failure (see the sketch after this list).
  2. A param may be a FakeTensor that does not have ds_status yet, but during tracing it is fine to simply skip register_external_parameter; it should already have been done well before.
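
A minimal sketch of the __reduce__ override described in point 1, assuming ZeROOrderedDict keeps its owning module in _parent_module as described above; the exact diff in this PR may differ:

# Sketch only, not the literal DeepSpeed patch: make deepcopy preserve the
# parent-module reference that the default OrderedDict reduce machinery drops.
from collections import OrderedDict


class ZeROOrderedDict(OrderedDict):

    def __init__(self, parent_module=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._parent_module = parent_module

    def __reduce__(self):
        # The default reduce tuple reconstructs the dict with empty
        # constructor args, so the copy Dynamo makes via deepcopy ends up
        # without _parent_module. Rebuild the tuple so the copy is created
        # with the same parent module.
        r0, _, *rest = super().__reduce__()
        return (r0, (self._parent_module,), *rest)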

@oraluben oraluben marked this pull request as ready for review September 12, 2024 06:17
@oraluben oraluben changed the title Fix dynamo issue in llama Fix dynamo issue Sep 12, 2024
@tohtana (Contributor) left a comment

@oraluben Thank you for the great investigation! I think this is a clean and simple solution to the issue.

@oraluben (Contributor, Author) commented Sep 13, 2024

torch.compiler.is_compiling() would be better for this case; however, there is still an issue, presumably on the Dynamo side (since we have a FakeTensor, we are definitely tracing), so keep it for now.

[rank1]:   File "/home/yyc/accelerate-demo/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1720, in __getattr__
[rank1]:     return _parameters[name]
[rank1]:   File "/home/yyc/repo/DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 67, in __getitem__
[rank1]:     if not is_compiling() and param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
[rank1]: torch._dynamo.exc.TorchRuntimeError: Failed running call_module L__self___self_attn_q_proj(*(FakeTensor(..., device='cuda:1', size=(1, s0, 4096), dtype=torch.float16,
[rank1]:            grad_fn=<MulBackward0>),), **{}):
[rank1]: 'FakeTensor' object has no attribute 'ds_status'

My patch in deepspeed.runtime:

diff --git a/deepspeed/runtime/compiler.py b/deepspeed/runtime/compiler.py
index 879c0a1a..3994c1f5 100644
--- a/deepspeed/runtime/compiler.py
+++ b/deepspeed/runtime/compiler.py
@@ -10,6 +10,15 @@ def is_compile_supported():
     return hasattr(torch, "compiler") and hasattr(torch.nn.Module, "compile")
 
 
+def is_compiling():
+    if not is_compile_supported():
+        return False
+    elif hasattr(torch.compiler, 'is_compiling'):  # torch >= 2.3
+        return torch.compiler.is_compiling()
+    else:
+        return torch._dynamo.is_compiling()
+
+
 def disable(func):
     if is_compile_supported():
         return torch.compiler.disable(func)
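
For context, a hedged sketch of how this helper is used at the call site shown in the traceback above (ZeROOrderedDict.__getitem__ in deepspeed/runtime/zero/parameter_offload.py); simplified, import paths approximate, not the literal DeepSpeed source:

# Simplified sketch of the guarded lookup from the traceback above.
from collections import OrderedDict

from deepspeed.runtime.compiler import is_compiling
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus


class ZeROOrderedDict(OrderedDict):

    def __getitem__(self, key):
        param = super().__getitem__(key)
        # Under Dynamo tracing, param can be a FakeTensor that has no
        # ds_status attribute, so skip the ZeRO on-demand fetch while
        # compiling; parameter registration has already happened in eager
        # mode well before tracing (point 2 above).
        if not is_compiling() and param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
            ...  # trigger the usual fetch / register_external_parameter path
        return param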
