
Make op builder detection adapt to accelerator change #5206

Merged — 60 commits merged into microsoft:master on Mar 12, 2024

Conversation

@delock (Collaborator) commented Feb 28, 2024

This is a WIP PR that makes op builder detection adapt to accelerator changes. It is a follow-up of #5173.
Currently, DeepSpeed generates installed_ops and compatible_ops at setup time. If the system changes to a different accelerator at DeepSpeed launch time, these two lists would contain incorrect information.

This PR intends to solve this problem with more flexible op detection.

  • For installed_ops, DeepSpeed should disable all installed ops if the accelerator detected at setup time is different from the one detected at launch time.
  • For compatible_ops, DeepSpeed should refresh the list at each launch to avoid the impact of an accelerator change.

As a first step, the nv-inference workflow is temporarily changed to emulate the scenario where the system is set up with CPU_Accelerator, then launched with CUDA_Accelerator. CPU_Accelerator is also modified so that Intel Extension for PyTorch and oneCCL Binding for PyTorch are not mandatory.

Starting from here we can reconstruct installed_ops and compatible_ops to follow the design above.
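
For illustration, a minimal sketch of this design (the helper names, the dict shapes, and accelerator_name_at_setup are assumptions for the sketch; the real logic lives in DeepSpeed's op_builder machinery and may differ):

    from deepspeed.accelerator import get_accelerator

    def resolve_installed_ops(installed_ops: dict, accelerator_name_at_setup: str) -> dict:
        # If the accelerator detected at launch differs from the one detected at
        # setup, the prebuilt ops no longer match the hardware, so disable them all.
        if get_accelerator().device_name() != accelerator_name_at_setup:
            return {op_name: False for op_name in installed_ops}
        return installed_ops

    def refresh_compatible_ops(op_builders: dict) -> dict:
        # compatible_ops is re-evaluated at every launch instead of being frozen
        # at setup time, so an accelerator change cannot leave it stale.
        return {name: builder().is_compatible() for name, builder in op_builders.items()}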

@delock (Collaborator, Author) commented Feb 28, 2024

@tjruwase @mrwyattii @loadams FYI

Note this PR also comes with an updated CPU_Accelerator that can work without Intel Extension for PyTorch and oneCCL Binding for PyTorch. It is designed to adapt to whether these two packages are installed. Hopefully this can help us keep a single CPU_Accelerator that serves both the basic and the optimized user scenarios.
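
For context, a hypothetical sketch of how an accelerator can adapt to optional packages, importing them when present and falling back to stock PyTorch otherwise (module names are the upstream package names; the actual CPU_Accelerator code may differ):

    try:
        import intel_extension_for_pytorch as ipex  # optional optimized kernels
        HAS_IPEX = True
    except ImportError:
        ipex = None
        HAS_IPEX = False

    try:
        import oneccl_bindings_for_pytorch  # optional oneCCL communication backend
        HAS_ONECCL = True
    except ImportError:
        HAS_ONECCL = False

    # Later code checks HAS_IPEX / HAS_ONECCL before taking the optimized path.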

tjruwase requested review from umchand and removed the review request for arashb and awan-10 — Feb 28, 2024 20:41
@delock (Collaborator, Author) commented Feb 29, 2024

@tjruwase installed_ops and compatible_ops are now reconstructed as described in this PR. I also removed the Intel Extension for PyTorch and oneCCL Binding for PyTorch installation from the CPU inference workflow to test whether CPU_Accelerator can work with stock PyTorch. Can you help start the workflow? Thanks!

@delock (Collaborator, Author) commented Feb 29, 2024

Hi @tjruwase, the failures in the cpu-inference and nv-inference workflows are due to a missing package. I have fixed them in the latest commit. Can you help start the workflow? Thanks!

@delock (Collaborator, Author) commented Mar 1, 2024

@tjruwase the missing dependency in the nv-inference workflow has been added. Can you help restart the workflow? Thanks!

@loadams (Contributor) commented Mar 8, 2024

The reason for the cpu-torch-latest failures is that most UTs assume the fp16 data type, which the CPU accelerator does not support. I changed some of the tests to use the bf16 data type if the accelerator supports bf16 but not fp16. It would take a while to make all the UTs pass.

@delock, thanks for helping with this. The root cause is the following common construct in the UT:

    "fp16": {"enabled": True}

Rather than skipping these tests, I think the best solution is to modify this to

    "fp16": { "enabled": get_accelerator().is_fp16_supported()}

What do you think?
@umchand, can you please try this idea in your environment? Thanks!
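
For illustration, a minimal sketch of a test config built around this idea, with a bf16 fallback as discussed above (make_test_config is a hypothetical helper, and is_bf16_supported() is assumed to exist on the accelerator interface alongside is_fp16_supported()):

    from deepspeed.accelerator import get_accelerator

    def make_test_config(base_config: dict) -> dict:
        config = dict(base_config)
        accel = get_accelerator()
        # Enable fp16 only where the accelerator supports it; otherwise fall back
        # to bf16 so the test still exercises reduced precision where possible.
        if accel.is_fp16_supported():
            config["fp16"] = {"enabled": True}
        elif accel.is_bf16_supported():
            config["bf16"] = {"enabled": True}
        return config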

FYI, I tried the change suggested by @tjruwase and did see the 4 tests that were failing pass.

Thanks @umchand, @tjruwase! I have changed more UTs; can you help start the workflow?

Running now

@delock (Collaborator, Author) commented Mar 8, 2024

@loadams the error in nv-inference is quite strange; preferred_dtype is a variable defined in unit/common.py.

Maybe defining it as a variable is not good practice, so I changed it to a function instead. Can you help restart the workflow? Thanks!
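
For reference, a minimal sketch of what such a function could look like (the exact logic in unit/common.py may differ):

    import torch
    from deepspeed.accelerator import get_accelerator

    def preferred_dtype() -> torch.dtype:
        # Choose a reduced-precision dtype supported by the current accelerator,
        # falling back to fp32 when neither fp16 nor bf16 is available.
        if get_accelerator().is_fp16_supported():
            return torch.float16
        elif get_accelerator().is_bf16_supported():
            return torch.bfloat16
        return torch.float32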

@delock (Collaborator, Author) commented Mar 9, 2024

@loadams The remaining 4 test failures are due to the pdsh command not being found. pdsh has not been validated for the CPU accelerator, where the impi launcher is recommended for multi-node runs. Nevertheless, I installed pdsh in this workflow to see what happens. Can you help restart the workflow? Thanks!

@delock (Collaborator, Author) commented Mar 10, 2024

Hi @tjruwase @loadams, from the error message, this machine has not been configured as a multi-node runner. It needs host key verification set up (ssh with authorized keys); otherwise we have to skip this test for the CPU accelerator:

    def test_user_args(cmd):
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = p.communicate()
>       assert "ARG PARSE SUCCESS" in out.decode("utf-8"), f"User args not parsed correctly: {err.decode('utf-8')}"
E       AssertionError: User args not parsed correctly: localhost: Host key verification failed.

E         pdsh@fv-az843-752: localhost: ssh exited with exit code 255
E         /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
E           warnings.warn(
E         

@delock (Collaborator, Author) commented Mar 11, 2024

@tjruwase there are also some UTs skipped with the CPU accelerator. Each one may need further investigation; we will need some help looking at the real reason for each failure and fixing them one by one. Should we put this work into separate PRs and keep these skips for the CPU accelerator?

@tjruwase (Contributor) commented

@tjruwase there are also some UTs skipped with the CPU accelerator. Each one may need further investigation; we will need some help looking at the real reason for each failure and fixing them one by one. Should we put this work into separate PRs and keep these skips for the CPU accelerator?

Yes, it is okay to put that work into a separate PR.

@delock (Collaborator, Author) commented Mar 11, 2024

Hi @tjruwase, the latest commit should fix the nv-torch-latest failures.
checkpoint_correctness_verification will need some further change to be cleaner: remove the fp16 parameter and use a dtype parameter instead.
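
For illustration, a minimal, hypothetical sketch of the dtype-parameter style (the function below is not the real helper in the DeepSpeed test suite; it only shows why an explicit dtype is cleaner than a boolean fp16 flag):

    import torch

    def verify_state_dtype(state_dict: dict, dtype: torch.dtype = torch.float16) -> bool:
        # With an explicit dtype, the same check covers fp16, bf16 and fp32 runs,
        # whereas an fp16 boolean can only distinguish two cases.
        return all(t.dtype == dtype for t in state_dict.values() if torch.is_tensor(t))

    # Example usage:
    sd = {"weight": torch.randn(4, 4, dtype=torch.bfloat16)}
    assert verify_state_dtype(sd, dtype=torch.bfloat16)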

@loadams (Contributor) commented Mar 11, 2024

Hi @tjruwase @loadams, from the error message, this machine has not been configured as a multi-node runner. It needs host key verification set up (ssh with authorized keys); otherwise we have to skip this test for the CPU accelerator:

    def test_user_args(cmd):
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = p.communicate()
>       assert "ARG PARSE SUCCESS" in out.decode("utf-8"), f"User args not parsed correctly: {err.decode('utf-8')}"
E       AssertionError: User args not parsed correctly: localhost: Host key verification failed.

E         pdsh@fv-az843-752: localhost: ssh exited with exit code 255
E         /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
E           warnings.warn(
E         

@mrwyattii - what are your thoughts here? I think we should probably not install pdsh here and not run those multi-node tests, since we don't run them today.

@delock (Collaborator, Author) commented Mar 12, 2024

A short summary of the tests skipped when device_name is "cpu": 23 skips in 12 files in total. We could create a separate PR, remove these skips, and aim to eliminate all the errors by either fixing them in DeepSpeed or skipping with proper accelerator feature detection (see the sketch after this list).

checkpoint/test_lr_scheduler.py:
26:        if get_accelerator().device_name() == 'cpu':
78:        if get_accelerator().device_name() == 'cpu':
checkpoint/test_zero_optimizer.py:
156:        if zero_stage == 0 and get_accelerator().device_name() == "cpu":
331:        if zero_stage == 0 and get_accelerator().device_name() == "cpu":
checkpoint/test_other_optimizer.py:
75:        if get_accelerator().device_name() == "cpu":
common.py:
88:            assert get_accelerator().device_name() == 'cpu'
launcher/test_user_args.py:
47:    if multi_node and get_accelerator().device_name() == "cpu":
runtime/activation_checkpointing/test_activation_checkpointing.py:
65:    if get_accelerator().device_name() == "cpu":
87:    if get_accelerator().device_name() == "cpu":
runtime/zero/test_zero.py:
94:        if mics_enabled and get_accelerator().device_name() == "cpu":
1320:        if get_accelerator().device_name() == "cpu":
runtime/zero/test_zero_tensor_fragment.py:
147:        if get_accelerator().device_name() == "cpu":
runtime/compile/test_compile_wrapper.py:
75:        if get_accelerator().device_name() == "cpu":
runtime/compile/test_compile_zero.py:
33:        if get_accelerator().device_name() == "cpu":
runtime/compile/test_load_config.py:
77:        if get_accelerator().device_name() == "cpu":
85:        if get_accelerator().device_name() == "cpu":
96:        if get_accelerator().device_name() == "cpu":
104:        if get_accelerator().device_name() == "cpu":
113:        if get_accelerator().device_name() == "cpu":
122:        if get_accelerator().device_name() == "cpu":
runtime/test_ds_config_dict.py:
251:        if get_accelerator().device_name() == "cpu":
runtime/test_data_efficiency.py:
57:        if get_accelerator().device_name() == "cpu":
133:        if get_accelerator().device_name() == "cpu":
178:        if get_accelerator().device_name() == "cpu":
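
For illustration, a minimal sketch of the two skip styles discussed here: the current device-name check and a hypothetical feature-based check that would survive an accelerator change (the test body and reason strings are illustrative):

    import pytest
    from deepspeed.accelerator import get_accelerator

    def test_fp16_training_example():
        # Current pattern: skip purely on the device name.
        if get_accelerator().device_name() == "cpu":
            pytest.skip("not supported on the CPU accelerator")

        # Feature-detection pattern: skip on the actual missing capability.
        if not get_accelerator().is_fp16_supported():
            pytest.skip("fp16 is not supported by this accelerator")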

tjruwase added this pull request to the merge queue on Mar 12, 2024
Merged via the queue into microsoft:master with commit c08e69f Mar 12, 2024
15 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024