
Make op builder detection adapt to accelerator change #5206

Merged — 60 commits merged into microsoft:master on Mar 12, 2024

Conversation

@delock (Collaborator) commented Feb 28, 2024

This is a WIP PR that makes op builder detection adapt to accelerator changes. It is a follow-up of #5173.
Currently, DeepSpeed generates installed_ops and compatible_ops at setup time. If the system changes to a different accelerator at DeepSpeed launch time, these two lists would contain incorrect information.

This PR intends to solve this problem with more flexible op detection.

  • For installed_ops, DeepSpeed should disable all installed ops if the accelerator detected at setup time is different from the one detected at launch time.
  • For compatible_ops, DeepSpeed should refresh the list at each launch to avoid the impact of an accelerator change.

As a first step, the nv-inference workflow is temporarily changed to emulate the scenario where the system is set up with CPU_Accelerator, then launched with CUDA_Accelerator. CPU_Accelerator is also modified so that Intel Extension for PyTorch and oneCCL Binding for PyTorch are not mandatory.

Starting from here we can reconstruct installed_ops and compatible_ops to follow the design above.
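
For illustration, a minimal sketch of this design (the helper names, the dict shapes, and accelerator_name_at_setup are assumptions for the sketch; the real logic lives in DeepSpeed's op_builder machinery and may differ):

    from deepspeed.accelerator import get_accelerator

    def resolve_installed_ops(installed_ops: dict, accelerator_name_at_setup: str) -> dict:
        # If the accelerator detected at launch differs from the one detected at
        # setup, the prebuilt ops no longer match the hardware, so disable them all.
        if get_accelerator().device_name() != accelerator_name_at_setup:
            return {op_name: False for op_name in installed_ops}
        return installed_ops

    def refresh_compatible_ops(op_builders: dict) -> dict:
        # compatible_ops is re-evaluated at every launch instead of being frozen
        # at setup time, so an accelerator change cannot leave it stale.
        return {name: builder().is_compatible() for name, builder in op_builders.items()}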

@delock (Collaborator, Author) commented Feb 28, 2024

@tjruwase @mrwyattii @loadams FYI

Note this PR also comes with an updated CPU_Accelerator that can work without Intel Extension for PyTorch and oneCCL Binding for PyTorch. It is designed to adapt to whether these two packages are installed. Hopefully this can help us keep a single CPU_Accelerator that serves both the basic and the optimized user scenarios.
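
For context, a hypothetical sketch of how an accelerator can adapt to optional packages, importing them when present and falling back to stock PyTorch otherwise (module names are the upstream package names; the actual CPU_Accelerator code may differ):

    try:
        import intel_extension_for_pytorch as ipex  # optional optimized kernels
        HAS_IPEX = True
    except ImportError:
        ipex = None
        HAS_IPEX = False

    try:
        import oneccl_bindings_for_pytorch  # optional oneCCL communication backend
        HAS_ONECCL = True
    except ImportError:
        HAS_ONECCL = False

    # Later code checks HAS_IPEX / HAS_ONECCL before taking the optimized path.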

tjruwase requested review from umchand and removed the review request for arashb and awan-10 — Feb 28, 2024 20:41
@delock (Collaborator, Author) commented Feb 29, 2024

@tjruwase installed_ops and compatible_ops are now reconstructed as described in this PR. I also removed the Intel Extension for PyTorch and oneCCL Binding for PyTorch installation from the CPU inference workflow to test whether CPU_Accelerator can work with stock PyTorch. Can you help start the workflow? Thanks!

@delock (Collaborator, Author) commented Feb 29, 2024

Hi @tjruwase, the failures in the cpu-inference and nv-inference workflows are due to a missing package. I have fixed them in the latest commit. Can you help start the workflow? Thanks!

@delock (Collaborator, Author) commented Mar 1, 2024

@tjruwase the missing dependency in the nv-inference workflow has been added. Can you help restart the workflow? Thanks!

@loadams (Contributor) commented Mar 8, 2024

The reason for the cpu-torch-latest failures is that most UTs assume the fp16 data type, which the CPU accelerator does not support. I changed some of the tests to use the bf16 data type if the accelerator supports bf16 but not fp16. It would take a while to make all the UTs pass.

@delock, thanks for helping with this. The root cause is the following common construct in the UT:

    "fp16": {"enabled": True}

Rather than skipping these tests, I think the best solution is to modify this to

    "fp16": { "enabled": get_accelerator().is_fp16_supported()}

What do you think?
@umchand, can you please try this idea in your environment? Thanks!
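
For illustration, a minimal sketch of a test config built around this idea, with a bf16 fallback as discussed above (make_test_config is a hypothetical helper, and is_bf16_supported() is assumed to exist on the accelerator interface alongside is_fp16_supported()):

    from deepspeed.accelerator import get_accelerator

    def make_test_config(base_config: dict) -> dict:
        config = dict(base_config)
        accel = get_accelerator()
        # Enable fp16 only where the accelerator supports it; otherwise fall back
        # to bf16 so the test still exercises reduced precision where possible.
        if accel.is_fp16_supported():
            config["fp16"] = {"enabled": True}
        elif accel.is_bf16_supported():
            config["bf16"] = {"enabled": True}
        return config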

FYI, I tried the change suggested by @tjruwase and did see the 4 tests that were failing pass.

Thanks @umchand, @tjruwase! I have changed more UTs; can you help start the workflow?

Running now

@delock (Collaborator, Author) commented Mar 8, 2024

@loadams the error in nv-inference is quite strange; preferred_dtype is a variable defined in unit/common.py.

Maybe defining it as a variable is not good practice, so I changed it to a function instead. Can you help restart the workflow? Thanks!
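
For reference, a minimal sketch of what such a function could look like (the exact logic in unit/common.py may differ):

    import torch
    from deepspeed.accelerator import get_accelerator

    def preferred_dtype() -> torch.dtype:
        # Choose a reduced-precision dtype supported by the current accelerator,
        # falling back to fp32 when neither fp16 nor bf16 is available.
        if get_accelerator().is_fp16_supported():
            return torch.float16
        elif get_accelerator().is_bf16_supported():
            return torch.bfloat16
        return torch.float32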

@delock (Collaborator, Author) commented Mar 9, 2024

@loadams The remaining 4 test failures are due to the pdsh command not being found. pdsh has not been validated for the CPU accelerator, where the impi launcher is recommended for multi-node runs. Nevertheless, I installed pdsh in this workflow to see what happens. Can you help restart the workflow? Thanks!

@delock (Collaborator, Author) commented Mar 10, 2024

Hi @tjruwase @loadams, from the error message, this machine has not been configured as a multi-node runner. It needs host key verification set up (ssh with authorized keys); otherwise we have to skip this test for the CPU accelerator:

    def test_user_args(cmd):
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = p.communicate()
>       assert "ARG PARSE SUCCESS" in out.decode("utf-8"), f"User args not parsed correctly: {err.decode('utf-8')}"
E       AssertionError: User args not parsed correctly: localhost: Host key verification failed.

E         pdsh@fv-az843-752: localhost: ssh exited with exit code 255
E         /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
E           warnings.warn(
E         

@delock (Collaborator, Author) commented Mar 11, 2024

@tjruwase there are also some UTs skipped with the CPU accelerator. Each one may need further investigation; we will need some help looking at the real reason for each failure and fixing them one by one. Should we put this work into separate PRs and keep these skips for the CPU accelerator?

@tjruwase (Contributor) commented

@tjruwase there are also some UTs skipped with the CPU accelerator. Each one may need further investigation; we will need some help looking at the real reason for each failure and fixing them one by one. Should we put this work into separate PRs and keep these skips for the CPU accelerator?

Yes, it is okay to put that work into a separate PR.

@delock (Collaborator, Author) commented Mar 11, 2024

Hi @tjruwase, the latest commit should fix the nv-torch-latest failures.
checkpoint_correctness_verification will need some further change to be cleaner: remove the fp16 parameter and use a dtype parameter instead.
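
For illustration, a minimal, hypothetical sketch of the dtype-parameter style (the function below is not the real helper in the DeepSpeed test suite; it only shows why an explicit dtype is cleaner than a boolean fp16 flag):

    import torch

    def verify_state_dtype(state_dict: dict, dtype: torch.dtype = torch.float16) -> bool:
        # With an explicit dtype, the same check covers fp16, bf16 and fp32 runs,
        # whereas an fp16 boolean can only distinguish two cases.
        return all(t.dtype == dtype for t in state_dict.values() if torch.is_tensor(t))

    # Example usage:
    sd = {"weight": torch.randn(4, 4, dtype=torch.bfloat16)}
    assert verify_state_dtype(sd, dtype=torch.bfloat16)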

@loadams (Contributor) commented Mar 11, 2024

Hi @tjruwase @loadams, from the error message, this machine has not been configured as a multi-node runner. It needs host key verification set up (ssh with authorized keys); otherwise we have to skip this test for the CPU accelerator:

    def test_user_args(cmd):
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = p.communicate()
>       assert "ARG PARSE SUCCESS" in out.decode("utf-8"), f"User args not parsed correctly: {err.decode('utf-8')}"
E       AssertionError: User args not parsed correctly: localhost: Host key verification failed.

E         pdsh@fv-az843-752: localhost: ssh exited with exit code 255
E         /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
E           warnings.warn(
E         

@mrwyattii - what are your thoughts here? I think we should probably not install pdsh here and not run those multi-node tests, since we don't run them today.

@delock (Collaborator, Author) commented Mar 12, 2024

A short summary of the tests skipped when device_name is "cpu": 23 skips in 12 files in total. We could create a separate PR, remove these skips, and aim to eliminate all the errors by either fixing them in DeepSpeed or skipping with proper accelerator feature detection (see the sketch after this list).

checkpoint/test_lr_scheduler.py:
26:        if get_accelerator().device_name() == 'cpu':
78:        if get_accelerator().device_name() == 'cpu':
checkpoint/test_zero_optimizer.py:
156:        if zero_stage == 0 and get_accelerator().device_name() == "cpu":
331:        if zero_stage == 0 and get_accelerator().device_name() == "cpu":
checkpoint/test_other_optimizer.py:
75:        if get_accelerator().device_name() == "cpu":
common.py:
88:            assert get_accelerator().device_name() == 'cpu'
launcher/test_user_args.py:
47:    if multi_node and get_accelerator().device_name() == "cpu":
runtime/activation_checkpointing/test_activation_checkpointing.py:
65:    if get_accelerator().device_name() == "cpu":
87:    if get_accelerator().device_name() == "cpu":
runtime/zero/test_zero.py:
94:        if mics_enabled and get_accelerator().device_name() == "cpu":
1320:        if get_accelerator().device_name() == "cpu":
runtime/zero/test_zero_tensor_fragment.py:
147:        if get_accelerator().device_name() == "cpu":
runtime/compile/test_compile_wrapper.py:
75:        if get_accelerator().device_name() == "cpu":
runtime/compile/test_compile_zero.py:
33:        if get_accelerator().device_name() == "cpu":
runtime/compile/test_load_config.py:
77:        if get_accelerator().device_name() == "cpu":
85:        if get_accelerator().device_name() == "cpu":
96:        if get_accelerator().device_name() == "cpu":
104:        if get_accelerator().device_name() == "cpu":
113:        if get_accelerator().device_name() == "cpu":
122:        if get_accelerator().device_name() == "cpu":
runtime/test_ds_config_dict.py:
251:        if get_accelerator().device_name() == "cpu":
runtime/test_data_efficiency.py:
57:        if get_accelerator().device_name() == "cpu":
133:        if get_accelerator().device_name() == "cpu":
178:        if get_accelerator().device_name() == "cpu":
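
For illustration, a minimal sketch of the two skip styles discussed here: the current device-name check and a hypothetical feature-based check that would survive an accelerator change (the test body and reason strings are illustrative):

    import pytest
    from deepspeed.accelerator import get_accelerator

    def test_fp16_training_example():
        # Current pattern: skip purely on the device name.
        if get_accelerator().device_name() == "cpu":
            pytest.skip("not supported on the CPU accelerator")

        # Feature-detection pattern: skip on the actual missing capability.
        if not get_accelerator().is_fp16_supported():
            pytest.skip("fp16 is not supported by this accelerator")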

tjruwase added this pull request to the merge queue on Mar 12, 2024
Merged via the queue into microsoft:master with commit c08e69f Mar 12, 2024
15 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024