Make op builder detection adapt to accelerator change #5206
Conversation
@tjruwase @mrwyattii @loadams FYI: note that this PR also comes with an updated CPU_Accelerator that can work without Intel Extension for PyTorch and oneCCL Binding for PyTorch. It is designed to adapt to whether these two packages are installed. Hopefully this lets us keep a single CPU_Accelerator that serves both the basic and the optimized user scenario.
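For readers unfamiliar with the pattern, a minimal sketch of how such optional dependencies can be handled (illustrative names, not the actual CPU_Accelerator code):

```python
try:
    import intel_extension_for_pytorch  # noqa: F401  optional optimized kernels
    _HAS_IPEX = True
except ImportError:
    _HAS_IPEX = False

try:
    import oneccl_bindings_for_pytorch  # noqa: F401  optional oneCCL comm backend
    _HAS_ONECCL = True
except ImportError:
    _HAS_ONECCL = False


def communication_backend_name() -> str:
    # Fall back to the stock gloo backend when the oneCCL bindings are absent.
    return "ccl" if _HAS_ONECCL else "gloo"
```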
@tjruwase
Hi @tjruwase, the failures in the cpu-inference and nv-inference workflows are due to a missing package. I have fixed them in the latest CI. Can you help start the workflow? Thanks!
@tjruwase the missing dependency in the nv-inference workflow has been added. Can you help restart the workflow? Thanks!
Running now
@loadams the error in nv-inference is quite strange. Maybe defining it as a variable is not good practice, so I changed it to a function instead. Can you help restart the workflow? Thanks!
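For context, the variable-to-function change described here would look roughly like the sketch below; the specific attribute is illustrative, since the actual value involved is not shown in this thread:

```python
from deepspeed.accelerator import get_accelerator

# Before (problematic): evaluated once at import time, so the value is frozen
# to whichever accelerator happened to be active when the module was imported.
# SUPPORTED_DTYPES = get_accelerator().supported_dtypes()

# After: a function defers the lookup to call time, so it always reflects the
# accelerator selected at launch.
def supported_dtypes():
    return get_accelerator().supported_dtypes()
```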
@loadams The remaining 4 test failures are due to the pdsh command not being found. pdsh has not been validated for the CPU accelerator, where the impi launcher is recommended for multi-node runs. Nevertheless, I installed pdsh in this workflow to see what happens. Can you help restart the workflow? Thanks!
Hi @tjruwase, there are also some UTs skipped with the CPU accelerator. Each one may need further investigation, and I will need some help looking at the real reason for each failure and fixing them one by one. Should we put this work into separate PRs and keep these skips with the CPU accelerator for now?
Yes, it is okay to put that work into a separate PR. |
Hi @tjruwase, the latest PR should fix the nv-torch-latest failures.
@mrwyattii - what are your thoughts here? I think we should probably not install pdsh here and not run those multi-node tests, since we don't run them today.
A short summary of the tests skipped when device_name is "cpu": 23 skips in 12 files in total. We could create a separate PR, remove these skips, and aim to eliminate all the errors by either fixing them in DeepSpeed or skipping them with proper accelerator feature detection, as sketched below.
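As a concrete illustration of "skip with proper accelerator feature detection", a skip could key off an accelerator capability query rather than a hard-coded device name; the bf16 check here is only an example of such a capability:

```python
import pytest
from deepspeed.accelerator import get_accelerator

# Sketch: gate a test on what the current accelerator supports instead of a
# hard-coded device_name == "cpu" check. Each test would probe whichever
# feature it actually exercises.
requires_bf16 = pytest.mark.skipif(
    not get_accelerator().is_bf16_supported(),
    reason="bf16 is not supported on the current accelerator",
)


@requires_bf16
def test_bf16_kernel():
    ...
```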
This is a WIP PR that makes op builder detection adapt to accelerator changes. It is a follow-up of microsoft#5173.

Currently, DeepSpeed generates `installed_ops` and `compatible_ops` at setup time. If the system switches to a different accelerator by DeepSpeed launch time, these two lists will contain incorrect information. This PR intends to solve the problem with more flexible ops detection:

* For `installed_ops`, DeepSpeed should disable all installed ops if the accelerator detected at setup time differs from the one detected at launch time.
* For `compatible_ops`, DeepSpeed should refresh the list on each launch to avoid being affected by an accelerator change.

As a first step, the nv-inference workflow is temporarily changed to emulate the scenario where the system is set up with CPU_Accelerator and then launched with CUDA_Accelerator, and CPU_Accelerator is modified so that Intel Extension for PyTorch and oneCCL Binding for PyTorch are no longer mandatory. Starting from here we can reconstruct `installed_ops` and `compatible_ops` to follow the design above.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
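As an illustration of the `installed_ops` rule above, here is a minimal sketch under the assumption that setup time records both the prebuilt ops and the accelerator it saw; the helper names and values are hypothetical, not the actual DeepSpeed implementation:

```python
from deepspeed.accelerator import get_accelerator

# Hypothetical records written out at setup time (assumption for this sketch).
SETUP_TIME_ACCELERATOR = "cpu"
SETUP_TIME_INSTALLED_OPS = {"cpu_adam": True}


def effective_installed_ops():
    # If the accelerator seen at launch differs from the one seen at setup time,
    # none of the prebuilt ops apply, so report them all as unavailable (forcing
    # a fallback such as JIT building); otherwise keep the recorded list.
    if get_accelerator().device_name() != SETUP_TIME_ACCELERATOR:
        return {op: False for op in SETUP_TIME_INSTALLED_OPS}
    return dict(SETUP_TIME_INSTALLED_OPS)
```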