
[XPU] XPU accelerator support for Intel GPU device #4547

Merged: 36 commits, Jan 5, 2024

Conversation

delock (Collaborator) commented Oct 20, 2023

This PR includes XPU support for Intel GPU. With this PR, DeepSpeed can support XPU devices without installing Intel Extension for DeepSpeed.
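The pattern that makes this possible is a device-agnostic accelerator abstraction behind which backend-specific code hides. The sketch below is illustrative only; the class and method names are hypothetical and do not reproduce DeepSpeed's actual `deepspeed.accelerator` interface:

```python
from abc import ABC, abstractmethod

class BaseAccelerator(ABC):
    """Hypothetical accelerator interface: the rest of the framework
    talks to this, never to CUDA/XPU/NPU APIs directly."""

    @abstractmethod
    def device_name(self, device_index=None):
        ...

    @abstractmethod
    def is_available(self):
        ...

class XPUAccelerator(BaseAccelerator):
    """Illustrative XPU backend mapping the generic API onto Intel GPUs."""

    def device_name(self, device_index=None):
        # XPU devices are addressed as "xpu" or "xpu:<index>"
        return "xpu" if device_index is None else f"xpu:{device_index}"

    def is_available(self):
        # Real code would query the runtime (e.g. torch.xpu.is_available());
        # stubbed as False here so the sketch runs anywhere.
        return False

acc = XPUAccelerator()
print(acc.device_name(0))  # -> xpu:0
```

With such an abstraction, adding a new device class means registering one backend implementation rather than touching device checks scattered across the codebase.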

delock and others added 20 commits September 5, 2023 11:09
* add aio in xpu_upstream

* Update async_io.py deleting private path

* Update async_io.py
* add syclomatic code into upstream

enable jit_load for sycl kernels

* find Python.h using general code

* add SYCLAutoOpBuilder to support InferenceOpBuilder
* move scripts path to op_builder/xpu

* only change cuda files extension

* delete unused code in inferenceBuilder

* change third-party relative path to enable python install

* extract smaller functions from sycl_extension

* change from_blob in source code to avoid big part post processing

* run pre-commit

* add BF16 support

* add license to csrc/xpu code
delock (Collaborator, Author) commented Oct 30, 2023

SYCLAutoOPBuilder is integrated to convert CUDA kernels into SYCL kernels. Currently, transformer inference kernels are converted automatically at installation time. We are investigating whether we can extend this builder to other kernels so we can reduce the number of SYCL kernel files.
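Conceptually, this conversion is a source-to-source rewrite of CUDA code into SYCL. The tiny substitution table below is a toy illustration of the idea only; the real migration (SYCLomatic) also rewrites kernel launches, index spaces, memory APIs, and much more:

```python
import re

# Toy CUDA -> SYCL substitutions, for illustration only.
SUBSTITUTIONS = {
    r"\bcudaMalloc\b": "sycl::malloc_device",
    r"\bcudaFree\b": "sycl::free",
    r"\bcudaMemcpy\b": "queue.memcpy",
}

def convert_source(cuda_src: str) -> str:
    """Apply each textual substitution to the CUDA source."""
    sycl_src = cuda_src
    for pattern, replacement in SUBSTITUTIONS.items():
        sycl_src = re.sub(pattern, replacement, sycl_src)
    return sycl_src

print(convert_source("cudaMalloc(&buf, n); cudaFree(buf);"))
# -> sycl::malloc_device(&buf, n); sycl::free(buf);
```

Running such a rewrite at installation time means the repository can carry a single CUDA source tree and materialize the SYCL variant on demand.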

cc @baodii, who is working on SYCLAutoOPBuilder.

delock and others added 4 commits October 30, 2023 07:41

* add other OPBuilder. fused_adam done

* cpu_adam done

* all xpu OpBuilder done, need more test

* delete csrc/xpu

* delete useless files
delock (Collaborator, Author) commented Nov 8, 2023

@tjruwase With @baodii's contribution we have SYCLAutoBuilder, which converts CUDA kernels into the SYCL kernels used by Intel GPU. Now we can remove most of the manually written SYCL kernels in this PR (only one remains, and a fix is on the way).

@delock delock marked this pull request as ready for review November 8, 2023 06:40
CaoZhongZ (Contributor) commented:

oneapi-src/SYCLomatic#1398 tracks the last residual SYCL porting issue; when it's done we'll fully migrate all kernels. @delock @baodii

baodii and others added 2 commits November 16, 2023 08:32
* fix xpu builder to make install succeed

* fix AT_CUDA_CHECK error
baodii and others added 2 commits November 29, 2023 11:03
* delete SYCLAutoOpBuilder
* add optimizer SYCLOpBuilder
* delete transformer_inference op

* fix format error
delock (Collaborator, Author) commented Nov 29, 2023

Hi @tjruwase @jeffra, after internal discussion we removed SyclAutoOpBuilder. Although we have validated the automatically converted SYCL kernels at this point, if existing CUDA kernels change in the future and those changes are converted automatically by SyclAutoOPBuilder, the resulting SYCL code could have broken functionality or degraded performance if the process is fully automated.

Instead, we prefer to 1) convert CUDA kernels with SyclAutoOpBuilder offline, 2) validate the converted SYCL code, and 3) upstream the validated SYCL kernels. The SYCL kernels currently in this PR are the result of this process.
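Step 2 of that pipeline, validation, amounts to checking the converted kernel's output against the original CUDA reference numerically. A schematic version, using pure-Python stand-ins for the two kernels (the kernel bodies below are invented placeholders, not DeepSpeed code), might look like:

```python
import math

def reference_kernel(xs):
    """Stand-in for the original CUDA kernel's computation."""
    return [math.tanh(x) * 0.5 for x in xs]

def converted_kernel(xs):
    """Stand-in for the offline-converted SYCL kernel under test."""
    return [0.5 * math.tanh(x) for x in xs]

def validate(inputs, rtol=1e-5, atol=1e-8):
    """Element-wise tolerance check of converted vs reference output."""
    ref = reference_kernel(inputs)
    out = converted_kernel(inputs)
    return all(abs(a - b) <= atol + rtol * abs(b) for a, b in zip(out, ref))

print(validate([0.0, 0.5, -1.3, 2.7]))  # -> True
```

Only kernels that pass such checks (plus performance validation) would then be upstreamed.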

Let us know your thoughts and comments and we can discuss how to move forward. Thanks!

tjruwase (Contributor) commented Dec 4, 2023

@delock, thanks for the update. If I understand correctly, you plan to upstream new SYCL kernels via PRs?

delock (Collaborator, Author) commented Dec 5, 2023

> @delock, thanks for the update. If I understand correctly, you plan to upstream new SYCL kernels via PRs?

Hi @tjruwase, it depends. For supporting DeepSpeed OpBuilders we have two methods:

  1. If the functionality in an OpBuilder is relatively specific to DeepSpeed, we plan to upstream SYCL kernels for that OpBuilder via PRs. The DeepSpeed optimizer kernels are the set of SYCL kernels we are upstreaming this way.
  2. If the functionality in an OpBuilder is relatively generic, we plan to build the kernel inside Intel Extension for PyTorch and reuse that functionality inside the OpBuilder implementation. This is similar to NPU's implementation (https://github.com/microsoft/DeepSpeed/blob/master/op_builder/npu/fused_adam.py); in this case no SYCL kernel will be upstreamed.

In the big picture, we expect most DeepSpeed features to have their OpBuilders supported through method 2 for XPU, with the intention that the functionality can also be reused elsewhere. We may see method 1 used in two situations:

  1. When the kernel function is DeepSpeed specific.
  2. When there is a contribution from another party; implementing it through SYCLOpBuilder is a more direct way to contribute.
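Method 2 above, reusing a kernel that already ships in a framework extension instead of compiling SYCL sources in-tree, has a characteristic builder shape. The sketch below is a hypothetical, framework-free illustration of that shape; the class names and the placeholder namespace are invented, not DeepSpeed's or Intel Extension for PyTorch's actual API:

```python
class OpBuilderBase:
    """Minimal stand-in for an op-builder interface (illustrative)."""

    def sources(self):
        raise NotImplementedError

    def load(self):
        raise NotImplementedError

class XPUFusedAdamBuilder(OpBuilderBase):
    """Method-2 style builder: nothing is compiled here.

    The actual kernel is assumed to live in a prebuilt extension
    (e.g. Intel Extension for PyTorch); the builder just returns a
    namespace exposing it.
    """

    def sources(self):
        return []  # no in-tree SYCL sources to compile

    def load(self):
        # Real code would import the extension and return its fused-Adam
        # op; stubbed with a placeholder so the sketch is self-contained.
        class FusedAdamNamespace:
            @staticmethod
            def multi_tensor_adam(*args, **kwargs):
                raise RuntimeError("placeholder: provided by the extension")
        return FusedAdamNamespace

builder = XPUFusedAdamBuilder()
print(builder.sources())  # -> []
```

Because `sources()` is empty, installation needs no device compiler at all; the dependency is pushed to the extension package, which is exactly why no SYCL kernel needs to be upstreamed in this case.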

@delock delock requested a review from mrwyattii as a code owner January 3, 2024 07:01
@tjruwase tjruwase requested review from ShadenSmith and tjruwase and removed request for jeffra, cmikeh2, awan-10, RezaYazdaniAminabadi and arashb January 4, 2024 16:45
tjruwase (Contributor) commented Jan 4, 2024

@delock, apologies for the delay in reviewing this PR. We will prioritize it now.

mrwyattii (Contributor) commented:

Thanks @delock this all looks great! Do you have tests that you are running internally to verify this code?

delock (Collaborator, Author) commented Jan 5, 2024

> Thanks @delock this all looks great! Do you have tests that you are running internally to verify this code?

Hi @mrwyattii, yes, we validate this code on XPU devices regularly with inference and training workloads.

@mrwyattii mrwyattii merged commit f4f3131 into microsoft:master Jan 5, 2024
14 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
This PR includes XPU support for Intel GPU. With this PR, DeepSpeed can support XPU devices without installing Intel Extension for DeepSpeed.

---------

Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: baodi <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Yizhou Wang <yizhou.wang@intel.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>