
GPU Optimization #34

Open
ashissamal opened this issue Dec 17, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@ashissamal

Thanks for sharing the repo. It is really helpful.

I'm exploring ways to do the optimization on GPU. I know it's not presently supported. Could you share some approaches or references for implementing the optimization on GPU (Nvidia)?

@Ki6an
Owner

Ki6an commented Dec 18, 2021

For GPU you can use the onnxruntime-gpu library, but it does not support quantization, so you won't have the advantage of reduced model size during inference.

Here's an example implementation of this library for BERT; you can follow this guide and make suitable changes for T5. In addition to this, you also need to implement IO binding.
I tried without IO binding but wasn't able to get any advantage over PyTorch.
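A minimal sketch of what that could look like with onnxruntime-gpu (the file and tensor names below are placeholders, not fastT5's actual export names):

import numpy as np
import onnxruntime as ort

# Sketch only: run one of the exported T5 ONNX graphs on GPU via onnxruntime-gpu.
# "t5-encoder.onnx", "input_ids" and "attention_mask" are assumed names.
sess = ort.InferenceSession(
    "t5-encoder.onnx",
    providers=["CUDAExecutionProvider"],
)
input_ids = np.ones((1, 8), dtype=np.int64)
attention_mask = np.ones((1, 8), dtype=np.int64)
encoder_out = sess.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})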

@sam-writer
Contributor

I would also check out this recent demo that NVIDIA did of TensorRT, which involves converting to ONNX as an intermediate step. They run the tests on GPU: https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/README.md

@pommedeterresautee

pommedeterresautee commented Dec 21, 2021

ONNX Runtime supports GPU quantization through the TensorRT provider (now embedded by default in the GPU version of the PyPI lib, no need for a custom compilation). However, it only supports PTQ, meaning there is a 2-3 point accuracy cost (vs. QAT or dynamic quantization, which are usually close to non-quantized accuracy). Quantization brings a 2x speedup, which you can add to the 1.3x speedup from switching from ORT to TRT, so it is quite significant on base / large models (not yet benchmarked on distilled models).

Hopefully, QAT is also doable, but it requires some work per model (modifying the attention part to add QDQ nodes). You can see some examples in ELS-RD/transformer-deploy#29; for now only Albert, Electra, Bert, Roberta and Distilbert are covered. I will probably add support for Deberta V1 and V2, T5 and Bart as a next step.
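For the TensorRT provider route, a rough sketch of selecting it with FP16 / INT8 options (the option names trt_fp16_enable and trt_int8_enable follow the ONNX Runtime TensorRT execution provider docs; verify them for your onnxruntime-gpu version):

import onnxruntime as ort

# Sketch only: prefer the TensorRT EP, falling back to CUDA. INT8 here means PTQ
# and requires a calibration table; option names are taken from the ORT TRT EP docs.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True, "trt_int8_enable": False}),
    "CUDAExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)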

@nbravulapalli

I really appreciate the functionality that the fastT5 library offers!

Like the original poster, I am looking to leverage the speedups from both ONNX Runtime and the quantization that fastT5 offers, and to deploy this on an Nvidia GPU. Do you have any pointers on how to accomplish this with a t5-large model?

@Ki6an or @sam-writer
Are there plans to add GPU support for this library?

Thanks!

@sam-writer
Contributor

sam-writer commented Jan 5, 2022

@nbravulapalli yes, there are plans to make running on GPU as easy as running on CPU is currently. However, if you need to run on GPU now, your best bet is probably to follow this notebook to convert the model to TensorRT format, which runs on GPU faster than quantized ONNX T5 runs on CPU.

I now understand @pommedeterresautee's comment. You do not need to convert to TRT format to use TRT. You can convert to ONNX format, then, per the ONNX Runtime docs, use the TRT execution provider:

import onnxruntime as ort
# set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'] with TensorrtExecutionProvider having the higher priority.
sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])

This might be fast enough, because TRT gives a 1.3x speed boost. But if you want the 2x speed boost of quantization, you take an accuracy hit. In fastT5, quantization doesn't hurt accuracy because we use dynamic quantization, which AFAICT isn't an option on GPU yet; you'd be using PTQ instead, which does hurt accuracy.
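For reference, the dynamic quantization referred to here is the standard ONNX Runtime kind; a minimal sketch using the standard onnxruntime.quantization API (not necessarily exactly what fastT5 does internally; file names are placeholders):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Sketch only: dynamic (weight-only) INT8 quantization of an exported ONNX graph,
# done offline. As noted above, this approach isn't an option on GPU yet.
quantize_dynamic(
    "t5-decoder.onnx",
    "t5-decoder-quant.onnx",
    weight_type=QuantType.QInt8,
)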

Another consideration on GPU that isn't a factor on CPU is IO binding: copying values back and forth to the GPU takes time and should be minimized. Not getting IO binding right can cause a perf hit.
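A rough sketch of IO binding with ONNX Runtime (not fastT5 code; the model file and tensor names are placeholders):

import numpy as np
import onnxruntime as ort

# Sketch only: keep the input on the GPU and leave the output there, so data is
# copied back to the host only once at the end instead of on every call.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

input_ids = np.ones((1, 8), dtype=np.int64)
input_on_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("input_ids", input_on_gpu)
binding.bind_output(sess.get_outputs()[0].name, "cuda")  # result stays on the GPU

sess.run_with_iobinding(binding)
output = binding.copy_outputs_to_cpu()[0]  # copy back to the host only when needed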

@sam-writer
Contributor

Here is an example from a branch of the ONNX library that demonstrates using IO binding as well as other tricks needed to run on GPU (link).

@GenVr

GenVr commented Mar 9, 2022

@sam-writer So, to get better inference time for T5 on GPU right now, do you recommend this code?

Here is an example from a branch of the ONNX library that demonstrates using IO binding as well as other tricks needed to run on GPU (link).

Is there any example code showing real-world usage?

I would like to improve the GPU inference time with a T5-base with max_length=1024.

@Ki6an Ki6an pinned this issue May 20, 2022
@Ki6an Ki6an added the enhancement New feature or request label May 20, 2022