
GPU Optimization #34

Open
ashissamal opened this issue Dec 17, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@ashissamal

Thanks for sharing the repo. It is really helpful.

I'm exploring ways to do the optimization on GPU. I know it's not presently supported. Could you share some approaches or references for implementing the optimization on GPU (Nvidia)?

@Ki6an
Owner

Ki6an commented Dec 18, 2021

For GPU you can use the onnxruntime-gpu library, but it does not support quantization, so you won't have the advantage of reduced model size during inference.

Here's an example implementation of this library for BERT; you can follow this guide and make suitable changes for T5. In addition to this, you also need to implement IO binding.
I tried without IO binding but wasn't able to get any advantage over PyTorch.
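A minimal sketch of what that could look like with onnxruntime-gpu (the file and tensor names below are placeholders, not fastT5's actual export names):

import numpy as np
import onnxruntime as ort

# Sketch only: run one of the exported T5 ONNX graphs on GPU via onnxruntime-gpu.
# "t5-encoder.onnx", "input_ids" and "attention_mask" are assumed names.
sess = ort.InferenceSession(
    "t5-encoder.onnx",
    providers=["CUDAExecutionProvider"],
)
input_ids = np.ones((1, 8), dtype=np.int64)
attention_mask = np.ones((1, 8), dtype=np.int64)
encoder_out = sess.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})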

@sam-writer
Contributor

I would also check out this recent demo that NVIDIA did of TensorRT, which involves converting to ONNX as an intermediate step. They run the tests on GPU: https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/README.md

@pommedeterresautee

pommedeterresautee commented Dec 21, 2021

ONNX Runtime supports GPU quantization through the TensorRT provider (now embedded by default in the GPU version of the PyPI lib, no need for a custom compilation). However, it only supports PTQ, meaning there is a 2-3 point accuracy cost (vs. QAT or dynamic quantization, which are usually close to non-quantized accuracy). Quantization brings a 2x speedup, which you can add to the 1.3x speedup from switching from ORT to TRT, so it is quite significant on base / large models (not yet benchmarked on distilled models).

Hopefully, QAT is also doable, but it requires some work per model (modifying the attention part to add QDQ nodes). You can see some examples in ELS-RD/transformer-deploy#29; for now only Albert, Electra, Bert, Roberta and Distilbert are covered. I will probably add support for Deberta V1 and V2, T5 and Bart as a next step.
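For the TensorRT provider route, a rough sketch of selecting it with FP16 / INT8 options (the option names trt_fp16_enable and trt_int8_enable follow the ONNX Runtime TensorRT execution provider docs; verify them for your onnxruntime-gpu version):

import onnxruntime as ort

# Sketch only: prefer the TensorRT EP, falling back to CUDA. INT8 here means PTQ
# and requires a calibration table; option names are taken from the ORT TRT EP docs.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True, "trt_int8_enable": False}),
    "CUDAExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)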

@nbravulapalli

I really appreciate the functionality that the fastT5 library offers!

Like the original poster, I am looking to leverage the speedups from both ONNX Runtime and the quantization that fastT5 offers, and to deploy this on an Nvidia GPU. Do you have any pointers on how to accomplish this with a t5-large model?

@Ki6an or @sam-writer
Are there plans to add GPU support for this library?

Thanks!

@sam-writer
Contributor

sam-writer commented Jan 5, 2022

@nbravulapalli yes, there are plans to make running on GPU as easy as running on CPU is currently. However, if you need to run on GPU now, your best bet is probably to follow this notebook to convert the model to TensorRT format, which runs on GPU faster than quantized ONNX T5 runs on CPU.

I now understand @pommedeterresautee's comment. You do not need to convert to TRT format to use TRT. You can convert to ONNX format, then, per the ONNX Runtime docs, use the TRT execution provider:

import onnxruntime as ort
# set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'] with TensorrtExecutionProvider having the higher priority.
sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])

This might be fast enough, because TRT gives a 1.3x speed boost. But if you want the 2x speed boost of quantization, you take an accuracy hit. In fastT5, quantization doesn't hurt accuracy because we use dynamic quantization, which AFAICT isn't an option on GPU yet; you'd be using PTQ instead, which does hurt accuracy.
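For reference, the dynamic quantization referred to here is the standard ONNX Runtime kind; a minimal sketch using the standard onnxruntime.quantization API (not necessarily exactly what fastT5 does internally; file names are placeholders):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Sketch only: dynamic (weight-only) INT8 quantization of an exported ONNX graph,
# done offline. As noted above, this approach isn't an option on GPU yet.
quantize_dynamic(
    "t5-decoder.onnx",
    "t5-decoder-quant.onnx",
    weight_type=QuantType.QInt8,
)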

Another consideration on GPU that isn't a factor on CPU is IO binding: copying values back and forth to the GPU takes time and should be minimized. Not getting IO binding right can cause a perf hit.
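A rough sketch of IO binding with ONNX Runtime (not fastT5 code; the model file and tensor names are placeholders):

import numpy as np
import onnxruntime as ort

# Sketch only: keep the input on the GPU and leave the output there, so data is
# copied back to the host only once at the end instead of on every call.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

input_ids = np.ones((1, 8), dtype=np.int64)
input_on_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("input_ids", input_on_gpu)
binding.bind_output(sess.get_outputs()[0].name, "cuda")  # result stays on the GPU

sess.run_with_iobinding(binding)
output = binding.copy_outputs_to_cpu()[0]  # copy back to the host only when needed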

@sam-writer
Contributor

Here is an example from a branch of the ONNX library that demonstrates using IO binding as well as other tricks needed to run on GPU (link).

@GenVr

GenVr commented Mar 9, 2022

@sam-writer So, to get better inference time for T5 on GPU right now, do you recommend this code?

Here is an example from a branch of the ONNX library that demonstrates using IO binding as well as other tricks needed to run on GPU (link).

Is there any example code showing real-world usage?

I would like to improve the GPU inference time with a T5-base with max_length=1024.

@Ki6an Ki6an pinned this issue May 20, 2022
@Ki6an Ki6an added the enhancement New feature or request label May 20, 2022