OpenVINO and ONNX support for faster CPU execution #208
Replies: 23 comments 38 replies
-
Hi, nice work! I was immediately interested in your result, because I am working on my own custom inference implementation of Whisper for CPU, focusing on efficiency and low memory usage. I just ran the same transcription that you did with the 1h 30min Carmack video and got the following results:
This is running on a MacBook M1 Pro, CPU only, using 8 threads. My implementation is available here: https://github.com/ggerganov/whisper.cpp If you are interested, you can give it a try and see how it performs on your hardware. Edit:
Maybe on Arm there are some extra steps?
-
Tried converting the ONNX models to FP16 OpenVINO models, like this: mo --input_model ../models/openvino/medium/decoder.onnx --data_type FP16 But I can't see any improvement on my task compared to your OpenVINO models (10 s / 200 s audio test):
-
Also, can you check your model files on Hugging Face? I tried the small model but got an error about shapes - it seems to actually be the base model. Maybe the wrong models were uploaded to HF.
-
Additional testing for the tiny model. I think,
Encoder:
TIME torch: 2.716999053955078
TIME onnxruntime: 2.300724983215332 (zhuzilin)
TIME openvino: 1.8365371227264404 (zhuzilin)
TIME deepsparse: 1.354171991348266 (zhuzilin)
Decoder:
TIME torch: 1.5086073875427246
-
I think it makes sense to test the speed of the encoder and decoder separately, since that shows the net gain in performance. If we measure the entire audio-recognition process, everything depends heavily on the quality of the audio, how legible the speech is, and on the selected TokenDecoder. If you look at the results in #212, you can see that in some cases heavier models are processed faster - this is because they almost immediately give a usable result, while a smaller model needs more passes to get a more or less acceptable result. A minimal timing sketch is shown below.
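As an illustration, here is a minimal sketch of timing the encoder on its own; it assumes the upstream openai-whisper API, and the model size and dummy input are placeholders:

```python
# Hypothetical sketch: time the encoder forward pass in isolation.
# A full transcribe() run mixes encoder, decoder and TokenDecoder behaviour,
# so its timing depends heavily on the audio itself.
import time
import torch
import whisper  # assumes the upstream openai-whisper package

model = whisper.load_model("tiny").eval()
mel = torch.zeros(1, 80, 3000)  # dummy log-mel input of the expected shape

with torch.no_grad():
    t0 = time.perf_counter()
    audio_features = model.encoder(mel)
    print(f"encoder forward pass: {time.perf_counter() - t0:.3f} s")
```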
-
Amazing work, but there seems to be a problem with the models on Huggingface:
-
Great work @zhuzilin! Tagging this here as it seems relevant: #227. Same issue with your code base as in #227.
-
Hello, I am trying to create the ONNX files myself. Could you please guide me on how to do that? For example, how did you write the ONNX configuration? Thank you
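For anyone with the same question, below is a minimal, hypothetical sketch of one way to export the encoder with plain torch.onnx.export; it assumes the upstream openai-whisper package, and the output path and opset are arbitrary choices. The author's actual export script (and any transformers OnnxConfig) is not shown in this thread, so treat this as illustrative only:

```python
# Illustrative only: export the Whisper encoder to ONNX with a fixed input shape.
# The decoder is harder to export because of the token loop and key/value caching.
import torch
import whisper

model = whisper.load_model("tiny").eval()
mel = torch.zeros(1, 80, 3000)  # dummy log-mel spectrogram input

torch.onnx.export(
    model.encoder,
    mel,
    "encoder.onnx",                 # placeholder output path
    input_names=["mel"],
    output_names=["audio_features"],
    opset_version=13,
)
```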
-
@zhuzilin Did you save the ONNX models that you used to convert to OpenVINO anywhere?
-
Can you change the package name from whisper to something like whispervino?
-
Looks very promising for the smaller models, making their use more feasible on lower-end CPUs. Some quick numbers from me, on a Ryzen 5 4500U 6-core laptop CPU (yes, clearly not the best option for these workloads, but it's what I have), using the audio of this YouTube video (6m 30s): https://www.youtube.com/watch?v=GFu64hnqzVo
Original whisper on CPU: 6m19s on tiny.en, 15m39s on base.en, 60m45s on small.en
So 1.5x faster on tiny and 2x on base is very helpful indeed.
Note: I've found the speed of whisper to be quite dependent on the audio file used, so your results may vary. I compared the output files (tiny vs tiny and base vs base) and they matched exactly. Per the above, there's an issue with the openvino small.en model, so I can't benchmark that yet.
-
I'm seeing the same dynamic shape error as others, except with openvino 2022.3. Any ideas?
2d [genevera:~/src/ml/ovarm/openvino/build/wheels ⚞2] [openvino-whisper] [2.7.4] master 2s 1 ± pip list
Package Version
------------------ ------------------
certifi 2022.9.24
charset-normalizer 2.1.1
clang 12.0.1
ffmpeg-python 0.2.0
filelock 3.8.0
future 0.18.2
huggingface-hub 0.10.1
idna 3.4
more-itertools 8.14.0
numpy 1.23.1
openvino 2022.3.0
packaging 21.3
Pillow 9.2.0
pip 22.3
pyparsing 3.0.9
PyYAML 6.0
regex 2022.9.13
requests 2.28.1
semantic-version 2.10.0
setuptools 65.5.0
setuptools-rust 1.5.2
tokenizers 0.13.1
torch 1.12.1
torchaudio 0.13.0.dev20221017
torchvision 0.15.0.dev20221017
tqdm 4.64.1
transformers 4.23.1
typing_extensions 4.4.0
urllib3 1.26.12
wheel 0.37.1
whisper 1.0
⌁ 2d [genevera:~/src/ml/ovarm/openvino/build/wheels ⚞2] [openvino-whisper] [2.7.4] master 6s ± whisper --model base.en --language en ./tpn.wav
Traceback (most recent call last):
File "/Users/genevera/.pyenv/versions/openvino-whisper/bin/whisper", line 8, in <module>
sys.exit(cli())
File "/Users/genevera/.pyenv/versions/3.9.13/envs/openvino-whisper/lib/python3.9/site-packages/whisper/transcribe.py", line 283, in cli
model = load_model(model_name)
File "/Users/genevera/.pyenv/versions/3.9.13/envs/openvino-whisper/lib/python3.9/site-packages/whisper/__init__.py", line 104, in load_model
model = Whisper(dims, name)
File "/Users/genevera/.pyenv/versions/3.9.13/envs/openvino-whisper/lib/python3.9/site-packages/whisper/model.py", line 77, in __init__
self.decoder = OpenVinoTextDecoder(model=model)
File "/Users/genevera/.pyenv/versions/3.9.13/envs/openvino-whisper/lib/python3.9/site-packages/whisper/model.py", line 55, in __init__
self.model = self.core.compile_model(self._model, "CPU")
File "/Users/genevera/.pyenv/versions/3.9.13/envs/openvino-whisper/lib/python3.9/site-packages/openvino/runtime/ie_api.py", line 386, in compile_model
super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: get_shape was called on a descriptor::Tensor with dynamic shape
-
Hi @zhuzilin, thanks for this great repo. A small suggestion: in https://github.com/zhuzilin/whisper-openvino/blob/9143b8c0508bc4366583cb941d0dd970f3fc4386/whisper/model.py#L65 I would suggest casting specifically to np.int64, as the default int type varies across OSes. Also, I tried to open an issue in the repo but couldn't find the option; I guess issues are disabled.
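A tiny sketch of the suggested change (the variable names below are placeholders, not the actual code at that line):

```python
import numpy as np

tokens = [[50258, 50259, 50359]]        # example token ids
x = np.array(tokens, dtype=np.int64)    # explicit dtype, identical on every OS
# instead of: x = np.array(tokens)      # int32 on Windows, int64 on Linux/macOS
```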
-
The caching for the audio features hasn't been ported properly (lines 84 to 86 in edb6944). Below is where the keys and values for the audio features are cached in the original implementation (lines 81 to 83 in eff383b).
-
Is it possible to run on a GPU?
-
I have tried this on a CentOS server, and I find your implementation will hurt the
-
Has anyone tried ONNX support for the whisper-large model?
-
Can someone tell me how to convert a whisper model to an OpenVINO model?
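One possible route, sketched under the assumption that you already have an ONNX export of the model (file names are placeholders): the OpenVINO runtime can read ONNX directly and the model can then be serialized to IR, or the mo CLI shown earlier in the thread can do the conversion offline.

```python
# Hedged sketch: load an ONNX export with the OpenVINO runtime and save it as IR.
from openvino.runtime import Core, serialize

core = Core()
model = core.read_model("encoder.onnx")          # ONNX is read directly
serialize(model, "encoder.xml", "encoder.bin")   # persist as OpenVINO IR
compiled = core.compile_model(model, "CPU")      # ready for CPU inference
```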
-
I have not yet been able to get ONNX or OpenVINO to work. I know for a fact this works great though: https://github.com/ggerganov/whisper.cpp However, it does not yet have an API wrapper; we have not figured that out yet or had the time to do it.
-
Getting an error on a Windows 10 x64 laptop. How can I fix it? Is it not possible to set the fp16 argument when calling model.transcribe(...)?
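For reference, in the upstream openai-whisper API extra keyword arguments to transcribe() are forwarded as decoding options, so fp16 can be disabled explicitly for CPU-only runs; whether this fork accepts the same argument is an assumption here:

```python
# Hypothetical workaround sketch: turn off FP16 explicitly on CPU.
import whisper

model = whisper.load_model("base.en")
result = model.transcribe("audio.wav", fp16=False)  # "audio.wav" is a placeholder
print(result["text"])
```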
-
Specifying the GPU as the model load destination results in a runtime error: RuntimeError: cldnn program build failed! [GPU] get_tensor() is called for dynamic shape
Does anyone have a solution? Thanks.
-
I suggest that you also have a look at the PR "We are supporting whisper in sherpa-onnx". At present, you can use the code from that PR to export whisper models to ONNX and use the exported models with onnxruntime in Python for speech recognition. We are adding C++ support to sherpa-onnx.
-
What are the input nodes and their names? I cannot find a "readme" for the ONNX files.
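In the absence of a README, the input and output names can be read from the ONNX files themselves; a small sketch using onnxruntime ("encoder.onnx" is a placeholder path):

```python
# Print every input/output name, shape and element type of an ONNX model.
import onnxruntime as ort

sess = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
for i in sess.get_inputs():
    print("input: ", i.name, i.shape, i.type)
for o in sess.get_outputs():
    print("output:", o.name, o.shape, o.type)
```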
-
Hi! Thank you so much for this amazing project!
I've been playing with the models and found they work terrifically, even the "tiny" one, so I converted the models to OpenVINO and ONNX and organized them in github.com/zhuzilin/whisper-openvino and on the Hugging Face Hub, so that people can give them a faster try without a GPU and deploy them more easily on the client side.
While exporting the models to OpenVINO, I got roughly a 40% end-to-end time reduction (a 90-minute audio file now takes 40 minutes instead of 68). But I believe there is still large potential for improving the CPU performance of the whisper models. Therefore, I'm posting it here so that people can give it a try and help me make it better :)