float16 does not appear to work on CPU with fp16 capabilities #65

Closed
FlippFuzz opened this issue Mar 22, 2023 · 14 comments

Comments

@FlippFuzz
Contributor

FlippFuzz commented Mar 22, 2023

Convert model

ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2-float16 --copy_files tokenizer.json --quantization float16

Run using sample

from faster_whisper import WhisperModel

model_path = "whisper-large-v2-ct2-float16/"

# Run on CPU with FP16
model = WhisperModel(model_path, device="cpu", compute_type="float16")

segments, info = model.transcribe("sample/xxxx.webm", language='ja', task='translate', beam_size=1)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.

This was run on Oracle Cloud's free tier, which provides 4 Ampere A1 CPU cores and 24 GB of RAM.
The Ampere A1 CPU has native support for FP16.

In WhisperCpp (ggerganov/whisper.cpp#532), I was able to get it to work well with FP16 by adding the necessary compile flags for FP16.
Is there anything similar that we can do here?
FP16 would hopefully significantly improve performance.

@guillaumekln
Contributor

Currently we rely on third-party libraries to run the matrix multiplications, but none of them support FP16 computation on CPU. (We integrate Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate, which are selected depending on the platform.)

In the whisper.cpp issue you linked there are indeed gains when using the FP16 model and enabling the relevant FP16 compilation flags. Do you know how it compares to running the FP32 model with OpenBLAS on this CPU?

In faster-whisper, you could try using 8-bit quantization instead, with compute_type="int8".
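
A minimal sketch of that fallback, assuming an unsupported compute type raises ValueError as in the error reported above. The names `load_with_fallback` and `fake_loader` are hypothetical and not part of faster-whisper; the loader is passed in as a callable so the logic can be shown without a model on disk.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    loader: Callable[[str], T],
    compute_types: Sequence[str] = ("float16", "int8", "float32"),
) -> Tuple[T, str]:
    """Try each compute type in order and return the first model that loads."""
    last_error = None
    for compute_type in compute_types:
        try:
            return loader(compute_type), compute_type
        except ValueError as error:
            # Assumption: an unsupported compute type raises ValueError,
            # as in the error message quoted in this thread.
            last_error = error
    raise ValueError(f"none of {list(compute_types)} is supported") from last_error


def fake_loader(compute_type: str) -> str:
    """Hypothetical stand-in for a CPU backend without FP16 support."""
    if compute_type == "float16":
        raise ValueError("float16 is not supported on this device")
    return f"model[{compute_type}]"


model, chosen = load_with_fallback(fake_loader)
print(chosen)
```

With a real model, the loader would presumably be something like `lambda ct: WhisperModel(model_path, device="cpu", compute_type=ct)`.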

@FlippFuzz
Contributor Author

I don't know how to explicitly select OpenBLAS and am just using the defaults:

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

time python3 main.py (main.py is at the start of this issue)

Beam size = 1 for all tests.
Original File is 2m28s long.

Implementation  Commit   Quantization  Time
faster-whisper  e44a8c7  fp32          10m36.149s
faster-whisper  e44a8c7  int8          07m05.425s
WhisperCpp      8e361d9  fp16          04m58.193s

It does look like the lack of fp16 support hurts on this particular model of CPU.
Anyway, I am happy to just close this since we already know that the underlying dependencies don't support fp16. There is not much we can do here.
I'm also happy to run any further simple tests if you want.

@guillaumekln
Contributor

Could you enable the verbose mode when running faster-whisper and post the output here?

CT2_VERBOSE=1 time python3 main.py

@FlippFuzz
Contributor Author

This is how I ran faster-whisper and WhisperCpp.
I will post the CT2_VERBOSE=1 output soon; it's running right now.


Environment

Spin up an Always Free Oracle Cloud instance.

  • Shape: VM.Standard.A1.Flex
  • OCPU count: 4
  • Memory (GB): 24
  • OS: Ubuntu 22.04

Faster-whisper

git clone https://github.com/guillaumekln/faster-whisper.git
cd faster-whisper

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
pip3 install -r requirements.conversion.txt

ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2-float32 --copy_files tokenizer.json
ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2-int8 --copy_files tokenizer.json --quantization int8


tee main.py << 'EOF'
from faster_whisper import WhisperModel

model_path = "whisper-large-v2-ct2-float32/"  # Obviously change the float32 to int8 as needed
model = WhisperModel(model_path, device="cpu", compute_type="float32")  # Obviously change the float32 to int8 as needed

segments, info = model.transcribe("xxxx.webm", language='ja', task='translate', beam_size=1)  # Fill in your file

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
EOF

time python3 main.py

WhisperCpp

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp/

make CC=gcc-12 CXX=g++-12

# Convert file to wav - This took 0m1.111s.
time ffmpeg -i "xxxx.webm" -ar 16000 -ac 1 -c:a pcm_s16le "xxxx.wav"
  
time ./main --model models/ggml-large.bin --language ja --translate -f xxxx.wav \
--threads 4 --beam-size 1 --best-of 5 \
--word-thold 0.01 --entropy-thold 2.40 --logprob-thold -1.00 \
--print-progress --print-colors \
--output-srt --output-csv --output-file xxxx

@FlippFuzz
Contributor Author

[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info] CPU: ARM (NEON=true)
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - Selected ISA: NEON
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - Use Intel MKL: false
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - SGEMM backend: OpenBLAS (packed: false)
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - GEMM_S16 backend: none (packed: false)
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - GEMM_S8 backend: Ruy (packed: false, u8s8 preferred: false)
[2023-03-23 13:47:35.387] [ctranslate2] [thread 256085] [info] Loaded model whisper-large-v2-ct2-int8/ on device cpu:0
[2023-03-23 13:47:35.387] [ctranslate2] [thread 256085] [info]  - Binary version: 6
[2023-03-23 13:47:35.387] [ctranslate2] [thread 256085] [info]  - Model specification revision: 3
[2023-03-23 13:47:35.387] [ctranslate2] [thread 256085] [info]  - Selected compute type: int8
Detected language 'ja' with probability 1.000000
...
1412.49user 51.14system 7:25.84elapsed 328%CPU (0avgtext+0avgdata 4990180maxresident)k
0inputs+0outputs (2major+25393022minor)pagefaults 0swaps
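
For comparing runs, the backend selection in a verbose log like the one above can be pulled into a dictionary. This is only a sketch: `parse_ct2_settings` is a hypothetical helper, and the regular expression assumes the `[info]  - key: value` layout shown in this log, which other CTranslate2 versions may not keep.

```python
import re

# Matches lines such as "... [info]  - Selected ISA: NEON".
SETTING = re.compile(r"\[info\]\s+-\s+(?P<key>[^:]+):\s+(?P<value>.+)$")


def parse_ct2_settings(log_text: str) -> dict:
    """Collect the "- key: value" entries from a verbose CTranslate2 log."""
    settings = {}
    for line in log_text.splitlines():
        match = SETTING.search(line)
        if match:
            settings[match["key"].strip()] = match["value"].strip()
    return settings


sample = """\
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info] CPU: ARM (NEON=true)
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - Selected ISA: NEON
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - SGEMM backend: OpenBLAS (packed: false)
[2023-03-23 13:47:34.693] [ctranslate2] [thread 256085] [info]  - GEMM_S8 backend: Ruy (packed: false, u8s8 preferred: false)
[2023-03-23 13:47:35.387] [ctranslate2] [thread 256085] [info]  - Selected compute type: int8
"""

settings = parse_ct2_settings(sample)
for key, value in settings.items():
    print(f"{key} = {value}")
```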

@guillaumekln
Contributor

Thank you for all the information! Everything looks correct to me.

So it seems this CPU benefits a lot from FP16 and a native compilation.

Can you share what compilation flags are enabled with -march=native? See for example https://stackoverflow.com/a/5470379.

@FlippFuzz
Contributor Author

FlippFuzz commented Mar 26, 2023

Here are the flags:

gcc-12 -march=native -Q --help=target
The following options are target specific:
  -mabi=                                lp64
  -march=                               armv8.2-a+crypto+fp16+rcpc+dotprod+ssbs
  -mbig-endian                          [disabled]
  -mbionic                              [disabled]
  -mbranch-protection=
  -mcmodel=                             small
  -mcpu=                                generic
  -mfix-cortex-a53-835769               [enabled]
  -mfix-cortex-a53-843419               [enabled]
  -mgeneral-regs-only                   [disabled]
  -mglibc                               [enabled]
  -mharden-sls=
  -mlittle-endian                       [enabled]
  -mlow-precision-div                   [disabled]
  -mlow-precision-recip-sqrt            [disabled]
  -mlow-precision-sqrt                  [disabled]
  -mmusl                                [disabled]
  -momit-leaf-frame-pointer             [enabled]
  -moutline-atomics                     [enabled]
  -moverride=<string>
  -mpc-relative-literal-loads           [enabled]
  -msign-return-address=                none
  -mstack-protector-guard-offset=
  -mstack-protector-guard-reg=
  -mstack-protector-guard=              global
  -mstrict-align                        [disabled]
  -msve-vector-bits=<number>            scalable
  -mtls-dialect=                        desc
  -mtls-size=                           24
  -mtrack-speculation                   [disabled]
  -mtune=                               generic
  -muclibc                              [disabled]
  -mverbose-cost-dump                   [disabled]

  Known AArch64 ABIs (for use with the -mabi= option):
    ilp32 lp64

  Supported AArch64 return address signing scope (for use with -msign-return-address= option):
    all non-leaf none

  The code model option names for -mcmodel:
    large small tiny

  Valid arguments to -mstack-protector-guard=:
    global sysreg

  The possible SVE vector lengths:
    1024 128 2048 256 512 scalable

  The possible TLS dialects:
    desc trad

Also displaying GCC version if it helps.

gcc-12 -v
Using built-in specs.
COLLECT_GCC=gcc-12
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/12/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 12.1.0-2ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-12/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-12 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.1.0 (Ubuntu 12.1.0-2ubuntu1~22.04)

And here is the CPU

lscpu
Architecture:            aarch64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Neoverse-N1
    Model:               1
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r3p1
    BogoMIPS:            50.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; __user pointer sanitization
  Spectre v2:            Mitigation; CSV2, BHB
  Srbds:                 Not affected
  Tsx async abort:       Not affected
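
The FP16 capability shows up in the Flags line above as `fphp` (scalar half precision) and `asimdhp` (Advanced SIMD half precision), which matches the `+fp16` that GCC reports for -march=native. A small sketch for checking a flags string, where `has_native_fp16` is a hypothetical helper name:

```python
# Linux AArch64 feature names for scalar and Advanced SIMD half precision.
FP16_FLAGS = {"fphp", "asimdhp"}


def has_native_fp16(flags_line: str) -> bool:
    """Return True if every FP16 feature flag appears in the flags string."""
    return FP16_FLAGS.issubset(flags_line.split())


# Flags copied from the lscpu output in this thread.
neoverse_n1 = (
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp "
    "cpuid asimdrdm lrcpc dcpop asimddp ssbs"
)
print(has_native_fp16(neoverse_n1))
```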

@FlippFuzz
Contributor Author

Perhaps we need to use https://github.com/ARM-software/ComputeLibrary ?
However, this is just a guess. I'm not familiar enough with the underlying code and machine learning in general to know whether that is what is needed or not.

@guillaumekln
Contributor

I registered for an Oracle Cloud account and tested on the same instance type that you used.

I did not reproduce your results on a 2 min audio file using the large-v2 model:

Implementation  Precision  Time
faster-whisper  fp32       3m26s
faster-whisper  int8       1m58s
whisper.cpp     fp16       4m01s

The time for whisper.cpp is consistent with your results, but not the times for faster-whisper.

My guess is that your audio file triggers the "temperature fallback", but the whisper.cpp commit you used (ggerganov/whisper.cpp@8e361d9) just disabled this mode by default. So you should also disable this mode with faster-whisper for the comparison:

model.transcribe(..., temperature=0)
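
To illustrate why the fallback can multiply the run time, here is a simplified sketch of the idea, not the actual faster-whisper code. It assumes a default temperature schedule and thresholds modeled on the Whisper reference defaults; `Decoded`, `decode_with_fallback`, and `fake_decode` are hypothetical names.

```python
from typing import Callable, NamedTuple, Sequence


class Decoded(NamedTuple):
    text: str
    avg_logprob: float
    compression_ratio: float


def decode_with_fallback(
    decode: Callable[[float], Decoded],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold: float = -1.0,
    compression_ratio_threshold: float = 2.4,
):
    """Decode at increasing temperatures until the result passes both checks."""
    attempts = 0
    result = None
    for temperature in temperatures:
        result = decode(temperature)
        attempts += 1
        too_repetitive = result.compression_ratio > compression_ratio_threshold
        too_unlikely = result.avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # good enough, no need to retry at a higher temperature
    return result, attempts


def fake_decode(temperature: float) -> Decoded:
    """Hypothetical segment that only decodes cleanly from temperature 0.4 up."""
    if temperature < 0.4:
        return Decoded("la la la la", avg_logprob=-0.3, compression_ratio=3.1)
    return Decoded("ok", avg_logprob=-0.3, compression_ratio=1.2)


result, attempts = decode_with_fallback(fake_decode)
print(attempts)
```

With the full schedule, this segment is decoded three times before a result is accepted; with `temperatures=(0.0,)` it is decoded once and the first result is kept regardless of quality.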

For reference, here are the reported compilation commands for whisper.cpp which include -mcpu=native:

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mcpu=native   -c ggml.c -o ggml.o           
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -mcpu=native -c whisper.cpp -o whisper.o     
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -mcpu=native examples/main/main.cpp examples/common.cpp ggml.o whisper.o -o main 

@FlippFuzz
Contributor Author

You are correct.
The difference in performance was caused by the temperature fallback.

The original file is 2m28s long.

Implementation  Commit   Quantization  Temperature  Time
faster-whisper  e44a8c7  fp32          default      10m36.149s
faster-whisper  e44a8c7  fp32          0            4m39.499s
faster-whisper  e44a8c7  int8          default      07m05.425s
faster-whisper  e44a8c7  int8          0            2m32.842s
WhisperCpp      8e361d9  fp16          0            04m58.193s

The result for int8 with temperature 0 is fantastic.
It's almost able to use the large model for real-time computation on this free instance.
If only there was a way to squeeze a little more performance. :)
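
"Almost real-time" can be made concrete as a real-time factor: processing time divided by audio length, where a value below 1.0 means faster than real time. A short sketch using the temperature 0 timings reported above; `to_seconds` and `real_time_factor` are hypothetical helper names.

```python
import re


def to_seconds(duration: str) -> float:
    """Convert a `time`-style duration such as "2m32.842s" to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match[1]) * 60 + float(match[2])


def real_time_factor(processing: str, audio: str) -> float:
    """Processing time divided by audio length; below 1.0 beats real time."""
    return to_seconds(processing) / to_seconds(audio)


audio_length = "2m28s"
for label, elapsed in [("fp32", "4m39.499s"), ("int8", "2m32.842s")]:
    print(f"{label}: {real_time_factor(elapsed, audio_length):.2f}x")
```

This gives roughly 1.89x for fp32 and 1.03x for int8, so int8 is about 3% short of real time on this instance.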

Looking at the translation for int8 and fp32, int8 is very slightly inferior to fp32, especially in terms of punctuation.
However, it's significantly faster.

fp16 would be nice to have because I would expect it to take roughly half the fp32 time, which would make it almost real-time too.
However, given that int8 is also pretty good, I guess it's not worth your time to implement it for just one particular CPU.
Thanks again!

@guillaumekln
Contributor

guillaumekln commented Mar 27, 2023

Thanks for the confirmation!

Based on the whisper.cpp results, there is indeed a possible 2x speedup with FP16 on this CPU.

Implementation          Precision  Time
whisper.cpp             fp32       8m47s
whisper.cpp (OpenBLAS)  fp32       7m44s
whisper.cpp             fp16       4m01s

(using the large-v2 model on a 2 min audio file)
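
As a quick check of the 2x figure, the ratios can be computed directly from the three whisper.cpp timings above:

```python
# Wall-clock times from the whisper.cpp table, converted to seconds.
timings = {
    "whisper.cpp fp32": 8 * 60 + 47,
    "whisper.cpp (OpenBLAS) fp32": 7 * 60 + 44,
    "whisper.cpp fp16": 4 * 60 + 1,
}

baseline = timings["whisper.cpp fp16"]
speedups = {name: seconds / baseline for name, seconds in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x the fp16 time")
```

The plain fp32 build takes about 2.19 times as long as fp16, and the OpenBLAS fp32 build about 1.93 times as long.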

@FlippFuzz
Contributor Author

I don't think there's anything else we can do here.

Are you OK if I create an enhancement request in ctranslate2 to support fp16 for Arm CPUs and close this off?
It is nice to have (because the free instances would perform really well with fp16), but I totally understand if you don't want to focus on this because it only applies to a small subset of CPUs.

@guillaumekln
Contributor

> Are you OK if I create an enhancement request in ctranslate2 to support fp16 for Arm CPUs and close this off?

Yes, please do that. Thanks!

@FlippFuzz
Contributor Author

Closing, since the enhancement request was created in OpenNMT/CTranslate2#1153.
