PoC:clean-room implementation of real-time AI subtitle for English online-TV(OTT TV) #64
I have been integrating the excellent and amazing whisper.cpp into project KanTV since March 5, 2024, after v1.2.9 was released on March 4, 2024. Before that, starting Feb 22, 2024, I spent about two weeks migrating some local personal projects to GitHub.
Background study:
GGML is a C library for machine learning (ML): https://github.com/rustformers/llm/blob/main/crates/ggml/README.md
Roadmap and FAQ: ggerganov/whisper.cpp#126
Android example app: ggerganov/whisper.cpp#283
whisper.cpp should support NNAPI on Android: ggerganov/whisper.cpp#1249
Android Inference is too slow: ggerganov/whisper.cpp#1070
Use Android NNAPI to accelerate inference on Android Devices: ggerganov/ggml#88
NPU support in whisper.cpp: ggerganov/whisper.cpp#1557
Support for realtime audio input: ggerganov/whisper.cpp#10
silence removal for transcription implemented: ggerganov/whisper.cpp#1649
Can real-time transcription be achieved?: ggerganov/whisper.cpp#1653
How to increase speech to text speed when using whisper cpp?: ggerganov/whisper.cpp#1635
Benchmark results: ggerganov/whisper.cpp#89
Whisper model files in custom ggml format: https://github.com/ggerganov/whisper.cpp/blob/master/models/README.md
GGUF file format specification:
Tencent ncnn: https://github.com/Tencent/ncnn
Updated on 03-13-2024:
SIGFPE on certain audio files:
Real-time identification of microphone has no result:
How to handle real-time sound streams:
Updated on 03-20-2024: here are some strategies from the original author to reduce repetition and hallucinations: |
Integrating whisper.cpp into KanTV, step 1: migrate the original Android sample from the official whisper.cpp to KanTV and study it along the way. How to try out this branch: adb logcat | grep KANTV |
move to #62 to avoid misunderstanding |
After studying the source code in examples/bench/bench.cpp and examples/main/main.cpp, I suddenly had an idea for implementing PoC stage 1.
If it works as expected, I'll move on to PoC stage 2 (whisper.cpp inference with a pre-loaded audio file by another method, based on the original Android sample and examples/main/main.cpp).
If that works as expected, I'll merge the previous work into the master branch and create a new branch/baseline accordingly. Then I'll move on to PoC stage 3 (ASR with a live stream, i.e. online English TV). PoC stage 3 will be a real challenge for me, so I will break it down into PoC-S31, PoC-S32, PoC-S33, and so on.
If that works as expected, I'll move on to PoC stage 4 (performance analysis and optimization on an Android phone).
PoC stage 3 and PoC stage 4 might take place simultaneously. The final goal is to implement real-time English subtitles for English online TV with KanTV plus a customized whisper.cpp, and I'll demo it on a Xiaomi 14 (because the Xiaomi 14 contains a very powerful mobile SoC and I personally purchased one for software development). Of course, the source code of the customized whisper.cpp will be available in this Android turn-key project. If it is considered good enough and accepted by upstream whisper.cpp, I'll submit a PR accordingly. |
It works as expected (PoC stage 1 is finished). |
How's the performance? Can this be used for real-time transcription on a reasonably old device? |
The whisper.cpp/GGML mulmat (matrix multiplication) benchmark does not perform well on low-end Android phones. Performance on iOS is very good because the original author of whisper.cpp/GGML (the great Georgi Gerganov) spent a lot of time optimizing them with Apple's dedicated machine learning library (similar to SSE2/SSE3/AVX optimization on the x86 architecture). I think whisper.cpp could be used for real-time subtitles with online TV on a high-end Android phone (such as the Xiaomi 14; I will demo it after finishing this PoC successfully), and that is the goal of this PoC (this issue). whisper.cpp/GGML might not be suitable for old devices, because heavy math computation needs either a powerful SoC or highly optimized code (like what Georgi did for the iOS/Mac platforms). @liam-mceneaney has provided an Android example that demonstrates transcription. BTW, the following link from GGML's official website would be helpful for more information: |
It works as expected (PoC stage 2 is finished). The screenshots above cannot convey how exciting the progress in this commit is. I consider it a big milestone, and I'd like to express my sincere thanks to the great whisper.cpp/GGML once again: the more I understand whisper.cpp, the more I feel we should all be thankful for it. Or build the APK from source code (branch kantv-poc-with-whispercpp) with the Android Studio IDE. |
ASR/transcription performance on the Xiaomi 14 is about 5x-20x better than on other Android phones (low-end phones from vivo and from Honor, formerly part of Huawei and now a standalone company), but it is still not enough for real-time subtitles with online TV. Enabling OpenBLAS improves transcription time by about 1-3 seconds (depending on OS load, process scheduling, and so on). The mulmat benchmark also improves significantly with OpenBLAS enabled. So I guess Apple's dedicated machine learning acceleration library might be very important for performance on iOS/Mac, just as Georgi Gerganov said before, and we might want to study Qualcomm's dedicated/proprietary machine learning acceleration library accordingly. |
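Whether a given transcription time is "enough for real-time" depends on how much audio it covers. A minimal sketch of the usual real-time-factor check follows; the function names are hypothetical and not part of whisper.cpp or KanTV, and the thread does not state the audio durations behind its timings, so any numbers passed in are assumptions.

```cpp
#include <cassert>

// Real-time factor (RTF): seconds of compute per second of audio.
// Both arguments are in seconds; audio_seconds must be positive.
double real_time_factor(double processing_seconds, double audio_seconds) {
    assert(audio_seconds > 0.0);
    return processing_seconds / audio_seconds;
}

// A transcriber can only keep pace with a live stream if it finishes
// each chunk before another chunk of the same length has arrived.
bool keeps_up_with_live_audio(double processing_seconds, double audio_seconds) {
    return real_time_factor(processing_seconds, audio_seconds) < 1.0;
}
```

An RTF below 1.0 is necessary but not sufficient for usable subtitles: the subtitle still appears at least one chunk length plus the processing time after the words were spoken, so shorter chunks and lower absolute latency matter as well.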
Updated on 03-10-2024 (2024-03-10, 23:41 Beijing Time / GMT+8): from 21 seconds to 3 seconds, thanks to the powerful Xiaomi 14, or rather Qualcomm's Snapdragon 8 Gen 3, and to the powerful modern compiler from Google. I'd like to say it once again: we should all be thankful for the great GGML; the open-source C/C++ whisper.cpp and llama.cpp have really changed our world. I think I got the point, although ASR performance is still not enough for real-time subtitles with online TV. We should make maximum use of the AI engine in Qualcomm's Snapdragon 8 Gen 3 to improve ASR performance further. |
Updated on 03-11-2024 (2024-03-11, 10:40 Beijing Time / GMT+8): less than 2 seconds for the first time. The commit can be found here. This is exactly one item in the breakdown task list. It is 2024 (not 1994), and we should trust the powerful modern compilers from Google and Linaro, built by some of the top talents on our planet. |
The next step is the coding work for PoC-S33. ASR performance is still not enough for real-time subtitles (the AI engine in Qualcomm's mobile SoC is not utilized yet), but it has improved a lot. Sincere thanks to @liam-mceneaney for the key code snippet showing how to transcribe a single piece of audio data with whisper.cpp:
or to the original author of whisper.cpp, @ggerganov:
|
A clarification of why I have said many times that we programmers should all be thankful for the great GGML:
|
I will submit the new optimization method (it only works on the Xiaomi 14) in the next commit. I don't understand why performance with 4 threads is about 2x better than with 8 threads under the same optimization method (I expected 8 threads to be about 2x faster than 4). What happens between 4 threads and 8 threads? What are the details?
Updated on 03-12-2024 (2024-03-12, 21:01 Beijing Time / GMT+8): with the new optimization method, ASR takes less than 1 second. The root cause is the highly elegant, handcrafted C/C++ implementation of whisper.cpp; of course, Google's NDK r26c is powerful, and so is the Xiaomi 14 with its Qualcomm SM8650-AB Snapdragon 8 Gen 3. The method for the Xiaomi 14 can be found in this file or this commit. I think I have finished the coding work of PoC-S33 (the third step in PoC stage 3), which covers the data path: UI <----> JNI <----> whisper.cpp <----> kantv-play <----> kantv-core. Some snapshots of the PoC-S33 demo:
Or build the APK manually from source code with this branch. The generated APK should work fine on any mainstream Android phone, because the special optimization for the Xiaomi 14 is disabled by default (it can be enabled manually in this file). We are getting closer and closer to the final goal of this PoC. Once again, I'd like to express my sincere thanks to the great whisper.cpp, which is really helpful for C/C++ programmers who know very little about AI technology. |
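On the 4-thread versus 8-thread question raised above, one way to investigate is to time the same workload at several thread counts on the target phone. The sketch below is a generic harness, not the whisper.cpp benchmark: `parallel_sum` and `time_ms` are hypothetical names, and the summation is only a stand-in workload. One possible explanation, which this thread does not confirm, is that the Snapdragon 8 Gen 3 mixes faster and slower CPU cores, so running more threads than there are fast cores can leave every synchronization point waiting on the slowest core.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` with `n_threads` workers, each owning one contiguous slice.
double parallel_sum(const std::vector<float>& data, unsigned n_threads) {
    assert(n_threads > 0);
    std::vector<double> partial(n_threads, 0.0);
    std::vector<std::thread> workers;
    // Slice length, rounded up so every element is covered.
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin =
                std::min(data.size(), static_cast<std::size_t>(t) * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            double acc = 0.0;
            for (std::size_t i = begin; i < end; ++i) acc += data[i];
            partial[t] = acc;  // each worker writes only its own slot
        });
    }
    // The call finishes only when the slowest worker has finished.
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Wall-clock time of one call to `fn`, in milliseconds.
double time_ms(const std::function<void()>& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Wrapping `parallel_sum(data, 4)` and `parallel_sum(data, 8)` in `time_ms` and repeating each measurement several times would show whether the slowdown is specific to whisper.cpp or appears for any evenly split workload on the device.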
Updated on 03-15-2024 (2024-03-15, 11:59 Beijing Time / GMT+8): since 03-05-2024 I have spent about 10 days (10+ hours a day, self-motivated) to achieve the following goal (along with many other minor improvements to this project). It does NOT work perfectly yet, but it is getting closer and closer to the final goal of this PoC. https://github.com/cdeos/kantv/blob/kantv-poc-with-whispercpp/external/whispercpp/whisper.h#L620 By my initial estimate, it should have been finished in ONE WEEK or less (because I spent many days investigating online TV recording and implemented it fully at the end of 2023, so I could reuse much of that code for this PoC). I'm sorry for this. The reason for the delay:
Updated on 03-22-2024, 19:26: anyway, I paid the price, and I really have NO negative thoughts about my great country, because I think I'm familiar with the history of the Ming dynasty and I know that running a large and complex country is NOT easy. BUT, at the same time, I respect the facts: as an ordinary programmer, I spent about RMB 10,000 (USD 1,500-1,600) to fix network issues caused by the GFW. So I will NOT delete the sentence above. |
Updated on 03-16-2024 (2024-03-16, 13:28 Beijing Time / GMT+8): finally, I did it (although it is NOT truly "real-time" subtitling yet and bug fixes are still required) after solving a technical problem in the source code of the customized whisper.cpp. Parts of the latest source code can be found at (the master branch is preferred for R&D activity): https://github.com/cdeos/kantv/blob/kantv-poc-with-whispercpp/external/whispercpp/whisper.h#L620
This PoC currently only works on the Xiaomi 14. The reason is that the Xiaomi 14 contains a very powerful mobile SoC (Qualcomm SM8650-AB Snapdragon 8 Gen 3, 4 nm) and the special build optimization method only works on the Xiaomi 14. Unexpected behavior such as ANR (Application Not Responding) or app crashes may happen on Android phones driven by other, lower-end Qualcomm mobile SoCs.
kantv-realtime-subtitle-demo-with-whispercpp.mp4
The ASR benchmark on the Xiaomi 14 with the special build optimization takes less than 1 second (about 700-800 milliseconds), but it takes about 5-7 seconds in a complicated real scenario where TV playback, TV recording and ASR audio recording all run at the same time. @ggerganov, the hardware AI engine in the Snapdragon 8 Gen 3 could be utilized in GGML for truly "real-time" English subtitles. I don't know why, but today the network is stable and Google is available, and Google search was really helpful for this breakthrough. Anyway, thanks so much. BTW, miniwav (found via Google) was also really helpful during troubleshooting. @mhroth, thanks a lot. Lastly, I'd like to express my sincere thanks to the great open-source AI project whisper.cpp once again: without the strength and power of the excellent and amazing whisper.cpp, the scenario above in this PoC/project could not have been achieved. |
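For the live-stream case discussed in this thread, audio arrives continuously while whisper.cpp's `whisper_full` consumes a finite buffer of 32-bit float samples. A minimal sketch of one way to bridge the two is given below. `AudioWindowBuffer` and `pcm16_to_float` are hypothetical helpers, not code from KanTV or whisper.cpp, and the assumption that the player delivers signed 16-bit mono PCM is not confirmed by the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert signed 16-bit PCM samples to floats in [-1, 1).
std::vector<float> pcm16_to_float(const std::vector<int16_t>& pcm) {
    std::vector<float> out;
    out.reserve(pcm.size());
    for (int16_t s : pcm) out.push_back(static_cast<float>(s) / 32768.0f);
    return out;
}

// Hypothetical helper: collects live samples and hands out fixed-size,
// non-overlapping windows that an ASR engine can process one at a time.
class AudioWindowBuffer {
public:
    explicit AudioWindowBuffer(std::size_t window_samples)
        : window_samples_(window_samples) {
        assert(window_samples > 0);
    }

    // Append newly decoded PCM from the player.
    void push(const std::vector<int16_t>& pcm) {
        const std::vector<float> f = pcm16_to_float(pcm);
        pending_.insert(pending_.end(), f.begin(), f.end());
    }

    // Returns true and fills `window` once a full window has accumulated;
    // returns false (leaving `window` untouched) otherwise.
    bool pop_window(std::vector<float>& window) {
        if (pending_.size() < window_samples_) return false;
        const auto n = static_cast<std::ptrdiff_t>(window_samples_);
        window.assign(pending_.begin(), pending_.begin() + n);
        pending_.erase(pending_.begin(), pending_.begin() + n);
        return true;
    }

private:
    std::size_t window_samples_;
    std::vector<float> pending_;
};
```

With 16 kHz audio, which is the sample rate whisper.cpp expects, a window of 16000 * 2 samples would hold two seconds of speech. The window length trades latency against context, and the right value for this PoC is not stated in the thread.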
Updated on 03-16-2024, 21:19: got better whisper.cpp inference performance (from 6 seconds to 0.7 seconds) in a complicated real scenario, where online TV playback, online TV transcription and online TV audio recording all run at the same time, after fine-tuning for the Xiaomi 14 (the commit can be found here).
Updated on 03-16-2024, 22:36 (Beijing Time / GMT+8): here is a video of whisper.cpp running on a Xiaomi 14, fully offline and on-device (no client-server).
realtime-subtitle-by-whispercpp-demo-on-xiaomi14.mp4
Updated on 03-17-2024, 11:19 (Beijing Time / GMT+8), FYI: parts of the latest source code of this PoC can be found at: https://github.com/cdeos/kantv/blob/kantv-poc-with-whispercpp/external/whispercpp/whisper.h#L620 The final outcome of this PoC can be found in the kantv-poc-with-whispercpp branch. Since 03-17-2024, the master branch is preferred for R&D activity by AI experts and programmers. BTW:
Roadmap:
|
Switched to Project Whispercpp-Android successfully, according to the roadmap, after finishing PoC #64. This is the new baseline for the new Project KanTV (aka Project Whispercpp-Android).
Congratulations to you |
😄 Thanks. Have fun with the great whisper.cpp (backed by the great OpenAI). This is truly amazing AI technology brought to us by the great genius programmer @ggerganov. |
whisper.cpp is a powerful open-source on-device AI framework/library/model for ASR (Automatic Speech Recognition, a subfield of AI).
I want to integrate the great and powerful whisper.cpp into KanTV to provide real-time English subtitles for English online TV on the Xiaomi 14.
The result should look like the following snapshots, produced by the Xiaomi 14's powerful proprietary 6B on-device AI model (aka XiaoAI, or "小爱" in Chinese) plus the Xiaomi 14's powerful mobile SoC, the Qualcomm SM8650-AB Snapdragon 8 Gen 3 (4 nm).