Whisper WebUI with a VAD for more accurate non-English transcripts (Japanese) #397
Replies: 29 comments 60 replies
-
Hey, that looks like a very nice project that you set up. I also encountered the issues in Whisper while trying to transcribe Japanese (incorrect timings, infinite loops) and wanted to try out your CLI with VAD, but I always encounter the following error:
I'm trying to execute your CLI script like this: Any idea what I'm doing wrong? The first thing that I see when I check the error message is that it says |
-
Hi, is there a reason to prefer the option |
-
Thank you, this seems promising. I tried using it and encountered the following error:
$ python3 ~/whisper-webui/cli.py --model large --device cuda:0 --task translate --language Japanese --vad silero-vad-skip-gaps ~/x/test.mkv
/home/user/.local/lib/python3.10/site-packages/torch/hub.py:266: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to {calling_fn}(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
warnings.warn(
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /home/user/.cache/torch/hub/master.zip
Processing VAD in chunk from 00:00.000 to 01:00:00.000
/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1130: UserWarning: operator() profile_node %1178 : int[] = prim::profile_ivalue(%1176)
does not have profile information (Triggered internally at ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
return forward_call(*input, **kwargs)
/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1130: UserWarning: concrete shape for linear input & weight are required to decompose into matmul + bias (Triggered internally at ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:2076.)
return forward_call(*input, **kwargs)
Processing VAD in chunk from 01:00:00.000 to 01:32:35.000
It seems to be working despite the error, but the error persists even after re-running once the silero-vad model has been downloaded. Any pointers to fix this? The environment in question is WSL2 with CUDA.
-
@aadnk Thank you for this non-English improvement! |
-
It seems like Large-V2 in the most recent version of Whisper is a huge improvement when transcribing Japanese. I tested it on "Macross Frontier - the Movie" as above, and it no longer breaks after 8 minutes:
Large-V1 (transcribed at 2022-10-02) - no VAD:
Large-V2 (latest version at 2022-12-07):
There are still some timing issues after a period of silence, but using a VAD as a workaround may no longer be strictly necessary.
-
Firstly, thank you so much - this is what I want! But I have no coding skills. I am using Google Colab, and I wonder if there are any saving options other than Google Drive. I ask because I run Whisper while I sleep and let Google Colab do the work, so I don't want to lose anything. I hope that's clear - do you have any idea?
-
Your work is incredible, but as a beginner, I really don't know what went wrong. |
-
Hey bud, I absolutely love using your code to translate my Japanese shows. As of today I seem to be having some kind of error: the code executes, but I never get a subtitle file or transcript. It worked fine yesterday. Can you confirm? Thank you so much :)
-
Hello! I followed all the instructions and I can launch the webui, but when I click on "submit" after uploading the file I want to transcribe I get the following error:
Any idea how to fix it? Python 3.9.12 |
-
Hello, I have been trying to run your setup with an AMD GPU. The GPU is detected and the WebUI starts correctly:
My issue is that after filling everything in the UI and starting the Transcription, I get this error:
Complete logs from the python command:
root@sdm:/dockerx/whisper-webui# python app.py --input_audio_max_duration -1 --server_name 127.0.0.1 --auto_parallel True
[Auto parallel] Using GPU devices ['0'] and 8 CPU cores for VAD/transcription.
Running on local URL: http://127.0.0.1:7860
To create a public link, set
It seems to attempt to use CUDA, and I really don't understand why, as this worked for normal Whisper.
-
Hi, how can I transcribe English audio and translate it to another language? If I choose translate, it always outputs English.
-
Thank you so much for creating this and putting it together in such an easy-to-use package. |
-
Hello @aadnk, is there a way to use a fine-tuned model with your WebUI? Something like this - https://huggingface.co/clu-ling/whisper-large-v2-japanese-5k-steps
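Setting the WebUI aside for a moment, a fine-tuned checkpoint like the one linked can at least be tried on its own through the Hugging Face transformers pipeline. This is only a rough standalone sketch (it is not how the WebUI loads models), assuming the transformers and torch packages are installed; the audio file name is a placeholder.

# Standalone sketch: run a fine-tuned Whisper checkpoint from the Hugging Face Hub.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="clu-ling/whisper-large-v2-japanese-5k-steps",
    chunk_length_s=30,   # process long audio in 30-second windows
    device=0,            # GPU index, or -1 for CPU
)

result = pipe("audio.mp3", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])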
-
I want to ask this too.
…On Wed, 22 Mar 2023 at 06:37, fznx922 wrote:
Also, while on this topic, I was able to find this model https://huggingface.co/vumichien/whisper-large-v2-mix-jp which seems like it had been trained on more steps?
-
Can I use your new code for this? I mean the code where I can use any model...
-
Hello aadnk, great work on this project. I initially installed it on my computer running Windows 11, and it worked flawlessly. I also tested it on an older hardware setup with CentOS 7 and two K80 GPUs, and it performed admirably. I wanted to inquire about the diarization aspect of the project. Let me explain: Whisper is doing an excellent job at transcribing, and the VAD is efficiently assisting with synchronization. However, when two people speak simultaneously, the transcriptions sometimes become mixed. I came across another project that addresses this issue: https://github.com/MahmoudAshraf97/whisper-diarization/blob/main/diarize.py. Do you think it's possible to combine both projects? If so, when would be the optimal time to implement diarization? For instance, if I apply diarization after the VAD, the results may improve, but I won't be able to "colorize" different transcriptions throughout the entire clip or movie. Thanks in advance! |
-
Hey Aadnk, I've been using your version of faster-whisper and was wondering: since faster-whisper supports word-level timestamps, could that improve transcription quality for Japanese? Not sure if it's possible or if you have tested this already - it's just a thought. I was trying to go about testing it, but currently don't know where to start or whether it would conflict with your scripting. Again, thank you for all your hard work, mate. Love your variation on Whisper :)!
-
Could you add the --word_timestamps option? |
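For reference, a rough illustration of what a word-timestamps option maps to in the underlying faster-whisper Python API (a sketch only, not the WebUI's actual wiring; the model size and file name are placeholders):

# Sketch: word-level timestamps with faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", language="ja", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f} -> {word.end:.2f}] {word.word}")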
-
Win10, GPU, faster-whisper mode. When I use it for slightly longer audio with a long gap of no speech, without VAD, transcription sometimes gets stuck and just won't move anymore, or it gets stuck for a long time and then ends abruptly. The dialogue after the jam is not recognized, and the timeline of the output subtitles only goes up to the time of the jam. After getting stuck, nothing is output to the SRT file. For example, this mp3 file: https://cyberfile.me/6h1c, used with "medium/korean" and no VAD, gets stuck at a certain point almost every time. When I use the original Whisper to recognize the same audio, it sometimes gets stuck, but it continues after a while; I've never had a situation like the above.
-
When I use VAD, it sometimes gets stuck like this. But it doesn't happen every time: the same audio sometimes gets stuck and sometimes doesn't:
|
-
Today, when I use faster-whisper, it gets stuck like this. I tried to change the config, but it did not work: Traceback (most recent call last):
-
Hello, I absolutely love your project! I am currently encountering a problem: I want to generate video subtitles by pasting the URL of the video, but I get |
-
Hello! Since the GUI can write out the timestamps of when something is being spoken, is there a way for it to use these timestamps to cut out the sections of the audio file that contain speech and save them as a speech_only.wav audio file?
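As far as this thread shows, that isn't a built-in feature, but as a rough standalone sketch: if you already have the (start, end) speech timestamps in seconds, pydub can cut and concatenate those sections into a speech_only.wav (assumes pydub and ffmpeg are installed; file names and timestamps below are made up):

# Sketch: stitch the speech-only sections of an audio file into speech_only.wav.
from pydub import AudioSegment

speech_sections = [(0.5, 4.2), (7.8, 15.0), (21.3, 30.1)]  # (start, end) in seconds

audio = AudioSegment.from_file("input.mkv")
speech_only = AudioSegment.empty()
for start, end in speech_sections:
    speech_only += audio[int(start * 1000):int(end * 1000)]  # pydub slices in milliseconds

speech_only.export("speech_only.wav", format="wav")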
-
Hello! I forked the code from https://gitlab.com/aadnk/whisper-webui/-/tree/main and ran it locally on a Linux system. With a video link that could be parsed in the Hugging Face demo provided by the original author, the local service seemed to have no effect: the terminal had no output and the UI stayed stuck in the processing state. Have you ever encountered such a situation, or do you know how to handle it? Thanks
-
@aadnk Thanks for this great tool, I've noticed that when enabling the |
-
Hey Aadnk :) Any chance yet to see the difference in accuracy/performance between the v2 and v3 models in your application when transcribing Japanese? Thanks 👍
-
Would it be possible to implement Stable-TS? It's a wrapper around whisper/faster_whisper that has much better timing and grouping of subtitles than default Whisper. It also provides some nice helper functions to manipulate the transcription (changing timing, finding/replacing/removing characters/words), and it has its own functions to write SRT files and highlight words, similar to what you already have. I've been able to roughly mash your code with theirs by copying your code into my repo and editing fasterWhisperContainer.py to run stable-ts instead (a rough sketch of plain stable-ts usage is included at the end of this comment).
fasterWhisperContainer.py
Update
There are a few caveats I had to make:
|
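For anyone curious, here is a rough standalone sketch of what calling stable-ts directly can look like (this is not the fasterWhisperContainer.py edit described above, just the library's own entry points as I understand them; model size and file names are placeholders):

# Sketch: transcribe with stable-ts and write an SRT with its own writer.
import stable_whisper

model = stable_whisper.load_model("large-v2")
result = model.transcribe("audio.mkv", language="ja")

# stable-ts provides its own output writers, including optional word highlighting.
result.to_srt_vtt("audio.srt", word_level=False)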
-
Your Whisper on Google Colab has some error |
-
I've found Whisper to be an incredible free tool for transcribing audio, so I've made my own WebUI which integrates directly with YT-DLP for direct YouTube transcripts, and allows for easy downloads of a transcript or an SRT/VTT file. It also supports more accurate transcripts for languages other than English using a VAD.
There's also support for parallel execution on multiple GPUs, using the
--auto_parallel True
option (see the README for more information):
Installation instructions:
You can also use the CLI version, which is identical to the Whisper CLI except that you can also use URLs rather than file paths, and specify a VAD (more about this below). Also note that it's relatively easy to host this WebUI on Google Colab, if you don't have enough GPU horsepower locally to run it yourself.
I've also added support for Docker. You can even download the containers directly from GitLab (see the README for more information):
VAD
Using a VAD is necessary, as Whisper unfortunately suffers from a number of minor and major issues that are particularly apparent when transcribing non-English content - from producing incorrect text (wrong kanji) and incorrect timings (lagging) to getting into an infinite loop, outputting the same sentence over and over again.
Default Whisper
For instance, when I tried to transcribe the Japanese movie "Macross Frontier - the Movie", it got stuck after 00:01:46, endlessly outputting the lines "宇宙に向かう", "アスクワード", "マスコミネットの調査を進めるこの時点で":
I tried using an FFMPEG command to convert the 5-channel audio to better emphasize the center channel, which carries most of the dialog, but Whisper still got stuck after 00:08:05, endlessly outputting lines with only the number "2":
However, I was able to avoid some of these issues by manually splitting the original movie into 10-minute chunks, running Whisper on each chunk, and then merging the resulting transcripts together into one long transcript (SRT).
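As an aside on the center-channel step mentioned above, one way to emphasize the dialog channel of a 5.1 mix is an ffmpeg pan filter along these lines (a sketch, not necessarily the exact command that was used; file names are placeholders):

# Sketch: extract only the front-center (dialog) channel of a 5.1 mix before transcribing.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "movie.mkv",
    "-vn",                        # drop the video stream
    "-af", "pan=mono|c0=FC",      # keep only the front-center channel
    "-ar", "16000",               # Whisper works on 16 kHz audio anyway
    "center_only.wav",
], check=True)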
Using Silero VAD
I've been tinkering with my WebUI since the public release of Whisper, and I think I've found a solution using Silero VAD which dramatically improves the accuracy of both the text and timings of long transcripts in Japanese. Just take a look at the transcript for the Macross Frontier movie as an example:
There are still a few repeated lines, but these are hallucinations that occur during silent periods. Other than that, it's actually usable, as opposed to just running Whisper on the whole audio.
Essentially, this is done by detecting continuous sections of speech using Silero VAD, then (for performance reasons) merging sections into chunks of up to 30 seconds when the sections are 5 seconds or less apart. I also pass previously detected text as a prompt if it is close enough (the prompt window is up to 3 seconds by default). Next, I try to pad each chunk with about 1 second before and after, to ensure that Whisper is properly able to detect words at the beginning and end of each chunk. Finally, Whisper is run on each chunk and the output is automatically merged into one single transcript.
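To make the chunk-merging step a bit more concrete, here is a minimal sketch of the idea (an illustration only, not the project's actual code; the section timestamps are invented):

# Sketch: merge VAD speech sections into padded chunks for transcription.
MAX_GAP = 5.0     # merge sections that are 5 seconds or less apart
MAX_CHUNK = 30.0  # upper bound for a merged chunk, for performance
PADDING = 1.0     # extra context before/after each chunk

def merge_speech_sections(sections, audio_duration):
    if not sections:
        return []
    chunks = []
    current_start, current_end = sections[0]
    for start, end in sections[1:]:
        gap_is_small = (start - current_end) <= MAX_GAP
        still_fits = (end - current_start) <= MAX_CHUNK
        if gap_is_small and still_fits:
            current_end = end  # extend the current chunk
        else:
            chunks.append((current_start, current_end))
            current_start, current_end = start, end
    chunks.append((current_start, current_end))
    # Pad each chunk so Whisper sees the start/end of words at the borders.
    return [(max(0.0, s - PADDING), min(audio_duration, e + PADDING))
            for s, e in chunks]

# Example with invented VAD sections (seconds); each resulting chunk would then
# be transcribed separately and the transcripts merged into one SRT.
print(merge_speech_sections([(0.8, 6.2), (8.0, 14.5), (25.0, 31.0), (33.0, 38.5)], audio_duration=60.0))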
This is enough to mostly fix the issues with Japanese text, and I've even been able to run Whisper on 7+ hour videos with no major issues, for instance on this 07:21:20 video by Korone on YouTube:
You can view this transcript directly on YouTube using the addon Substital.
Downsides
The downside is that Whisper might be less accurate when transitioning between chunks, but in the case of Japanese this is more than worth the trade-off, given that by default Whisper is not able to handle more than a couple of minutes before running into the issues above. It's also potentially a bit slower.
For English content, however, this trade-off may not be worth it - it depends on the content. I tried using this method on a recent episode of Taskmaster (S14E01), but it didn't seem to improve the timings by much, and it introduced a few errors at the chunk borders (mishearing Dara Ó Briain, for instance). Still, it was not noticeably worse or better than regular Whisper.