Whisper WebUI with a VAD for more accurate non-English transcripts (Japanese) #397
Replies: 29 comments 60 replies
-
Hey, that looks like a very nice project that you set up. I also encountered the issues in Whisper while trying to transcribe Japanese (incorrect timings, infinite loops) and wanted to try out your CLI with VAD, but I always encounter the following error:
I'm trying to execute your CLI script like this: Any idea what I'm doing wrong? The first thing that I see when I check the error message is that it says |
-
Hi, is there a reason to prefer the option |
-
Thank you, this seems promising. I tried using it and encountered the following error:
$ python3 ~/whisper-webui/cli.py --model large --device cuda:0 --task translate --language Japanese --vad silero-vad-skip-gaps ~/x/test.mkv
/home/user/.local/lib/python3.10/site-packages/torch/hub.py:266: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to {calling_fn}(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
warnings.warn(
Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /home/user/.cache/torch/hub/master.zip
Processing VAD in chunk from 00:00.000 to 01:00:00.000
/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1130: UserWarning: operator() profile_node %1178 : int[] = prim::profile_ivalue(%1176)
does not have profile information (Triggered internally at ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
return forward_call(*input, **kwargs)
/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1130: UserWarning: concrete shape for linear input & weight are required to decompose into matmul + bias (Triggered internally at ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:2076.)
return forward_call(*input, **kwargs)
Processing VAD in chunk from 01:00:00.000 to 01:32:35.000
It seems to be working despite the error, but the error persists even after re-running once the silero-vad model has been downloaded. Any pointers to fix this? The environment in question is WSL2 with CUDA.
-
@aadnk Thank you for this non-English improvement! |
-
It seems like Large-V2 in the most recent version of Whisper is a huge improvement when transcribing Japanese. I tested it on "Macross Frontier - the Movie" as above, and it no longer breaks after 8 minutes:
Large-V1 (transcribed at 2022-10-02) - no VAD:
Large-V2 (latest version at 2022-12-07):
There are still some timing issues after a period of silence, but using a VAD as a workaround may no longer be strictly necessary.
-
Firstly, thank you so much - this is what I want! But I have no coding skills. I am using Google Colab, and I wonder if there are any saving options other than Google Drive. I ask because I run Whisper while I sleep and let Google Colab do the work, so I don't want to lose anything. I hope that's clear - do you have any idea?
-
Your work is incredible, but as a beginner, I really don't know what went wrong. |
-
Hey bud, I absolutely love using your code to translate my Japanese shows. As of today I seem to be having some kind of error: the code executes, but I never get a subtitle file or transcript. It worked fine yesterday. Can you confirm? Thank you so much :)
-
Hello! I followed all the instructions and I can launch the webui, but when I click on "submit" after uploading the file I want to transcribe I get the following error:
Any idea how to fix it? Python 3.9.12 |
-
Hello, I have been trying to run your setup with an AMD GPU. The GPU is detected and the WebUI starts correctly:
My issue is that after filling everything in the UI and starting the Transcription, I get this error:
Complete logs from the python command:
root@sdm:/dockerx/whisper-webui# python app.py --input_audio_max_duration -1 --server_name 127.0.0.1 --auto_parallel True
[Auto parallel] Using GPU devices ['0'] and 8 CPU cores for VAD/transcription.
Running on local URL: http://127.0.0.1:7860
To create a public link, set
It seems to attempt to use CUDA, and I really don't understand why, as this worked for normal Whisper.
-
Hi, how can I transcribe English audio and translate it to another language? If I choose translate, it always outputs English.
-
Thank you so much for creating this and putting it together in such an easy-to-use package. |
-
Hello @aadnk, is there a way to use a fine-tuned model with your WebUI? Something like this - https://huggingface.co/clu-ling/whisper-large-v2-japanese-5k-steps
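Setting the WebUI aside for a moment, a fine-tuned checkpoint like the one linked can at least be tried on its own through the Hugging Face transformers pipeline. This is only a rough standalone sketch (it is not how the WebUI loads models), assuming the transformers and torch packages are installed; the audio file name is a placeholder.

# Standalone sketch: run a fine-tuned Whisper checkpoint from the Hugging Face Hub.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="clu-ling/whisper-large-v2-japanese-5k-steps",
    chunk_length_s=30,   # process long audio in 30-second windows
    device=0,            # GPU index, or -1 for CPU
)

result = pipe("audio.mp3", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])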
-
I want to ask this too.
…On Wed, 22 Mar 2023 at 06:37, fznx922 wrote:
Also, while on this topic, I was able to find this model https://huggingface.co/vumichien/whisper-large-v2-mix-jp which seems like it had been trained on more steps?
-
Can I use your new code for this? I mean the code where I can use any model...
-
Hello aadnk, great work on this project. I initially installed it on my computer running Windows 11, and it worked flawlessly. I also tested it on an older hardware setup with CentOS 7 and two K80 GPUs, and it performed admirably. I wanted to inquire about the diarization aspect of the project. Let me explain: Whisper is doing an excellent job at transcribing, and the VAD is efficiently assisting with synchronization. However, when two people speak simultaneously, the transcriptions sometimes become mixed. I came across another project that addresses this issue: https://github.com/MahmoudAshraf97/whisper-diarization/blob/main/diarize.py. Do you think it's possible to combine both projects? If so, when would be the optimal time to implement diarization? For instance, if I apply diarization after the VAD, the results may improve, but I won't be able to "colorize" different transcriptions throughout the entire clip or movie. Thanks in advance! |
-
Hey Aadnk, I've been using your version of faster-whisper and was wondering: since faster-whisper supports word-level timestamps, could that improve transcription quality for Japanese? Not sure if it's possible or if you have tested this already - it's just a thought. I was trying to go about testing it, but currently don't know where to start or whether it would conflict with your scripting. Again, thank you for all your hard work, mate. Love your variation on Whisper :)!
-
Could you add the --word_timestamps option? |
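For reference, a rough illustration of what a word-timestamps option maps to in the underlying faster-whisper Python API (a sketch only, not the WebUI's actual wiring; the model size and file name are placeholders):

# Sketch: word-level timestamps with faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", language="ja", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f} -> {word.end:.2f}] {word.word}")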
-
Win10, GPU, faster-whisper mode. When I use it for slightly longer audio with a long gap of no speech, without VAD, transcription sometimes gets stuck and just won't move anymore, or it gets stuck for a long time and then ends abruptly. The dialogue after the jam is not recognized, and the timeline of the output subtitles only goes up to the time of the jam. After getting stuck, nothing is output to the SRT file. For example, this mp3 file: https://cyberfile.me/6h1c, used with "medium/korean" and no VAD, gets stuck at a certain point almost every time. When I use the original Whisper to recognize the same audio, it sometimes gets stuck, but it continues after a while; I've never had a situation like the above.
-
When I use VAD, it sometimes gets stuck like this. But it doesn't happen every time: the same audio sometimes gets stuck and sometimes doesn't:
|
-
Today, when I use faster-whisper, it gets stuck like this. I tried to change the config, but it did not work: Traceback (most recent call last):
-
Hello, I absolutely love your project! I am currently encountering a problem: I want to generate video subtitles by pasting the URL of the video, but I get |
-
Hello! Since the GUI can write out the timestamps of when something is being spoken, is there a way for it to use these timestamps to cut out the sections of the audio file that contain speech and save them as a speech_only.wav audio file?
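As far as this thread shows, that isn't a built-in feature, but as a rough standalone sketch: if you already have the (start, end) speech timestamps in seconds, pydub can cut and concatenate those sections into a speech_only.wav (assumes pydub and ffmpeg are installed; file names and timestamps below are made up):

# Sketch: stitch the speech-only sections of an audio file into speech_only.wav.
from pydub import AudioSegment

speech_sections = [(0.5, 4.2), (7.8, 15.0), (21.3, 30.1)]  # (start, end) in seconds

audio = AudioSegment.from_file("input.mkv")
speech_only = AudioSegment.empty()
for start, end in speech_sections:
    speech_only += audio[int(start * 1000):int(end * 1000)]  # pydub slices in milliseconds

speech_only.export("speech_only.wav", format="wav")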
-
Hello! I forked the code from https://gitlab.com/aadnk/whisper-webui/-/tree/main and ran it locally on a Linux system. With a video link that could be parsed in the Hugging Face demo provided by the original author, the local service seemed to have no effect: the terminal had no output and the UI stayed stuck in the processing state. Have you ever encountered such a situation, or do you know how to handle it? Thanks
-
@aadnk Thanks for this great tool, I've noticed that when enabling the |
-
Hey Aadnk :) Any chance yet to see the difference in accuracy/performance between the v2 and v3 models in your application when transcribing Japanese? Thanks 👍
-
Would it be possible to implement Stable-TS? It's a wrapper around whisper/faster_whisper that has much better timing and grouping of subtitles than default Whisper. It also provides some nice helper functions to manipulate the transcription (changing timing, finding/replacing/removing characters/words), and it has its own functions to write SRT files and highlight words, similar to what you already have. I've been able to roughly mash your code with theirs by copying your code into my repo and editing fasterWhisperContainer.py to run stable-ts instead (a rough sketch of plain stable-ts usage is included at the end of this comment).
fasterWhisperContainer.py
Update
There are a few caveats I had to make:
|
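For anyone curious, here is a rough standalone sketch of what calling stable-ts directly can look like (this is not the fasterWhisperContainer.py edit described above, just the library's own entry points as I understand them; model size and file names are placeholders):

# Sketch: transcribe with stable-ts and write an SRT with its own writer.
import stable_whisper

model = stable_whisper.load_model("large-v2")
result = model.transcribe("audio.mkv", language="ja")

# stable-ts provides its own output writers, including optional word highlighting.
result.to_srt_vtt("audio.srt", word_level=False)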
-
Your Whisper on Google Colab has some error |
-
I've found Whisper to be an incredible free tool for transcribing audio, so I've made my own WebUI which integrates directly with YT-DLP for direct YouTube transcripts, and allows for easy downloads of a transcript or an SRT/VTT file. It also supports more accurate transcripts for languages other than English using a VAD.
There's also support for parallel execution on multiple GPUs, using the
--auto_parallel True
option (see the README for more information):
Installation instructions:
You can also use the CLI version, which is identical to the Whisper CLI except that you can also use URLs rather than file paths, and specify a VAD (more about this below). Also note that it's relatively easy to host this WebUI on Google Colab, if you don't have enough GPU horsepower locally to run it yourself.
I've also added support for Docker. You can even download the containers directly from GitLab (see the README for more information):
VAD
Using a VAD is necessary, as Whisper unfortunately suffers from a number of minor and major issues that are particularly apparent when transcribing non-English content - from producing incorrect text (wrong kanji) and incorrect timings (lagging) to getting into an infinite loop, outputting the same sentence over and over again.
Default Whisper
For instance, when I tried to transcribe the Japanese movie "Macross Frontier - the Movie", it got stuck after 00:01:46, endlessly outputting the lines "宇宙に向かう", "アスクワード", "マスコミネットの調査を進めるこの時点で":
I tried using an FFMPEG command to convert the 5-channel audio to better emphasize the center channel, which carries most of the dialog, but Whisper still got stuck after 00:08:05, endlessly outputting lines with only the number "2":
However, I was able to avoid some of these issues by manually splitting the original movie into 10-minute chunks, running Whisper on each chunk, and then merging the resulting transcripts together into one long transcript (SRT).
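As an aside on the center-channel step mentioned above, one way to emphasize the dialog channel of a 5.1 mix is an ffmpeg pan filter along these lines (a sketch, not necessarily the exact command that was used; file names are placeholders):

# Sketch: extract only the front-center (dialog) channel of a 5.1 mix before transcribing.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "movie.mkv",
    "-vn",                        # drop the video stream
    "-af", "pan=mono|c0=FC",      # keep only the front-center channel
    "-ar", "16000",               # Whisper works on 16 kHz audio anyway
    "center_only.wav",
], check=True)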
Using Silero VAD
I've been tinkering with my WebUI since the public release of Whisper, and I think I've found a solution using Silero VAD which dramatically improves the accuracy of both the text and timings of long transcripts in Japanese. Just take a look at the transcript for the Macross Frontier movie as an example:
There are still a few repeated lines, but these are hallucinations that occur during silent periods. Other than that, it's actually usable, as opposed to just running Whisper on the whole audio.
Essentially, this is done by detecting continuous sections of speech using Silero VAD, then (for performance reasons) merging sections into chunks of up to 30 seconds when the sections are 5 seconds or less apart. I also pass previously detected text as a prompt if it is close enough (the prompt window is up to 3 seconds by default). Next, I try to pad each chunk with about 1 second before and after, to ensure that Whisper is properly able to detect words at the beginning and end of each chunk. Finally, Whisper is run on each chunk and the output is automatically merged into one single transcript.
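To make the chunk-merging step a bit more concrete, here is a minimal sketch of the idea (an illustration only, not the project's actual code; the section timestamps are invented):

# Sketch: merge VAD speech sections into padded chunks for transcription.
MAX_GAP = 5.0     # merge sections that are 5 seconds or less apart
MAX_CHUNK = 30.0  # upper bound for a merged chunk, for performance
PADDING = 1.0     # extra context before/after each chunk

def merge_speech_sections(sections, audio_duration):
    if not sections:
        return []
    chunks = []
    current_start, current_end = sections[0]
    for start, end in sections[1:]:
        gap_is_small = (start - current_end) <= MAX_GAP
        still_fits = (end - current_start) <= MAX_CHUNK
        if gap_is_small and still_fits:
            current_end = end  # extend the current chunk
        else:
            chunks.append((current_start, current_end))
            current_start, current_end = start, end
    chunks.append((current_start, current_end))
    # Pad each chunk so Whisper sees the start/end of words at the borders.
    return [(max(0.0, s - PADDING), min(audio_duration, e + PADDING))
            for s, e in chunks]

# Example with invented VAD sections (seconds); each resulting chunk would then
# be transcribed separately and the transcripts merged into one SRT.
print(merge_speech_sections([(0.8, 6.2), (8.0, 14.5), (25.0, 31.0), (33.0, 38.5)], audio_duration=60.0))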
This is enough to mostly fix the issues with Japanese text, and I've even been able to run Whisper on 7+ hour videos with no major issues, for instance on this 07:21:20 video by Korone on YouTube:
You can view this transcript directly on YouTube using the addon Substital.
Downsides
The downside is that Whisper might be less accurate when transitioning between chunks, but in the case of Japanese this is more than worth the trade-off, given that by default Whisper is not able to handle more than a couple of minutes before running into the issues above. It's also potentially a bit slower.
For English content, however, this trade-off may not be worth it - it depends on the content. I tried using this method on a recent episode of Taskmaster (S14E01), but it didn't seem to improve the timings by much, and it introduced a few errors at the chunk borders (mishearing Dara Ó Briain, for instance). Still, it was not noticeably worse or better than regular Whisper.