feat: add Whispering support #454

Draft
wants to merge 4 commits into
base: master

charles-zablit
Contributor

@charles-zablit charles-zablit commented Oct 14, 2022

This PR adds support for Whispering, a streaming transcription server based on OpenAI's Whisper.

Whispering's advantage over VOSK is that it supports detection and transcription of multiple languages.

The Whispering Transcription service uses WebSockets to communicate with the Whispering server.
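
A minimal sketch of what such a WebSocket client could look like, using the JDK's `java.net.http.WebSocket` (Java 11+). This is not the code from the PR: the class and method names are hypothetical, and it assumes the server accepts raw audio as binary messages and returns results as text messages.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only.
final class WhisperingSocketSketch {
    /** Joins the fragments of one text message; returns null until the last fragment arrives. */
    static final class TextAssembler {
        private final StringBuilder parts = new StringBuilder();

        String accept(CharSequence data, boolean last) {
            parts.append(data);
            if (!last) {
                return null;
            }
            String whole = parts.toString();
            parts.setLength(0);
            return whole;
        }
    }

    /** Opens the connection and forwards each complete text message to onResult. */
    static CompletableFuture<WebSocket> connect(URI server, Consumer<String> onResult) {
        TextAssembler assembler = new TextAssembler();
        WebSocket.Listener listener = new WebSocket.Listener() {
            @Override
            public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
                String whole = assembler.accept(data, last);
                if (whole != null) {
                    onResult.accept(whole);
                }
                ws.request(1); // ask for the next incoming frame
                return null;
            }
        };
        return HttpClient.newHttpClient().newWebSocketBuilder().buildAsync(server, listener);
    }

    /** Sends one chunk of raw audio as a single binary message (assumed framing). */
    static CompletableFuture<WebSocket> sendAudio(WebSocket ws, byte[] chunk) {
        return ws.sendBinary(ByteBuffer.wrap(chunk), true);
    }
}
```

A caller would do something like `connect(URI.create("ws://localhost:8000"), System.out::println)`, where the address is a placeholder rather than a documented default.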

This is still a WIP: we need to fix a sample rate incompatibility between Whispering and Jigasi.
For now, EXPECTED_AUDIO_LENGTH has to be set to 25600.
The line at https://github.com/shirayu/whispering/blob/256bf38b4d3d751e1eac8116f0f7da07e1b9652f/whispering/serve.py#L69 also has to be changed
to audio = np.frombuffer(message, dtype=np.int64)
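
To illustrate why the dtype matters, the same bytes decode to a different number of samples depending on the element width. The sketch below is not Jigasi or Whispering code. `FrameWidths` and `sampleCount` are hypothetical names, and the 25600-byte buffer only mirrors the number above (the thread does not say whether EXPECTED_AUDIO_LENGTH counts bytes or samples).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only.
final class FrameWidths {
    /** Number of whole samples in buf when each sample occupies bytesPerSample bytes. */
    static int sampleCount(ByteBuffer buf, int bytesPerSample) {
        return buf.remaining() / bytesPerSample;
    }

    public static void main(String[] args) {
        // Assumed message size in bytes; not taken from Jigasi.
        ByteBuffer message = ByteBuffer.allocate(25600).order(ByteOrder.LITTLE_ENDIAN);

        // The JDK's typed views expose the same effect directly.
        System.out.println("as int16:   " + message.asShortBuffer().remaining()); // 12800
        System.out.println("as float32: " + message.asFloatBuffer().remaining()); // 6400
        System.out.println("as int64:   " + message.asLongBuffer().remaining());  // 3200
    }
}
```

If sender and receiver disagree on the element width, the receiver sees a buffer of the wrong length and wrong values, which can look like a sample rate mismatch.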

@nikvaessen
Contributor

nikvaessen commented Oct 14, 2022

Some questions:

  1. Have you tested if real-time transcription is feasible?
  2. What model (tiny/base/etc) are you planning on running?
  3. On what GPU are you planning to run this?
  4. Any thoughts on serving transcriptions from one machine to multiple meetings?

@charles-zablit
Contributor Author

charles-zablit commented Oct 17, 2022

Have you tested if real-time transcription is feasible?

It works just as fast as VOSK. However, it only starts transcribing after the sentence ends; it does not produce partial results, which might make it look slow.

What model (tiny/base/etc) are you planning on running?

So far we have tested both medium and large, with very good performance.

On what GPU are you planning to run this?

We ran our tests on a t1-45 OVH VPS, so an NVIDIA Tesla V100.

Any thoughts on serving transcriptions from one machine to multiple meetings?

We have not tested that yet, but it seems that Whispering supports multiple connections.
GPU usage is around 30% on our OVH instance for 1 connection, so multiple connections should be doable.
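
As a rough back-of-the-envelope estimate, assuming GPU usage scales linearly with the number of connections (which has not been measured here) and ignoring GPU memory limits, the figure above translates to the following. `GpuBudget` and `maxConnections` are hypothetical names used only for this sketch.

```java
// Hypothetical helper, for illustration only.
final class GpuBudget {
    /**
     * Connections that fit in budgetPercent of the GPU if each one
     * costs perConnectionPercent (linear scaling is an assumption).
     */
    static int maxConnections(int perConnectionPercent, int budgetPercent) {
        if (perConnectionPercent <= 0) {
            throw new IllegalArgumentException("perConnectionPercent must be positive");
        }
        return budgetPercent / perConnectionPercent; // integer division rounds down
    }

    public static void main(String[] args) {
        System.out.println(maxConnections(30, 100)); // 3
        System.out.println(maxConnections(30, 80));  // 2, leaving 20% headroom
    }
}
```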

If you want, we plan on presenting our findings at today's Jitsi community call.

@codecov

codecov bot commented Oct 17, 2022

Codecov Report

Merging #454 (e645d49) into master (dda0721) will decrease coverage by 0.76%.
The diff coverage is 0.00%.

❗ Current head e645d49 differs from pull request most recent head d8f88ae. Consider uploading reports for the commit d8f88ae to get more accurate results

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #454      +/-   ##
============================================
- Coverage     23.15%   22.39%   -0.77%     
  Complexity      304      304              
============================================
  Files            69       70       +1     
  Lines          5812     6006     +194     
  Branches        790      804      +14     
============================================
- Hits           1346     1345       -1     
- Misses         4235     4430     +195     
  Partials        231      231              
| Impacted Files | Coverage Δ |
| --- | --- |
| ...rc/main/java/org/jitsi/jigasi/AbstractGateway.java | 68.60% <0.00%> (-11.13%) ⬇️ |
| .../java/org/jitsi/jigasi/AbstractGatewaySession.java | 63.49% <0.00%> (-4.31%) ⬇️ |
| src/main/java/org/jitsi/jigasi/JvbConference.java | 44.28% <0.00%> (-1.39%) ⬇️ |
| src/main/java/org/jitsi/jigasi/Main.java | 22.09% <0.00%> (-1.66%) ⬇️ |
| ...c/main/java/org/jitsi/jigasi/rest/HandlerImpl.java | 0.00% <0.00%> (ø) |
| ...jigasi/transcription/VoskTranscriptionService.java | 0.00% <0.00%> (ø) |
| .../transcription/WhisperingTranscriptionService.java | 0.00% <0.00%> (ø) |
| ...in/java/org/jitsi/jigasi/sounds/PlaybackQueue.java | 54.38% <0.00%> (-1.76%) ⬇️ |
| .../jitsi/jigasi/sounds/SoundNotificationManager.java | 29.62% <0.00%> (+0.41%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ctx.put("no_speech_threshold", 0.6);
ctx.put("buffer_threshold", 0.5);
ctx.put("vad_threshold", 0.5);
ctx.put("data_type", "float32");
Contributor

@nikvaessen nikvaessen Oct 17, 2022

If using a GPU, I expect it would be a bit faster with float16. But most of the time is spent waiting for audio, I guess.

Contributor Author

My bad, this was supposed to be int64 since, from my understanding, this is the audio format Jigasi sends.

I have implemented a conversion to float32 in shirayu/whispering#36, but I will suggest float16 for better performance.
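
For readers unfamiliar with this kind of conversion, here is a minimal sketch of turning integer PCM samples into float32. It does not reproduce shirayu/whispering#36. It assumes signed 16-bit little-endian input purely for illustration (the comment above suggests Jigasi's format may differ), and `PcmToFloat` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only.
final class PcmToFloat {
    /**
     * Decodes signed 16-bit little-endian PCM (an assumed input format)
     * into floats in the range [-1.0, 1.0). A trailing odd byte is ignored.
     */
    static float[] int16ToFloat32(byte[] pcm) {
        ByteBuffer in = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[pcm.length / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = in.getShort() / 32768f; // scale by the int16 magnitude
        }
        return out;
    }
}
```

Whichever width Jigasi actually sends, the same pattern applies: read each sample at its true width, then scale it into the floating-point range the model expects.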

@davidak

davidak commented Mar 22, 2023

@charles-zablit do you have a plan to finish this?

It would be a great feature, as I think Whisper is currently the best open source STT. I would like to use it for meeting notes.

@cryolite-ai

cryolite-ai commented Oct 23, 2023

Hi @charles-zablit @nikvaessen, just wondering what happened to this Whisper-related Jigasi integration (which is about a year old)?

Rummaging through the current codebase, I see a file called https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java
which appears to be connected to PR #491.

Although it's not mentioned in the README (which refers to Google Cloud, Vosk and LibreTranslate), there is now some recent code that links transcription to some sort of Whisper system. In contrast to what Charles was doing, the current code says 'a custom Whisper server', without any details, and there doesn't seem to be any documentation about what it is or how to set it up. Charles, over a year ago, was just about ready with something that would use Whispering (which is MIT licensed): https://github.com/shirayu/whispering/

Unfortunately, the PR now has conflicts, and the Whispering project has been archived by its original author given the availability of newer Whisper systems, e.g. whisper.cpp, which works with CPU inference as well as GPU.

Is there any chance we could still have the Whispering PR integrated, since it uses Whisper from an open service as opposed to whatever is now in the codebase? If we had an example, it might be possible to adapt it to one of the newer Whisper implementations available these days. I've also seen some scripts which, if given multiple channels, will do some rough diarisation so that the transcript incorporates multiple named speakers.

Many thanks for your work on all of this.

Best, M.

@damencho
Member

in the current code it says 'a custom Whisper server - without any details' and there doesn't seem to be any documentation about what/how to set it up...

Where do you see this?

@cryolite-ai

Link to source file was in my last post - here it is again:

Rummaging through the current codebase I see in a file called: https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java
which appears to be connected to PR #491

See line 27...

@rpurdel
Contributor

rpurdel commented Oct 25, 2023

Hi,

We are still in the very early stage with our own Whisper live transcription implementation. We plan to make it open-source in the not so distant future.

Cheers,
Razvan

@rpurdel
Contributor

rpurdel commented Feb 8, 2024

@charles-zablit @nikvaessen @damencho

The whisper live transcription server is now open source under the jitsi/skynet project. It should work out of the box with Jigasi.
