-
Hi @jianfch, I just saw this post. Very good idea indeed. It's a pity that I am not a script programmer; I would have loved to contribute, as I think this is a very useful tool for post-processing / fine-tuning too. Just an idea, not exactly related to this thread: such a tool could be quite useful for post-processing the output of some "TikTok-like" transcription tools, e.g. the output of the transcription function in ByteDance's CapCut. That tool produces TikTok-like short-gap transcriptions that make little sense for languages like Japanese, because the context is missing. I was wondering if your regrouping approach could be used to improve the output of tools like CapCut. Just a wild idea :)
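If such output could be loaded into a stable-ts result somehow, the merging side of the regrouping API seems like the natural fit. A minimal sketch of the idea; the thresholds are pure guesses to tune per source:

```python
import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('clip.mp3', regroup=False)

# Join fragments separated by less than ~0.15 s, then re-split at
# sentence-ending punctuation so the context is restored per sentence.
(
    result
    .merge_by_gap(.15, max_words=10)
    .split_by_punctuation(['。', '?', '!'])
)
```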
-
In our use case, natural segmentation was crucial to reduce the manpower needed in subsequent steps. We chose to use NLTK for POS tagging, then a tiny BiLSTM model to generate cut indices.

Model: https://huggingface.co/metricv/metricsubs-segmenter/tree/main

The inference workflow then basically looked like this:

```python
import stable_whisper

def transcribe_audio_stablewhisper_segmenter(path, segmenter_model):
    model = stable_whisper.load_model("large-v2")
    result = model.transcribe(audio=path, verbose=True, regroup=False, word_timestamps=True, language="en")
    # get_indicies_autoembed is our helper that runs the segmenter model
    # and returns the cut indices for each segment
    result._split_segments(get_indicies_autoembed, args=[segmenter_model, "cuda", 0.5])
    result_regrouped = (
        result
        .split_by_gap(.5)
    )
```

This basically replaces the default regrouping. It wasn't able to identify gaps though, as time information was not fed in, so it is followed by a `split_by_gap`. It is designed only to work as the first step in splitting and regrouping. The model was trained to a quite usable state with three pieces of text, each around 1700 tokens.
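For anyone wanting to try something similar without the trained model, here is a minimal sketch of a `_split_segments` callback, assuming (as the call above suggests) that the callback receives a segment plus the extra `args` and returns word indices to split at. The POS rule is a made-up stand-in for the BiLSTM, not the actual `get_indicies_autoembed`:

```python
import nltk  # requires the 'averaged_perceptron_tagger' data to be downloaded

def get_split_indices_pos(segment, min_words=3):
    # Hypothetical stand-in: cut at coordinating conjunctions ('and',
    # 'but', ...), keeping at least min_words words on each side.
    words = [w.word.strip() for w in segment.words]
    tags = nltk.pos_tag(words)
    return [i for i, (_, tag) in enumerate(tags)
            if tag == 'CC' and min_words <= i <= len(words) - min_words]

# usage mirrors the call above:
# result._split_segments(get_split_indices_pos, args=[3])
```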
-
Hi @Metric-Void, thanks for sharing this. I tried to use the model in this way:
I get an error: config.json missing. Where do I go wrong?
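One possible cause, offered only as a guess: transformers' `from_pretrained` requires a `config.json` in the repo, so if the repo ships a raw checkpoint rather than a transformers-format model, the file has to be downloaded directly. A sketch, where the file name is a placeholder to replace with whatever the repo actually lists:

```python
import torch
from huggingface_hub import hf_hub_download

# "model.pt" is a placeholder; use the actual file name shown in the repo
ckpt_path = hf_hub_download(repo_id="metricv/metricsubs-segmenter",
                            filename="model.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
# the state dict can then be loaded into the matching BiLSTM architecture
```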
-
Not all transcripts have them, but for some, whisper returns words that are capitalized, either because they are nouns or because whisper guesses from the audio that a sentence starts there. Here's a little code that exploits that for grouping. Ironically enough, stopping on nouns is ok for my use case. But ideally someone with knowledge of nltk would expand this to (optionally) exclude nouns.
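A minimal sketch of that idea, assuming `_split_segments` takes a callback that receives a segment and returns word indices to split at (matching the call shown earlier in this thread):

```python
def get_capital_indices(segment):
    # Split wherever a word starts with an uppercase letter: a likely
    # sentence start, or a noun (acceptable for the use case above).
    return [i for i, word in enumerate(segment.words)
            if i > 0 and word.word.strip()[:1].isupper()]

# result._split_segments(get_capital_indices)
```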
-
I'm trying to replicate something similar to:
from Whisper-Standalone. Is this a step in the right direction for the 42-characters-per-2-lines rule?
Eventually I want to edit the default with the rules above. Something like:
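A sketch of one way to approach it, assuming stable-ts's `split_by_length` takes a `max_chars` argument and that the splitting methods support a `newline` flag for inserting a line break instead of starting a new segment (worth verifying against the docs); the 42/84 limits come from the convention itself:

```python
import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3', regroup=False)
(
    result
    .split_by_gap(.5)
    .split_by_length(max_chars=84)                 # at most two 42-char lines per event
    .split_by_length(max_chars=42, newline=True)   # break each event into lines
)
result.to_srt_vtt('audio.srt')
```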
-
Has anyone found "optimum" regrouping settings for Japanese transcriptions? I have so far played with adjusting the values in the default settings. I'm not good at this, so I try to sail close to shore :)
I'm keen to hear your thoughts and experiences.
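Not an answer, but a starting point to tune from: a sketch that keeps the structure of the default chain and swaps in Japanese punctuation. Every threshold here is a guess:

```python
import stable_whisper

model = stable_whisper.load_model('large-v2')
result = model.transcribe('audio.mp3', language='ja', regroup=False)
(
    result
    .split_by_punctuation(['。', '?', '!'])
    .split_by_gap(.5)
    .merge_by_gap(.15, max_words=3)
    .split_by_length(max_chars=24)   # Japanese subtitle lines are usually kept short
)
```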
-
Can someone help me with merging punctuation by '/' (slash)? E.g. for "others own rehearsal spaces/studios", I want "spaces/studios" to stay on one subtitle line, not be split across different lines. With the settings I used, I get my .srt file like this:
I think the code is treating my '/' like whitespace
and thus splitting at it. Can someone help me with this?
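A guess at a fix: if the split really does land around the slash, `merge_by_punctuation` may be able to rejoin those segments, assuming '/' can be passed like any other punctuation string (check the method's docstring):

```python
# result is a stable_whisper transcription result
(
    result
    .merge_by_punctuation(['/'])   # rejoin segments split around a slash
)
result.to_srt_vtt('out.srt')
```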
-
In version 2.6.3+, you can write a customized regrouping algorithm as a string.
Not only does this simplify creating and using your own regrouping algorithm, it also makes it easier to share with others.
So please feel free to share your strings in this discussion and let us know what they do well on (e.g. other languages).
(See stable-ts/stable_whisper/result.py, lines 1905 to 1942 at ef0a87e, for the string format.)
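As an illustration, a chain like the ones earlier in this thread can be collapsed into one string. The exact method keys and separators are defined in the lines linked above, so treat the string below as a sketch to check against them:

```python
import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3', regroup=False)

# either the chained-method form...
(
    result
    .split_by_punctuation([('.', ' '), '。', '?', '?'])
    .split_by_gap(.5)
    .merge_by_gap(.15, max_words=3)
)

# ...or, in 2.6.3+, an equivalent single string (illustrative; verify the
# key/separator syntax against the lines linked above):
# result.regroup('sp=.* /。/?/?_sg=.5_mg=.15+3')
```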