-
Hi @jianfch, I just saw this post. Very good idea indeed. It's a pity that I am not a script programmer; I would have loved to contribute, as I think this is a very useful tool for post-processing / fine-tuning too. Just an idea, not exactly related to this thread: such a tool could be quite useful for post-processing the output of some "TikTok-like" transcription tools, e.g. the output of the transcription function in ByteDance's CapCut. That tool produces TikTok-like short-gap transcriptions that make little sense for languages like Japanese, because the context is missing. I was wondering if your regrouping approach could be used to improve the output of tools like CapCut. Just a wild idea :)
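If such output could be loaded into a stable-ts result somehow, the merging side of the regrouping API seems like the natural fit. A minimal sketch of the idea; the thresholds are pure guesses to tune per source:

```python
import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('clip.mp3', regroup=False)

# Join fragments separated by less than ~0.15 s, then re-split at
# sentence-ending punctuation so the context is restored per sentence.
(
    result
    .merge_by_gap(.15, max_words=10)
    .split_by_punctuation(['。', '?', '!'])
)
```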
-
In our use case, natural segmentation was crucial to reduce the manpower needed in subsequent steps. We chose to use NLTK for POS tagging, then a tiny BiLSTM model to generate cut indices.

Model: https://huggingface.co/metricv/metricsubs-segmenter/tree/main

The inference workflow then basically looked like this:

```python
import stable_whisper

def transcribe_audio_stablewhisper_segmenter(path, segmenter_model):
    model = stable_whisper.load_model("large-v2")
    result = model.transcribe(audio=path, verbose=True, regroup=False, word_timestamps=True, language="en")
    # get_indicies_autoembed is our helper that runs the segmenter model
    # and returns the cut indices for each segment
    result._split_segments(get_indicies_autoembed, args=[segmenter_model, "cuda", 0.5])
    result_regrouped = (
        result
        .split_by_gap(.5)
    )
```

This basically replaces the default regrouping. It wasn't able to identify gaps though, as time information was not fed in, so it is followed by a `split_by_gap`. It is designed only to work as the first step in splitting and regrouping. The model was trained to a quite usable state with three pieces of text, each around 1700 tokens.
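For anyone wanting to try something similar without the trained model, here is a minimal sketch of a `_split_segments` callback, assuming (as the call above suggests) that the callback receives a segment plus the extra `args` and returns word indices to split at. The POS rule is a made-up stand-in for the BiLSTM, not the actual `get_indicies_autoembed`:

```python
import nltk  # requires the 'averaged_perceptron_tagger' data to be downloaded

def get_split_indices_pos(segment, min_words=3):
    # Hypothetical stand-in: cut at coordinating conjunctions ('and',
    # 'but', ...), keeping at least min_words words on each side.
    words = [w.word.strip() for w in segment.words]
    tags = nltk.pos_tag(words)
    return [i for i, (_, tag) in enumerate(tags)
            if tag == 'CC' and min_words <= i <= len(words) - min_words]

# usage mirrors the call above:
# result._split_segments(get_split_indices_pos, args=[3])
```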
-
Hi @Metric-Void, thanks for sharing this. I tried to use the model in this way:
I get an error: config.json missing. Where do I go wrong?
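One possible cause, offered only as a guess: transformers' `from_pretrained` requires a `config.json` in the repo, so if the repo ships a raw checkpoint rather than a transformers-format model, the file has to be downloaded directly. A sketch, where the file name is a placeholder to replace with whatever the repo actually lists:

```python
import torch
from huggingface_hub import hf_hub_download

# "model.pt" is a placeholder; use the actual file name shown in the repo
ckpt_path = hf_hub_download(repo_id="metricv/metricsubs-segmenter",
                            filename="model.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
# the state dict can then be loaded into the matching BiLSTM architecture
```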
-
Not all transcripts have them, but for some, whisper returns words that are capitalized, either because they are nouns or because whisper guesses from the audio that a sentence starts there. Here's a little code that exploits that for grouping. Ironically enough, stopping on nouns is ok for my use case. But ideally someone with knowledge of nltk would expand this to (optionally) exclude nouns.
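A minimal sketch of that idea, assuming `_split_segments` takes a callback that receives a segment and returns word indices to split at (matching the call shown earlier in this thread):

```python
def get_capital_indices(segment):
    # Split wherever a word starts with an uppercase letter: a likely
    # sentence start, or a noun (acceptable for the use case above).
    return [i for i, word in enumerate(segment.words)
            if i > 0 and word.word.strip()[:1].isupper()]

# result._split_segments(get_capital_indices)
```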
-
I'm trying to replicate something similar to:
from Whisper-Standalone. Is this a step in the right direction for the 42-characters-per-2-lines rule?
Eventually I want to edit the default with the rules above. Something like:
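A sketch of one way to approach it, assuming stable-ts's `split_by_length` takes a `max_chars` argument and that the splitting methods support a `newline` flag for inserting a line break instead of starting a new segment (worth verifying against the docs); the 42/84 limits come from the convention itself:

```python
import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3', regroup=False)
(
    result
    .split_by_gap(.5)
    .split_by_length(max_chars=84)                 # at most two 42-char lines per event
    .split_by_length(max_chars=42, newline=True)   # break each event into lines
)
result.to_srt_vtt('audio.srt')
```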
-
Has anyone found "optimum" regrouping settings for Japanese transcriptions? I have so far played with adjusting the values in the default settings. I'm not good at this, so I try to sail close to shore :)
I'm keen to hear your thoughts and experiences.
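Not an answer, but a starting point to tune from: a sketch that keeps the structure of the default chain and swaps in Japanese punctuation. Every threshold here is a guess:

```python
import stable_whisper

model = stable_whisper.load_model('large-v2')
result = model.transcribe('audio.mp3', language='ja', regroup=False)
(
    result
    .split_by_punctuation(['。', '?', '!'])
    .split_by_gap(.5)
    .merge_by_gap(.15, max_words=3)
    .split_by_length(max_chars=24)   # Japanese subtitle lines are usually kept short
)
```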
-
Can someone help me with merging punctuation by '/' (slash)? E.g. for "others own rehearsal spaces/studios", I want "spaces/studios" to stay on one subtitle line, not be split across different lines. With the settings I used, I get my .srt file like this:
I think the code is treating my '/' like whitespace
and thus splitting at it. Can someone help me with this?
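A guess at a fix: if the split really does land around the slash, `merge_by_punctuation` may be able to rejoin those segments, assuming '/' can be passed like any other punctuation string (check the method's docstring):

```python
# result is a stable_whisper transcription result
(
    result
    .merge_by_punctuation(['/'])   # rejoin segments split around a slash
)
result.to_srt_vtt('out.srt')
```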
-
In version 2.6.3+, you can write a customized regrouping algorithm as a string.
Not only does this simplify creating and using your own regrouping algorithm, it also makes it easier to share with others.
So please feel free to share your strings in this discussion and let us know what they do well on (e.g. other languages).
(See stable-ts/stable_whisper/result.py, lines 1905 to 1942 at ef0a87e, for the string format.)
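As an illustration, a chain like the ones earlier in this thread can be collapsed into one string. The exact method keys and separators are defined in the lines linked above, so treat the string below as a sketch to check against them:

```python
import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3', regroup=False)

# either the chained-method form...
(
    result
    .split_by_punctuation([('.', ' '), '。', '?', '?'])
    .split_by_gap(.5)
    .merge_by_gap(.15, max_words=3)
)

# ...or, in 2.6.3+, an equivalent single string (illustrative; verify the
# key/separator syntax against the lines linked above):
# result.regroup('sp=.* /。/?/?_sg=.5_mg=.15+3')
```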