Improve default line lengths in subtitle files #314

brainwane · 2022-10-13T20:51:49Z

brainwane
Oct 13, 2022

It's very cool that Whisper can emit a .srt subtitle file!

For English, it's best to keep subtitles to 42 characters per line, as Amara.org suggests and as Netflix also suggests (see those pages for suggestions for several other languages). That ensures that the subtitles will render reasonably well on most displays.

The .srt I just got from a Whisper run had some lines that were 50-60 characters long, which is longer than some displays will render well; some of the beginning or end of the line is likely to be cut off. For example, in this .srt:

564
00:29:48,800 --> 00:29:52,240
If I don't do that, that might make a difference

565
00:29:52,240 --> 00:29:54,920
between the thing happening and the thing not happening,

566
00:29:54,920 --> 00:30:00,600
or worse, between someone feeling included

567
00:30:00,600 --> 00:30:04,920
and someone feeling like open source has no place for them.

the second line is 56 characters long and the fourth is 59 characters long.

I'd love for write_srt to break up lines appropriately, so that instead of always emitting a single line for a timestamp, it sometimes breaks up the subtitle into 2-3 lines.

whisper/whisper/utils.py

Line 63 in 02b7430

def write_srt(transcript: Iterator[dict], file: TextIO):

When deciding where to split a line of text, per Amara's guidance,

Keep grammatical units together. Read each line to make sure that you do not split up meaningful phrases (for example infinitives, prepositional phrases).

I don't know as much about the characters-per-line conventions for .vtt files but perhaps a similar approach could be used to improve those as well.
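
As a rough illustration of the request above, the simplest form of line breaking is a greedy wrap with Python's standard textwrap module. This is only a sketch: wrap_subtitle is a hypothetical helper, not part of Whisper, and it ignores the "keep grammatical units together" guidance entirely.

```python
import textwrap

MAX_LINE_LENGTH = 42  # per-line limit suggested above for English


def wrap_subtitle(text: str, width: int = MAX_LINE_LENGTH) -> list[str]:
    """Greedily wrap subtitle text so that lines stay within `width` characters.

    A single word longer than `width` is kept whole, so it can still overflow.
    """
    return textwrap.wrap(text.strip(), width=width, break_long_words=False)


lines = wrap_subtitle("and someone feeling like open source has no place for them.")
print("\n".join(lines))
```

A greedy wrap fills the first line as far as it can, so the second line is often much shorter; a real implementation would probably want to balance the lines or prefer grammatical break points.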

brainwane · 2022-10-27T15:42:41Z

brainwane
Oct 27, 2022
Author

The BBC requests line length not exceed 37 characters and offers guidance on how and when to segment lines.

0 replies

Uidiz · 2022-10-29T13:30:58Z

Uidiz
Oct 29, 2022

So have you managed to find a way to break up the lines of the srt file?

4 replies

brainwane Oct 29, 2022
Author

I looked for an automated open source solution and couldn't find one. Right now I'm loading the file into GNOME Subtitles and manually splitting subtitles into multiple lines per subtitle.

nshmyrev Oct 29, 2022

You can check https://github.com/SubtitleEdit/subtitleedit/

Uidiz Oct 30, 2022

Thank you, been thinking about writing a script myself

Aynurbalci Feb 13, 2023

Does it work?

ksn-systems · 2022-10-29T21:35:00Z

ksn-systems
Oct 29, 2022

Hi

Interestingly, your SRT sample shows values other than ,000 ms in the timing. What command-line options are you using to transcribe and create the SRT files? I am assuming you are using the CLI tool?

Subtitle accuracy is something I am interested in.

Also, in my tests, the first registered word is logged at 00:00 even though it's 5 seconds into the audio file.

Many thanks

Darren B.

0 replies

ksn-systems · 2022-10-30T21:00:46Z

ksn-systems
Oct 30, 2022

Hi

Here's a fragment of what I did (in JS).

r[2] contains the text as a string, and data contains the growing content of my VTT output. It only allows for breaking the text once.

It's rough, but it works for testing.

if (r[2].toString().length > 42) {
  // too long: split the words into two roughly equal halves
  const words = r[2].trim().split(" ");
  const half = Math.ceil(words.length / 2);
  data += words.slice(0, half).join(" ") + "\n";
  data += words.slice(half).join(" ") + "\n";
} else {
  data += r[2].trim() + "\n"; // add the single line
}
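
For comparison, here is a minimal Python sketch of the same single-break idea. It differs from the fragment above in one respect: it cuts at the space nearest the character midpoint rather than at half the word count, which tends to give more even line lengths. split_balanced is a hypothetical name, and like the original it only breaks once, so very long text can still exceed the limit.

```python
def split_balanced(text: str, max_len: int = 42) -> list[str]:
    """Split text into two lines at the space nearest the middle, if it is too long."""
    text = text.strip()
    if len(text) <= max_len:
        return [text]
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    if not spaces:
        return [text]  # a single long word cannot be split here
    middle = len(text) / 2
    cut = min(spaces, key=lambda i: abs(i - middle))
    return [text[:cut], text[cut + 1:]]


halves = split_balanced("between the thing happening and the thing not happening,")
print("\n".join(halves))
```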

0 replies

Baenwort · 2022-12-11T21:09:53Z

Baenwort
Dec 11, 2022

It would be nice to have this be either built into the .srt output function or allow another variable that would define how many characters in a line and how many lines to display at once.

0 replies

rBrenick · 2022-12-29T22:08:12Z

rBrenick
Dec 29, 2022

Took a stab at implementing this on a fork here: rBrenick@6f2e2aa

Added an optional parameter called max_line_length set to 42 by default.

Did not do any clever grammatical analysis to keep groups of words together, just simple split by length.

However, as I realised after writing this, it might be better suited as a post-processing step for existing .srt files, since it might be considered outside the scope of this repository. So I've ported it to a quick standalone script as well.

2 replies

mayeaux Jan 24, 2023

whisper.cpp has this functionality; it would be nice if Whisper had it as well. I like the API implemented here:

    parser.add_argument("--max_line_length", type=optional_int, default=42, help="max amount of characters for a line in the subtitle files")

Would be nice if this could be shipped as an optional argument

dgoryeo Jan 27, 2023

Hi @mayeaux, this is a very good approach. Post-processing will always be helpful, I think. Would you consider adding functionality to the standalone script to:
-(a) remove a hallucination side effect: repetitive words
-(b) correct a silence/noise side effect: fix the timing of text with too long a duration

At the moment I do both (a) and (b) manually using SubtitleEdit. I'm sure many people like me, who don't know scripting languages very well, have the same problem. For (b), what I have learned is that the Whisper timestamp for the end of the dialogue is usually quite accurate; Whisper usually gets the start wrong. So if the script can check whether a line has too long a duration (usually longer than 8 seconds), it can correct the duration by calculating a standard duration and moving the start time (while keeping the end time as is). The standard duration can be calculated as: number of characters / (chars/sec reading speed). The SubtitleEdit default is 15 chars/sec.

Thanks!
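
Point (b) could be sketched roughly as follows. This is an illustration, not an existing script: fix_start_time is a hypothetical helper, the 8-second threshold and 15 chars/sec come from the comment above, and the standard duration is taken to be the character count divided by the reading speed.

```python
CHARS_PER_SECOND = 15.0  # reading speed quoted above; treat as a tunable assumption
MAX_DURATION = 8.0       # seconds; threshold suggested above for "too long"


def fix_start_time(start: float, end: float, text: str,
                   cps: float = CHARS_PER_SECOND,
                   max_duration: float = MAX_DURATION) -> float:
    """Return a corrected start time for an over-long subtitle, keeping `end` fixed."""
    if end - start <= max_duration:
        return start  # duration looks plausible, leave it alone
    standard_duration = len(text) / cps  # seconds needed to read the text
    # Never move the start earlier than the original start.
    return max(start, end - standard_duration)


# A 30-character subtitle shown for 12 seconds is pulled in to 2 seconds.
print(fix_start_time(10.0, 22.0, "x" * 30))
```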

glangford · 2023-01-24T19:47:32Z

glangford
Jan 24, 2023

With current whisper, a challenge is that if some .srt line is too long and we need to remove two words from it, for example, there is no obvious way of accurately adjusting the timestamps at the end of that line or for the start of the next line.

When word level timestamps are released, this is a much easier problem to solve (apart from the "don't split up meaningful phrases" guidance) and I think it would make sense to include in whisper itself.

word-level timestamps in transcribe() #869

5 replies

mayeaux Mar 23, 2023

Now that this is shipped do you think the idea of adding a maximum number of characters per line while still outputting accurate timestamps is more feasible?

glangford Mar 23, 2023

It is definitely more feasible - see the video and .srt example in #1072: currently the .srt files produced with word_timestamps True have underscores that appear, timed for each word. At the moment, it looks like it would be easiest to produce an entirely new .srt from the .json file, applying rules such as line length. Each word is individually time stamped within the .json, however the bigger issue of overall time sync may still exist (eg. #1077)

mayeaux Mar 26, 2023

I am actually just finishing up this code. Once I'm done I'll post it as a Gist, it was relatively easy in the end.

makseem77 Jul 3, 2023

Hello, did you manage to make it work? I wonder if I could do the conversion from the .json to a .vtt file.

ClaireCJS Jul 6, 2023

I'm wondering too! :)

ubanning · 2023-02-13T01:41:17Z

ubanning
Feb 13, 2023

Any news of accomplishing this without doing a line break?
Thanks

0 replies

Aynurbalci · 2023-02-13T12:32:41Z

Aynurbalci
Feb 13, 2023

Hello OpenAI,
I'm working with your Whisper project. I want to separate the text into lines. What can I do? Can you help me?
I'm waiting for your answer.
Thanks.

0 replies

p4-k4 · 2023-03-06T11:13:54Z

p4-k4
Mar 6, 2023

Have resorted to creating an additional utility to process the .srt file post-whisper.

Turns this:

00:00:00,000 --> 00:00:09,600
For the next week, Mount Maunganui will be home to New Zealand's first ever Wingfoil

To this:

1
00:00:00,000 --> 00:00:09,600
For the next week, Mount Maunganui will be
home to New Zealand's first ever Wingfoil

/// Parses SRT content, caps line length at `lineCharLimit`, and returns a list of `Subtitle`s.
/// `parseDuration`, `splitText`, and `Subtitle` are helpers defined elsewhere (not shown).
List<Subtitle> fromSRT(String srtContent, {int lineCharLimit = 50}) {
  final regex = RegExp(
    r'(\d+)\n'
    r'(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n'
    r'((?:[^\n]+\n?)+)',
    dotAll: true,
  );
  return regex.allMatches(srtContent).map((match) {
    final index = int.parse(match.group(1)!);
    final startTime = parseDuration(match.group(2)!);
    final endTime = parseDuration(match.group(3)!);
    var text = match.group(4)!.trim();
    if (text.length > lineCharLimit) {
      // Split the cue text into lines capped at the limit.
      text = splitText(lineCharLimit, text).join('\n');
    }
    return Subtitle(
      index: index,
      startTime: startTime,
      endTime: endTime,
      text: text,
    );
  }).toList();
}

One consideration, though: before we even consider line breaks, it would be ideal to specify a limit on the maximum duration of a subtitle before whisper generates the next subtitle. From there, it would help to be able to specify the maximum number of line breaks, followed by the character limit as this topic discusses.

Not directly related to this discussion, but it would also be useful to be able to join gaps that are shorter than a specified duration, for cases where the time between subtitles is so short that it's not worth leaving a gap and they should instead be butted up against each other. Currently we're seeing continuous (gapless) segments, which I assume is a work in progress.
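
The gap-joining idea could look something like the sketch below. It assumes subtitles are held as (start, end, text) tuples sorted by start time; close_small_gaps and the 0.2-second default are hypothetical choices for illustration, not anything Whisper provides.

```python
def close_small_gaps(cues, min_gap=0.2):
    """Extend a cue's end to the next cue's start when the gap is shorter than min_gap.

    `cues` is a list of (start, end, text) tuples sorted by start time (seconds).
    """
    result = []
    for i, (start, end, text) in enumerate(cues):
        if i + 1 < len(cues):
            next_start = cues[i + 1][0]
            # Only close real, small gaps; overlaps and long pauses are left alone.
            if 0 < next_start - end < min_gap:
                end = next_start
        result.append((start, end, text))
    return result


cues = [(0.0, 1.9, "a"), (2.0, 3.0, "b"), (4.0, 5.0, "c")]
print(close_small_gaps(cues))
```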

2 replies

brendan-jarvis Mar 24, 2023

Kia ora! Interesting question, I checked the BBC Subtitle Guidelines to see if it could help here.

According to the BBC Subtitle Guidelines, the recommended subtitle speed is 160-180 words-per-minute (WPM) or 0.33 to 0.375 seconds per word. However, viewers tend to prefer verbatim subtitles, so the rate may be adjusted to match the pace of the program.

There are some other exceptions, for example keeping punchlines separate from the preceding text. In practice subtitle timing can be summarised as being about editorial decisions.

But for the purposes of automatic subtitling, we have the following constraints from the BBC Subtitle Guidelines:

Maximum line length of 37 characters,
0.33 to 0.375 seconds per word.
Avoid 3 lines or more.
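
As a sketch, the three constraints above can be turned into a simple checker for one subtitle. check_cue is a hypothetical helper; it treats 0.33 seconds per word as the fastest acceptable pace and does not model any of the editorial exceptions mentioned earlier.

```python
BBC_MAX_LINE_LENGTH = 37
BBC_MAX_LINES = 2
MIN_SECONDS_PER_WORD = 0.33  # fastest pace quoted above (about 180 WPM)


def check_cue(lines, duration):
    """Return a list of human-readable problems for one subtitle."""
    problems = []
    if len(lines) > BBC_MAX_LINES:
        problems.append(f"{len(lines)} lines (more than {BBC_MAX_LINES})")
    for line in lines:
        if len(line) > BBC_MAX_LINE_LENGTH:
            problems.append(f"line has {len(line)} characters")
    words = sum(len(line.split()) for line in lines)
    min_duration = words * MIN_SECONDS_PER_WORD
    if duration < min_duration:
        problems.append(f"shown for {duration}s but needs about {min_duration:.2f}s")
    return problems


print(check_cue(["Hello there"], 1.0))
```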

p4-k4 Mar 24, 2023

Tēnā koe! Both options, WPM/S and character length, would be useful. In my case I've found that some editors (in broadcast) are either clueless or too lazy to conform to title/graphic safe areas (you'd be surprised it still happens in "mainstream" media).

In my case, I needed a way to limit character length for burned in subtitles in order not to obstruct burned in graphics/lower-thirds but maintain font size.

I've actually implemented all of what I mentioned above with the ability to limit maximum lines. It also uses word-level timestamps to generate somewhat improved start/end timestamps for segments. Currently weighing up options for diarization. This is all ok as post-processing but would be ideal to have WPM/S constraints already processed.

Have ended up jumping down the rabbit hole and created dart bindings for whisper.cpp - not expecting to see any of these "fine details" implemented here anytime soon.

Aynurbalci · 2023-03-23T17:18:27Z

Aynurbalci
Mar 23, 2023

Unfortunately my channel closed :(


0 replies

glangford · 2023-03-30T11:55:44Z

glangford
Mar 30, 2023

Following the approach whisper takes in defining a writer for text, subtitles, etc. here is a class for creating subtitles with a maximum line length and number of lines. Appreciate any comments if you are able to try it out. @jongwook would you consider something like this in a pull request?

Example usage using a .json file generated with word_timestamps set to True. You can also give it a result like other writers.

import json

writer = SubtitlesWriterTimed(max_line_count=2, max_line_length=42)
with open("test.json") as js:  # .json produced with word_timestamps set to True
    jsdata = json.load(js)
with open("test.srt", "w") as f:
    writer.write_result(jsdata, f)

The class

from whisper.utils import SubtitlesWriter
from typing import TextIO

class SubtitleBlock:
    def __init__(self, max_line_count, max_line_length):
        self.max_line_count = max_line_count
        self.max_line_length = max_line_length
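        self.block_start = 0
        self.block_end = 0
        self.block = [""]
        self.line = 0
        self.started = False  # becomes True once the first word has been added

    def add_word(self, word_timed):
        word = word_timed["word"]
        if not self.started:  # a start time of 0.0 is valid, so don't test block_start itself
            self.started = True
            self.block_start = word_timed["start"]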
        self.block_end = word_timed["end"] # provisional end time
        if len(self.block[self.line]) + len(word) > self.max_line_length and word[0] != "-": # don't split hyphenated words over lines
            self.line += 1
            self.block.append("")
        self.block[self.line] += word

    def is_complete_before(self, word_timed) -> bool:
        """Indicate if the upcoming word won't fit and a new block should be started"""
        word = word_timed["word"]
        max_length = len(self.block[self.line]) + len(word) > self.max_line_length  
        no_tail_hyphenation = word[0] != "-"
        max_lines = self.line + 1 >= self.max_line_count
        return max_length and no_tail_hyphenation and max_lines
    
    def do_yield(self, formatter):
        text = "\n".join([line.strip() for line in self.block])
        yield formatter(self.block_start), formatter(self.block_end), text

class SubtitlesWriterTimed(SubtitlesWriter):
    """Write an .srt file after transcribing with word_timestamps enabled, 
       imposing a maximum line length and number of lines per entry. 
    """
    always_include_hours = True
    decimal_marker = ","   
    
    def __init__(self, max_line_count=1, max_line_length=42):
        self.max_line_count = max_line_count
        self.max_line_length = max_line_length
       
    def iterate_result(self, result: dict):
        block = SubtitleBlock(self.max_line_count, self.max_line_length)
        for segment in result["segments"]:
            for word_timed in segment["words"]: # .word, .start, .end
                if block.is_complete_before(word_timed):
                    yield from block.do_yield(self.format_timestamp)
                    block = SubtitleBlock(self.max_line_count, self.max_line_length)
                block.add_word(word_timed)
        yield from block.do_yield(self.format_timestamp)

    def write_result(self, result: dict, file: TextIO):
        for i, (start, end, text) in enumerate(self.iterate_result(result), start=1):
            print(f"{i}\n{start} --> {end}\n{text}\n", file=file, flush=True)

Note that this implementation can allow lines to exceed the line limit in limited cases, when breaking would otherwise split a hyphenated word (an implementation choice).
A future addition could be named entity recognition to keep entities together on a line.

3 replies

ryanheise Apr 2, 2023

        text = "\n".join([line.strip() for line in self.block])

In my implementation I think I started along the same lines (no pun intended ;-) ), but after integrating word highlighting I realised it was more practical to represent a subtitle as a 1D list of words that embed "\n" and " " prefixes as appropriate. PR #1184 demonstrates the approach.
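
To make the 1D representation concrete, here is a minimal sketch of the idea as described above. It is not the code from PR #1184: layout_words is a hypothetical function, and it only handles line width, not line count or highlighting.

```python
def layout_words(words, max_line_width=42):
    """Return each word prefixed with "", a space, or a newline.

    Joining the result with "" gives the subtitle text, while every word
    remains a separate list item (handy for per-word highlighting).
    """
    laid_out = []
    line_len = 0
    for word in words:
        word = word.strip()
        if not laid_out:
            prefix = ""       # first word of the subtitle
        elif line_len + 1 + len(word) > max_line_width:
            prefix = "\n"     # word would overflow: start a new line
            line_len = 0
        else:
            prefix = " "
        laid_out.append(prefix + word)
        line_len += len(word) + (1 if prefix == " " else 0)
    return laid_out


laid_out = layout_words(["If", "I", "don't", "do", "that,"], max_line_width=10)
print("".join(laid_out))
```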

AZT143 Apr 3, 2023

how to use this code? I'm not very sophisticated with coding or such functionalities, do i integrate it somewhere (if so how), or do i paste this code together with the code that runs whisper?

brendan-jarvis Apr 4, 2023

Assuming you've already cloned the Whisper repo and installed the dependencies, you can just checkout the pull request:

Fetch PR #1184 ("Implement max line width and max line count, and make word highlighting optional") to a local git branch named 1184 with:
git fetch origin pull/1184/head:1184
Then check out the local branch for the pull request using:
git checkout 1184
Then it can be used, for example:
python -m whisper tests/jfk.flac --model base.en --max_line_count 2 --max_line_width 42 --word_timestamps True

glangford · 2023-04-03T20:05:11Z

glangford
Apr 3, 2023

I am reading the BBC Subtitle Guidelines mentioned in this thread, and wanted to point out a key aspect of the maximum line length recommendation.

In the discussion of line length (3.1), the 37-character recommendation applies only to broadcast platforms. For online, the recommendation is "68% of the width of a 16:9 video and 90% of the width of a 4:3 video...the number of characters that generate this width is determined by the font used"

Also, as we already know

"Each subtitle should comprise a single complete sentence"
"A maximum subtitle length of two lines is recommended"

So fixing line length to 37 characters, for example, is not correct in general for video consumed online. The other implication is that ideally we would be able to segment sentences multi-lingually. I had a quick look at this in spaCy and NLTK. In the end, reliably segmenting sentences integrated with whisper means introducing a dependency on an NLP library.

https://www.bbc.co.uk/accessibility/forproducts/guides/subtitles/
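
To illustrate why a fixed character count does not transfer to online video, here is a back-of-the-envelope sketch of the width-based rule quoted above. max_chars_per_line and the average glyph width parameter are assumptions for illustration; a real implementation would have to measure the rendered font rather than use one average.

```python
import math

# Fraction of the video width that a subtitle line may occupy, per the quote above.
WIDTH_FRACTION = {"16:9": 0.68, "4:3": 0.90}


def max_chars_per_line(video_width_px, avg_char_width_px, aspect="16:9"):
    """Estimate how many characters fit in the recommended share of the video width."""
    usable_width = video_width_px * WIDTH_FRACTION[aspect]
    return math.floor(usable_width / avg_char_width_px)


# A 1920 px wide 16:9 video with glyphs averaging 30 px wide.
print(max_chars_per_line(1920, 30))
```

With a smaller font the same video allows more characters per line, which is the point of the width-based recommendation.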

7 replies

glangford Apr 7, 2023

For clarity, here is a more challenging example where the endpoints of clauses (detected by spaCy, simple debug output below) could be considered as breakpoints for a subtitle. I am sure there are many difficult challenge examples to test with but it seems to be at least worth investigating further.

One end is coated with a material that can be ignited by frictional heat generated by striking the match against a suitable surface.

clause triggered by: with
 with a material that can be ignited by frictional heat generated by striking the match against a suitable surface
clause triggered by: ignited
 that can be ignited by frictional heat generated by striking the match against a suitable surface
clause triggered by: by
 by frictional heat generated by striking the match against a suitable surface
clause triggered by: generated
 generated by striking the match against a suitable surface
clause triggered by: by
 by striking the match against a suitable surface
clause triggered by: against
 against a suitable surface

So there are a number of options to consider as break points.

Similarly,
When I jumped on the bus I saw the man who had taken the basket from the old lady.

conj_break triggered by: When
clause triggered by: jumped
    When I jumped on the bus
clause triggered by: on
    on the bus
clause triggered by: taken
    who had taken the basket from the old lady
clause triggered by: from
    from the old lady
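
The debug output above comes from spaCy. As a dependency-free stand-in (explicitly not spaCy's parser, and English only), the sketch below marks candidate break points before a small hand-picked set of trigger words. BREAK_BEFORE and candidate_breaks are hypothetical, and a word list like this will miss or misjudge many cases that a real parse would handle.

```python
# Hypothetical, hand-picked trigger words taken from the two examples above.
BREAK_BEFORE = {"with", "that", "by", "against", "on", "who", "from", "when"}


def candidate_breaks(sentence):
    """Return the indices of words before which a line break might be acceptable."""
    words = sentence.split()
    return [
        i
        for i, word in enumerate(words)
        if i > 0 and word.lower().strip(",.;:") in BREAK_BEFORE
    ]


sentence = "When I jumped on the bus I saw the man who had taken the basket from the old lady."
breaks = candidate_breaks(sentence)
print(breaks)
```

The returned indices (before "on", "who" and "from") would then be weighed against the line length limit to choose the actual break.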

ryanheise Apr 8, 2023

I'm not debating by any means, I was just providing relevant information. As far as I'm aware there is still no constituency parser, just the dependency parser. You can build language specific algorithms on top of the dependency parser as I've suggested, but if you're actually using a different approach, I would be interested for you to share what the API is that you found in the latest spaCy, even if it is an English only API.

ryanheise Apr 8, 2023

Just checked for external add ons, and there is actually a 3rd party constituency parser here:

https://github.com/nikitakit/self-attentive-parser

Edit: Ah I remember now why I had put this option onto the back burner in the past, it doesn't support all the languages I needed. There is an open discussion at spaCy tracking the feature request to fully support constituency parsing within spaCy here. There are linked algorithms for converting a dependency parse tree into a constituency parse tree.

glangford Apr 11, 2023

Thanks for these links, I am not an expert at all in this area but so far it looks like spaCy on its own could suffice. Work to do still but a combination of part-of-speech with dependency and the Matcher construct works well. Maybe as a custom pipeline stage that annotates a document with candidate grammatical break points, verbal pauses and start/end word timing? Not sure of the possible downstream uses, this is just exploratory for me.

ryanheise Apr 11, 2023

I think it's a good problem to solve, maybe it could be a reusable python package that can be added to other projects as a dependency. Although it's also a non-trivial problem. One thing you'll find is that the problem is not really generalisable across all languages. For example, much of Japanese word order is the reverse of English and this completely changes the way speakers of each language naturally chunk their sentences. Not to mention, spaCy unfortunately doesn't use universal dependencies for all languages (from memory, German uses TIGER instead of UD).

glangford · 2023-05-13T16:15:46Z

glangford
May 13, 2023

I looked for an automated open source solution and couldn't find one. Right now I'm loading the file into GNOME Subtitles and manually splitting subtitles into multiple lines per subtitle.

@brainwane Is the latest whisper (with line length and number of lines options) solving this problem for you? If not, I have a preliminary approach which separates the text into individual sentences, and chooses line breaks grammatically where possible (broadly following the Netflix guidelines). It also realigns the subtitle timing.

Still a lot of testing to do on it - if you have a word timestamped .json (English only for now) I can generate an .srt if you want to give it a try.

2 replies

darnn May 14, 2023

@glangford I would, if that's alright. It doesn't really help me that it's implemented in Whisper itself since I mostly use this: https://github.com/Const-me/Whisper/

So a standalone program that would be able to convert the word timestamped json file into a properly split srt would be great.

glangford May 14, 2023

@darnn Sure, it is standalone python (using spaCy), and the idea is to identify grammatical breaks in long sentences, but fall back to a random break that fits the line length if necessary. If you share a json here, and your desired maximum number of lines and maximum line width, I will send the .srt back. Appreciate any feedback on the result.

padraigohoulihan · 2023-05-14T11:32:49Z

padraigohoulihan
May 14, 2023

I'm an absolute novice, so sorry for the stupidity of this question, but from this thread it looks like Whisper has been updated with the ability to specify maximum line widths.

I've been using the Colab example (because I have no idea how to code or anything). Does THAT have the new functionality in it? And if so, how do I use it? At the moment I just run all of this in LibriSpeech.ipynb. Is there a line of code I can add that will make it export with max line widths as specified?

Again, sorry if all of this is intensely stupid.

! pip install git+https://github.com/openai/whisper.git
! pip install jiwer

import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from tqdm.notebook import tqdm

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

!whisper "episode.mp3" --model medium

4 replies

glangford May 14, 2023

To get a handle on all of the whisper options, you can use something like the following in a new 2 line notebook:

! pip install git+https://github.com/openai/whisper.git
!whisper --help

You can set
--word_timestamps True
along with your desired maximums, for example
!whisper "episode.mp3" --model medium --word_timestamps True --max_line_width 42 --max_line_count 2

padraigohoulihan May 15, 2023

Thank you for your reply! For some reason it still seems to be generating up to ~ 80 characters in the output.

Pasted what I've put in below, am I doing something wrong? Note I've excluded all the option explanations from the first cell for brevity.

FIRST CELL

! pip install git+https://github.com/openai/whisper.git
!whisper --help

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-hytc6so7
Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-hytc6so7
Resolved https://github.com/openai/whisper.git to commit 248b6cb
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from openai-whisper==20230314) (2.0.0)
Requirement already satisfied: numba in /usr/local/lib/python3.10/dist-packages (from openai-whisper==20230314) (0.56.4)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from openai-whisper==20230314) (1.22.4)
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from openai-whisper==20230314) (2.0.0+cu118)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from openai-whisper==20230314) (4.65.0)
Requirement already satisfied: more-itertools in /usr/local/lib/python3.10/dist-packages (from openai-whisper==20230314) (9.1.0)
Collecting tiktoken==0.3.3 (from openai-whisper==20230314)
Downloading tiktoken-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 51.6 MB/s eta 0:00:00
Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken==0.3.3->openai-whisper==20230314) (2022.10.31)
Requirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.10/dist-packages (from tiktoken==0.3.3->openai-whisper==20230314) (2.27.1)
Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->openai-whisper==20230314) (3.25.2)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->openai-whisper==20230314) (3.12.0)
Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->openai-whisper==20230314) (16.0.3)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba->openai-whisper==20230314) (0.39.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from numba->openai-whisper==20230314) (67.7.2)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch->openai-whisper==20230314) (4.5.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->openai-whisper==20230314) (1.11.1)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->openai-whisper==20230314) (3.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->openai-whisper==20230314) (3.1.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken==0.3.3->openai-whisper==20230314) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken==0.3.3->openai-whisper==20230314) (2022.12.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken==0.3.3->openai-whisper==20230314) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken==0.3.3->openai-whisper==20230314) (3.4)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->openai-whisper==20230314) (2.1.2)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->openai-whisper==20230314) (1.3.0)
Building wheels for collected packages: openai-whisper
Building wheel for openai-whisper (pyproject.toml) ... done
Created wheel for openai-whisper: filename=openai_whisper-20230314-py3-none-any.whl size=798075 sha256=7823b2d6def3742e8e11a75335cb5e91d55e4a1e6336cefe534c8bd40373e507
Stored in directory: /tmp/pip-ephem-wheel-cache-nvsh7pqb/wheels/8b/6c/d0/622666868c179f156cf595c8b6f06f88bc5d80c4b31dccaa03
Successfully built openai-whisper
Installing collected packages: tiktoken, openai-whisper
Successfully installed openai-whisper-20230314 tiktoken-0.3.3
usage: whisper
[-h]
[--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}]
[--model_dir MODEL_DIR]
[--device DEVICE]
[--output_dir OUTPUT_DIR]
[--output_format {txt,vtt,srt,tsv,json,all}]
[--verbose VERBOSE]
[--task {transcribe,translate}]
[--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
[--temperature TEMPERATURE]
[--best_of BEST_OF]
[--beam_size BEAM_SIZE]
[--patience PATIENCE]
[--length_penalty LENGTH_PENALTY]
[--suppress_tokens SUPPRESS_TOKENS]
[--initial_prompt INITIAL_PROMPT]
[--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT]
[--fp16 FP16]
[--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
[--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
[--logprob_threshold LOGPROB_THRESHOLD]
[--no_speech_threshold NO_SPEECH_THRESHOLD]
[--word_timestamps WORD_TIMESTAMPS]
[--prepend_punctuations PREPEND_PUNCTUATIONS]
[--append_punctuations APPEND_PUNCTUATIONS]
[--highlight_words HIGHLIGHT_WORDS]
[--max_line_width MAX_LINE_WIDTH]
[--max_line_count MAX_LINE_COUNT]
[--threads THREADS]
audio
[audio ...]

positional arguments:
  audio  audio file(s) to transcribe

SECOND CELL

!whisper "Test Episode.mp3" --model medium --word_timestamps True --max_line_width 10 --max_line_count 1

100%|██████████████████████████████████████| 1.42G/1.42G [00:11<00:00, 128MiB/s]
/usr/local/lib/python3.10/dist-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use --language to specify the language
Detected language: English
[00:00.000 --> 00:12.060] lovely and what's your name a second alpha alpha alpha your name is alpha wow
[00:12.060 --> 00:14.560] you must be so manly
[00:15.220 --> 00:20.460] what's your name on the alpha in here literally wow okay cool and who you here

glangford May 15, 2023

@padraigohoulihan The timed text whisper is displaying here is just to show progress in the transcription. The line width and line count options apply to the subtitle file (.srt) which is generated at the end.

You can turn off the display of this text by passing --verbose False.
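Since the width and count options only affect the generated subtitle file, one way to confirm they took effect is to scan the .srt itself rather than the console output. The following is a minimal sketch and not part of Whisper: `overlong_lines` is a hypothetical helper, and it assumes a well-formed file with `\n` line endings and numeric cue indices.

```python
def overlong_lines(srt_text: str, max_width: int = 42) -> list[tuple[int, str]]:
    """Return (cue_number, line) pairs for text lines wider than max_width."""
    problems = []
    # Cues in an .srt file are separated by blank lines.
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue (index, timing, text)
        cue_number = int(lines[0])
        # lines[1] is the timing line; everything after it is subtitle text.
        for text_line in lines[2:]:
            if len(text_line) > max_width:
                problems.append((cue_number, text_line))
    return problems


sample = """1
00:00:00,000 --> 00:00:02,000
you must be so manly

2
00:00:02,000 --> 00:00:06,000
between the thing happening and the thing not happening,
"""

for number, line in overlong_lines(sample):
    print(f"cue {number}: {len(line)} chars: {line}")
```

In practice you would read the text from the .srt that Whisper wrote and pass the same limit you gave to --max_line_width.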

padraigohoulihan May 15, 2023

Agh got it. Thank you so much!! Really appreciate your help

darnn · 2023-05-14T12:06:06Z

darnn
May 14, 2023

Okay, here's a json file:
Theban.Plays.Antigone.aac.words.json.zip
And here's an srt file that I split, corrected and timed by hand:
Theban.Plays.Antigone_en 2.srt.txt

7 replies

glangford May 14, 2023

Ah...I see what's going on. @darnn I glossed over your comment that you use https://github.com/Const-me/Whisper/. This is a port of the whisper.cpp implementation, and it does not produce the same output as the original Whisper. The JSON is different and the individual words appear to have whitespace removed. Not something my code can digest for now.

darnn May 14, 2023

That json is actually from Whisper-Timestamped, IIRC, but the point stands. I'll be able to produce one with the latest version of Whisper, but only in about a day. My GPU can't run the large model with vanilla Whisper, and the medium one would take several hours, so I'd have to leave it running overnight.

glangford May 14, 2023

Understood. If you want to do something short and sweet without having to run overnight, that works too.

darnn May 14, 2023

It was only after I did this that I realized your tool might not support Hebrew. But here goes, anyway:
efrat3fw.json.txt

glangford May 15, 2023

It was only after I did this that I realized your tool might not support Hebrew.

English only for the moment unfortunately. Also need a maximum line length (default 42) and number of lines (default 2).

rexsateesh · 2023-10-10T11:11:19Z

rexsateesh
Oct 10, 2023

This piece of code is working fine for me.

import whisper
from whisper.utils import get_writer 

audio = './audio.mp3'
model = whisper.load_model('small')  # the first parameter is called `name`, so pass it positionally
result = model.transcribe(audio=audio, language='en', word_timestamps=True, task="transcribe")

# Set VTT Line and words width
word_options = {
    "highlight_words": False,
    "max_line_count": 1,
    "max_line_width": 42
}
vtt_writer = get_writer(output_format='vtt', output_dir='./')
vtt_writer(result, audio, word_options)
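
For readers wondering what the two width options mean in practice, here is a rough sketch of greedy wrapping over word-level timestamps. It is an illustration only, not Whisper's actual writer code: `wrap_words` is a hypothetical helper and the sample timings are made up. It assumes each word is a dict with "word", "start" and "end" keys, as in the `words` lists that `word_timestamps=True` produces.

```python
def wrap_words(words, max_line_width=42, max_line_count=1):
    """Greedily pack timed words into cues of at most max_line_count lines."""
    cues = []                 # finished cues as (start, end, [line, ...])
    lines, current = [], ""   # closed lines and the line being built
    cue_start = cue_end = None

    for w in words:
        text = w["word"].strip()
        candidate = f"{current} {text}".strip()
        if current and len(candidate) > max_line_width:
            # The word does not fit: close the current line.
            lines.append(current)
            current = text
            if len(lines) == max_line_count:
                # The cue is full as well: emit it and start a new one.
                cues.append((cue_start, cue_end, lines))
                lines, cue_start = [], None
        else:
            current = candidate
        if cue_start is None:
            cue_start = w["start"]
        cue_end = w["end"]

    # Flush whatever is left after the last word.
    if current:
        lines.append(current)
    if lines:
        cues.append((cue_start, cue_end, lines))
    return cues


sample = [
    {"word": " you", "start": 0.0, "end": 0.4},
    {"word": " must", "start": 0.4, "end": 0.8},
    {"word": " be", "start": 0.8, "end": 1.0},
    {"word": " so", "start": 1.0, "end": 1.3},
    {"word": " manly", "start": 1.3, "end": 2.0},
]
for start, end, lines in wrap_words(sample, max_line_width=10, max_line_count=1):
    print(f"{start:.2f} --> {end:.2f}", " / ".join(lines))
```

A greedy fill like this never looks at grammar, so it can leave a short trailing line at the end of a sentence.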

0 replies

dev-vinc · 2023-11-05T17:23:17Z

dev-vinc
Nov 5, 2023

@glangford are you still working on that approach that would split lines grammatically where possible? I'm using Whisper for my studies and I notice (regrettably) while using --word_timestamps True --max_line_width 42 --max_line_count 2 a lot of truncated sentences in the output, where the next line contains just the last or the last two words of a sentence before a full stop. If you found a solution to this kind of problem, I would be really grateful to hear it.
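
One simple way to reduce the orphaned-word effect described above is to split a two-line subtitle near its midpoint instead of filling the first line completely. This is only a length-based sketch under my own assumptions, not something Whisper does: `balanced_split` is a hypothetical helper and it knows nothing about grammar.

```python
def balanced_split(text: str, max_line_width: int = 42) -> list[str]:
    """Split text into two lines of similar length, if it needs splitting."""
    words = text.split()
    if len(text) <= max_line_width or len(words) < 2:
        return [text]

    best_diff, best_lines = None, None
    for i in range(1, len(words)):
        first = " ".join(words[:i])
        second = " ".join(words[i:])
        # Skip break points that leave either line too wide.
        if len(first) > max_line_width or len(second) > max_line_width:
            continue
        diff = abs(len(first) - len(second))
        if best_diff is None or diff < best_diff:
            best_diff, best_lines = diff, [first, second]

    # If no break point satisfies the limit, return the text unchanged.
    return best_lines if best_lines else [text]


for line in balanced_split("and someone feeling like open source has no place for them."):
    print(line)
```

On the sample sentence this breaks between "open" and "source", which shows the limit of a purely length-based rule: it avoids a one-word second line but can still split a phrase that belongs together.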

10 replies

yavorminchev Feb 6, 2024

@glangford I'm also interested in that because I couldn't find a solution online for grammatically correctly formatted subtitles from Whisper.

glangford Feb 7, 2024

Below is a link to a Gist for creating grammatically separated subtitles.
It is probably overkill for most content, and I would encourage you to evaluate simpler solutions first. :)

Requirements:

install spaCy NLP and an appropriate language model (https://spacy.io) - default model is en_core_web_lg
whisper output with word_timestamps=True saved into a .json
Python 3.10+

Features

creates an .srt using spaCy identified punctuation, parts of speech and phrases where possible
can set desired number of subtitle lines and line width
inspired by the guidance given by BBC and Netflix for subtitling
tested in English but could be adapted for other languages
can import an optional data file to recognize place names, specialized commercial terminology, etc. (named entities)

Minimal invocation
python3 -m subwisp input.json >output.srt

I emphasize that this is a prototype that was developed to see what was possible to create leveraging NLP. You won't likely need something this complex.

That said, it has worked well in a wide variety of different videos and it does improve readability of subtitles.

https://gist.github.com/glangford/a2b24ffd92c832c60e1b1b49da1a8b27

yavorminchev Feb 9, 2024

Thanks a lot for sharing, @glangford. I've tested it and it works very well, great job!

In regards to subtitle timing, I've noticed that this approach follows strictly the speech and ends a subtitle segment as soon as the spoken sentence ends. That could be a desired style, but generally subtitles flow more unobtrusively if a segment remains on screen before transitioning to the next sentence, should that pause not exceed a certain length. Due to my very limited understanding of code, I'm wondering if it would be feasible to add a parameter for that and also a parameter for a consistent minimum gap between the subtitles? Maybe even minimum and maximum subtitle duration might be useful, though that goes beyond the scope of your objective.

Also, would it be easy to add an option for outputting .vtt files?

Lastly, would there be any benefits to using the transformer-based SpaCy models? From my tests there was barely any difference between those and the large models.

dev-vinc Feb 9, 2024

Cool stuff! My "solution" was much more naive, I posted it here https://github.com/dev-vinc/whisper-json-pharser/

glangford Feb 9, 2024

In regards to subtitle timing, I've noticed that this approach follows strictly the speech and ends a subtitle segment as soon as the spoken sentence ends.

Yes. It realigns subtitle blocks based solely on the word timestamps.

I don't have a need to develop further upgrades but to answer your questions - it would be feasible to extend display times/gaps according to some policy, for a number of milliseconds, but I'm not clear on how that would work. I don't know whether it would apply to all sentences, or what the specific rules should be. Enforcing minimum display times, words per minute, etc. was outside the scope.
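
As one possible policy, purely for illustration, each subtitle could stay on screen a little longer as long as it does not run into the next one. The sketch below is not part of the Gist: `extend_display_times`, `max_extension` and `min_gap` are hypothetical names and the default values are arbitrary assumptions.

```python
def extend_display_times(cues, max_extension=1.0, min_gap=0.08):
    """Lengthen each cue by up to max_extension seconds.

    cues is a list of (start, end, text) tuples sorted by start time.
    """
    adjusted = []
    for i, (start, end, text) in enumerate(cues):
        new_end = end + max_extension
        if i + 1 < len(cues):
            next_start = cues[i + 1][0]
            # Stop at least min_gap seconds before the next cue begins.
            new_end = min(new_end, next_start - min_gap)
        # Never shorten a cue: back-to-back cues keep their original end.
        adjusted.append((start, max(end, new_end), text))
    return adjusted


cues = [
    (3.82, 5.38, "One two three."),
    (6.30, 7.12, "One two three."),
    (7.12, 7.84, "One two three"),
    (7.84, 8.30, "four."),
]
for start, end, text in extend_display_times(cues):
    print(f"{start:.2f} --> {end:.2f}  {text}")
```

Minimum and maximum durations or a reading-speed limit would need further rules on top of this.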

Also, would it be easy to add an option for outputting .vtt files?

I don't use .vtt and I hadn't planned on adding support for it.
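
For anyone who needs .vtt anyway, an existing .srt can usually be converted after the fact, since the two formats are close: WebVTT starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds. A minimal sketch, assuming a plain .srt without styling tags (`srt_to_vtt` is a hypothetical helper, not part of the Gist or of Whisper):

```python
import re

# Matches the "HH:MM:SS,mmm" timestamps used on .srt timing lines.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert simple SubRip text to WebVTT text."""
    out = ["WEBVTT", ""]
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are rewritten; subtitle text is left alone.
            line = TIMING.sub(r"\1.\2", line)
        out.append(line)
    return "\n".join(out) + "\n"


srt = """1
00:00:03,820 --> 00:00:05,380
One two three.
"""
print(srt_to_vtt(srt))
```

The numeric cue indices are kept because WebVTT allows an optional identifier line before each timing line.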

Would there be any benefits to using the transformer-based SpaCy models?

Good question, I haven't tried any of the transformer based models and I'm not up to speed on how they compare to the large models.

amolinasalazar · 2023-11-06T23:51:57Z

amolinasalazar
Nov 6, 2023

Hi, from version v20231105 there is an extra option called --max_words_per_line to fix a maximum number of words per subtitle line. You can check the PR for more details and make your own tests.

From my experience, subtitles generated using this option are more pleasant compared with the results I obtained using --max_line_width. Depending on the length of the words, subtitle lines can be longer or shorter, but you can always expect a maximum number of words per line. Additionally, --max_words_per_line respects the end of segments, so when a sentence finishes and there is a small gap before the next one starts, subtitle lines won't join the end of one sentence with the start of the next, and they will spend less time hanging on the screen. That means that final lines can have fewer words than the number you set, like:

For --max_words_per_line 3

1
00:00:03,820 --> 00:00:05,380
One two three.

2
00:00:06,300 --> 00:00:07,120
One two three.

3
00:00:07,120 --> 00:00:07,840
One two three

4
00:00:07,840 --> 00:00:08,300
four.

Code example

import whisper
from whisper.utils import get_writer

model = whisper.load_model('base')
audio = whisper.load_audio("test.wav")
result = whisper.transcribe(model, audio, language="en", word_timestamps=True)

srt_writer = get_writer("srt", ".")
srt_writer(result, "test.wav", {"max_words_per_line":3})

Still, if you want a more predictable subtitle length, --max_line_width is the better choice.

Notice that you need word_timestamps=True to make it work.
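
To make the grouping behaviour concrete, here is a small sketch of the idea: words are chunked per segment, so a chunk never crosses a segment boundary and the last chunk of a segment may be shorter. It only illustrates the behaviour described above and is not the code from the PR. `group_words` is a hypothetical helper, the timings are made up, and each segment is assumed to carry a `words` list with "word", "start" and "end" keys, where each word text starts with a space.

```python
def group_words(segments, max_words_per_line=3):
    """Return (start, end, text) cues with at most max_words_per_line words."""
    cues = []
    for segment in segments:
        words = segment["words"]
        # Chunking restarts at every segment, so lines never span two segments.
        for i in range(0, len(words), max_words_per_line):
            chunk = words[i:i + max_words_per_line]
            text = "".join(w["word"] for w in chunk).strip()
            cues.append((chunk[0]["start"], chunk[-1]["end"], text))
    return cues


segments = [
    {"words": [
        {"word": " One", "start": 7.12, "end": 7.30},
        {"word": " two", "start": 7.30, "end": 7.55},
        {"word": " three", "start": 7.55, "end": 7.84},
        {"word": " four.", "start": 7.84, "end": 8.30},
    ]},
]
for start, end, text in group_words(segments):
    print(f"{start:.2f} --> {end:.2f}  {text}")
```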

2 replies

smit-io Apr 6, 2024

This is a good solution

Francoyy May 27, 2024

Personally this solution doesn't work well for me. It always tries to maximize the amount of words in each line, even when it should cut earlier to respect the natural break of sentences.

doggy8088 · 2024-03-04T07:06:09Z

doggy8088
Mar 4, 2024

Keeping subtitles to 42 characters per line seems too restrictive. Because some speakers talk fast, it will be difficult to read the subtitles.

1 reply

ryanheise Mar 4, 2024

Keeping subtitles to 42 characters per line seems too restrictive. Because some speakers talk fast, it will be difficult to read the subtitles.

The 42-character line limit comes from Netflix's requirements, which only limit the number of characters within a line. There are also typically two lines per subtitle, and thus up to 84 characters per subtitle. Whisper has command line options that let you set the maximum number of characters per line and also the maximum number of lines per subtitle, so you can adjust these as desired. For example, for CJK languages, which have wider, denser characters, the limit would be approximately half of those 42 characters.
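
If a single limit has to cover mixed text, one option is to measure display width instead of character count, counting wide East Asian characters as two columns. This is a sketch of that idea using the standard library only. `display_width` is a hypothetical helper, not a Whisper option, and it ignores combining marks and other zero-width characters.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate column width: wide and fullwidth characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


for sample in ("subtitle", "字幕"):
    print(sample, len(sample), display_width(sample))
```

Comparing `display_width(line)` against 42 then gives CJK lines roughly half as many characters as Latin-script lines.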

ksn-systems · 2024-03-04T07:31:02Z

ksn-systems
Mar 4, 2024

The original teletext standard was a max of 32 chars per line. 42 chars, including white space chars is a good option and works well for TV, both broadcast and online.


0 replies

smit-io · 2024-04-06T23:38:40Z

smit-io
Apr 6, 2024

0 replies

Replies: 22 comments · 49 replies
