Squash long words at window and sentence boundaries. #1114

Merged Apr 11, 2023 (4 commits)

Conversation

@ryanheise (Contributor) commented Mar 18, 2023

This PR improves the heuristic to squash long words at window/sentence boundaries.

The previous heuristic squashed either of the first two words of a window if it was too long, since long words at the start tend to indicate words being stretched out to cover silence at the start of the window.

This PR makes the following three improvements (a sketch of the first one follows the list):

  1. It adds a heuristic to detect long words at sentence boundaries in the middle of the window, and takes these to be indicative of words being stretched to cover silent gaps between sentences. The long boundary words will be squashed appropriately. This prevents many cases where the first word in a sentence comes in seconds too early.
  2. It modifies the original heuristic that squashed either of the first two words if they were long, so that it now won't squash the second word unless the first word was also long. The length of a word mid-window is less significant if it comes after a short word, except at a sentence boundary, which case 1 above already handles.
  3. If token inference doesn't complete the last sentence in a window, that last sentence might be discarded, and timestamp alignment may then come up with a super elongated duration for the last word to cover the span where that sentence would have been. Since that last word's end timestamp determines the start of the next window, a chunk of that last sentence will be skipped over. In such cases, we can still get an accurate timestamp for the end of the segment from the last timestamp token in the finished sequence. So if the last word of the window is super elongated and the last timestamp token looks reasonable, we prefer the latter.
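
A minimal sketch of the first heuristic, assuming each word is a dict with "word", "start" and "end" keys as produced by word-level alignment (the names and the sentence test here are illustrative, not the actual whisper/timing.py code):

    SENTENCE_END = ".!?"

    def squash_sentence_boundary_words(words, max_duration):
        # Walk consecutive word pairs; a long word right after a sentence
        # boundary is taken to be covering a silent gap between sentences.
        for prev, curr in zip(words, words[1:]):
            at_boundary = prev["word"].rstrip().endswith(tuple(SENTENCE_END))
            too_long = curr["end"] - curr["start"] > max_duration
            if at_boundary and too_long:
                # The end timestamp is usually reliable, so keep it and pull
                # the start forward out of the silent gap.
                curr["start"] = curr["end"] - max_duration
        return words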

Test example: https://audio2.redcircle.com/episodes/6b196013-8672-43d9-be52-4332b3207d93/stream.mp3

BEFORE

69-whiskey_subbed-35-orig.mp4

AFTER

69-whiskey_subbed-35-pr.mp4

Example of 1. (detecting a gap between sentences mid-window)

In this example, both segments are within the same window. The audio exhibits a 3-second pause between sentences, but the timestamps elongate the first word of the second sentence ("Along") to cover that silence.

Before:

49
00:00:28,860 --> 00:00:30,100
and unapologetic<u>.</u>

50
00:00:30,100 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group

After:

49
00:00:28,860 --> 00:00:29,620
and unapologetic<u>.</u>

50
00:00:32,900 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group

Example of 2. (first 2 words of a window incorrectly classified as long)

In this example, the second segment starts a new window. However, the second word of this window ("unapologetic") is naturally long while the word before it ("and") is short, so the heuristic incorrectly classifies it. As a result, "and unapologetic" gets transported 1.5 seconds into the future after squashing.

Before:

46
00:00:26,580 --> 00:00:27,180
A show once restrained by rules and boundaries now comes straight to you raw,<u> uncensored</u>

47
00:00:28,580 --> 00:00:29,340
<u>and</u> unapologetic.

48
00:00:29,340 --> 00:00:28,860
and<u> unapologetic</u>.

49
00:00:28,860 --> 00:00:30,100
and unapologetic<u>.</u>

50
00:00:30,100 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group

After:

46
00:00:26,580 --> 00:00:27,180
A show once restrained by rules and boundaries now comes straight to you raw,<u> uncensored</u>

47
00:00:27,180 --> 00:00:27,480
<u>and</u> unapologetic.

48
00:00:27,480 --> 00:00:28,860
and<u> unapologetic</u>.

49
00:00:28,860 --> 00:00:29,620
and unapologetic<u>.</u>

50
00:00:32,900 --> 00:00:33,660
<u>Along</u> with co-host Matt and various guests of The 69 Whiskey Army, this dynamic group

It's a minor change to the original heuristic, but I didn't want to modify it too much since I didn't have the original test cases that may have inspired it. My inclination would have been to get rid of it and generalise (1) to handle both sentence and window boundaries the same way, since I have never seen a case where the first TWO words were both elongated; I've only seen the words immediately touching a boundary become elongated (at least since the introduction of the new word-level timestamps feature).

So I've left this in there for now, and if you had test cases where the first two words were elongated, you might be able to come back to this point.

Example of 3. (end of window being skipped and last word incorrectly elongated)

In this example, the second segment starts a new window. At the end of the first window, the word "you're" is elongated to almost 4 seconds, which masks the words "going to need to understand because" from the original audio; we never see them transcribed because the next window starts too late and skips over them.

Before:

1244
00:07:47,980 --> 00:07:51,340
and girls, these are all the things your parents didn't want you to understand about, but<u> you're</u>

1245
00:07:54,800 --> 00:07:55,280
<u>okay</u>.

After:

1245
00:07:47,980 --> 00:07:48,360
and girls, these are all the things your parents didn't want you to understand about, but<u> you're</u>

1246
00:07:48,360 --> 00:07:48,520
<u>going</u> to need to understand because it's okay.

@ryanheise marked this pull request as draft March 21, 2023 06:11
@ryanheise (Contributor, Author)

So I've left this in there for now, and if you had test cases where the first two words were elongated, you might be able to come back to this point.

I've converted this PR to a draft because I think case 2 (i.e. the original heuristic) probably should be improved further. It looks like it could make the 2nd and 3rd words of a window overlap, and it also looks like it is still susceptible to transporting the first two words a long distance if the 3rd word of a window begins a new sentence.

@ryanheise (Contributor, Author)

In the original heuristic, start_times[1] could be shifted right beyond both end_times[1] and start_times[2]:

        if len(word_durations) >= 2 and word_durations[1] > max_duration:
            # Bug: end_times[2] is the third word's end time, not the second's.
            boundary = max(end_times[2] / 2, end_times[2] - max_duration)
            end_times[0] = start_times[1] = boundary

I think that was supposed to be:

        if len(word_durations) >= 2 and word_durations[1] > max_duration:
            boundary = max(end_times[1] / 2, end_times[1] - max_duration)
            end_times[0] = start_times[1] = boundary
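
Some hypothetical numbers (not taken from the audio above, and assuming contiguous words so that start_times[i] == end_times[i - 1]) make the difference concrete:

    end_times = [0.5, 3.0, 12.0]   # third word elongated by misalignment
    max_duration = 2.0             # second word lasts 2.5 s, so the branch fires

    # Original: boundary = max(12.0 / 2, 12.0 - 2.0) = 10.0, which lands beyond
    # both end_times[1] (3.0) and start_times[2] (3.0).
    # Fixed:    boundary = max(3.0 / 2, 3.0 - 2.0) = 1.5, which stays inside
    # the second word as intended.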

@ryanheise marked this pull request as ready for review March 21, 2023 13:17
@ryanheise (Contributor, Author)

I fixed the above issue and a similar one, so I've removed the draft status of the PR.

@seba-aguila commented Mar 28, 2023

Hi Ryan, I tried your improvements and they work really well. However, I wanted to ask how you got the subtitles with the underlines highlighting the words, because passing the srt file straight to moviepy shows the tags instead of underlining the words. I thought it was due to the font I was using, but it was not. Hope you can help me.

@ryanheise (Contributor, Author)

Thanks, @seba-aguila . To generate the video above, I used:

ffmpeg -f lavfi -i color=size=720x120:rate=25:color=black -i input.mp3 -vf "subtitles=input.srt:force_style='Fontsize=70'" -shortest output.mp4
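
If it helps, the difference is most likely that ffmpeg's subtitles filter converts the SRT tags into ASS styling and renders them with libass, which understands <u>, whereas moviepy's TextClip draws the raw text, so the tags show up literally.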

@jongwook merged commit 255887f into openai:main Apr 11, 2023
zackees pushed a commit to zackees/whisper that referenced this pull request May 5, 2023
* Squash long words at window and sentence boundaries.

* Formatting requirements.

* Fix squashing logic to point to correct words.

---------

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
ilanit1997 pushed a commit to ilanit1997/whisper that referenced this pull request May 16, 2023 (same commit message as above).
abyesilyurt pushed a commit to abyesilyurt/whisper that referenced this pull request Nov 13, 2023 (same commit message as above).
@ryanheise deleted the truncate-long-words branch November 18, 2023 03:44