Improve timestamp heuristics. #1461

ryanheise · 2023-06-21T15:25:28Z

Further improvements to the timestamp heuristics.

In that example, the first two words of a segment were significantly stretched. There is an existing heuristic to handle this, but only for the first two words of a "window". Therefore, this PR moves the timestamp heuristics out of find_alignment (where the segment boundaries are unavailable), and moves it out one level to add_word_timestamps where the segment boundaries are known. Since some of the heuristics were already here, this effectively moves all of the timestamp heuristics into the same place.

The same heuristic was also made more robust by only applying it if it looks like there was a pause between this segment and the last. If the two segments are tightly packed together, the timestamps are left as is.

I also continued work from one of the heuristics in #1114 . In particular, a heuristic tries to pick the segment end timestamp over the word end timestamp if the word timestamp looks far off where it should be. This PR adds a mirror heuristic for picking between the "start" timestamp from the segment vs the word.

I would note that there is a magic number of 0.5 in the code:

            # prefer the segment-level start timestamp if the first word is too long.
            if (
                segment["start"] < words[0]["end"]
                and segment["start"] - 0.5 > words[0]["start"]
            ):
                words[0]["start"] = max(
                    0, min(words[0]["end"] - median_duration, segment["start"])
                )
            else:
                segment["start"] = words[0]["start"]

The 0.5 number comes from the fact that segment timestamps can sometimes be unreliable and snap to the nearest integer, and half of the time they will snap one way, and the other half the other way. Thus if a segment start timestamp snaps to the right, it could overshoot and cause the subtitle to come in late, if we let this heuristic get too close. The 0.5 creates a buffer to prevent that. In the end, we don't know whether or not we can trust the segment timestamp, but outside of that unsafe 0.5 zone, we can at least consider the segment start timestamp to be MORE reliable than the word timestamp if the word timestamp was too far in the past, indicating probable poor alignment, which is known to happen. Similar logic applies to the other heuristic for the segment end timestamp.

ryanheise · 2023-06-23T01:22:43Z

GitHub workflow seems to be stuck on these two:

whisper-test (3.8, 1.10.2) Expected — Waiting for status to be reported
Required
whisper-test (3.9, 1.10.2) Expected — Waiting for status to be reported
Required

hoonlight · 2023-08-03T09:00:51Z

Hi, thanks for the good changes. The timestamps are much improved.

One thing, what do you think about setting a default value when len(word_durations) < 3 or median_duration value is too large, in my case 0.5 ~ 0.75 worked fine, I'm curious to know your opinion.

Also, when the gap between the first and second word was too large, setting the start of the second word to segment start instead of the original timestamp improved it, though I'm not sure why.

ryanheise · 2023-08-03T09:26:48Z

The heuristic could well be sensitive to different median word durations, although I don't currently have a good intuition on how. In the case of anchoring the second word to the segment start, so that the first word would be before the segment start, that would imply that the segment start was wrong (it should have been earlier). One other way to handle this case (if it's important that a segment start never comes in late) is to add a bit of padding to the start and end of every segment.

hoonlight · 2023-08-03T09:49:00Z

Thanks for the feedback. I understand what you mean.

I'm not very good at English, so I made some mistakes. Let me try to explain it again.

After applying the heuristic, if the gap between the first and second word was too long (in some cases more than a few tens of seconds), I ignored the timestamp of the first word and reset the start of the second word to the start of the segment, like this

if words[1]["start"] - words[0]["end"] > max_duration * 2: 
    segment["start"] = words[1]["start"]

However, this is too crude a workaround, so perhaps segmenting the sentence based on the distance between words would be worth a try.

Anyway, thanks for the good PR, I've made a lot of improvements thanks to your ideas.

* Improve timestamp heuristics. * Track pauses with last_speech_timestamp

Improve timestamp heuristics.

814f5e0

ryanheise marked this pull request as draft June 21, 2023 15:26

Track pauses with last_speech_timestamp

2d9e7d1

ryanheise marked this pull request as ready for review June 22, 2023 11:08

jongwook merged commit f572f21 into openai:main Jun 29, 2023

hoonlight mentioned this pull request Jul 5, 2023

Improve timestamp heuristics SYSTRAN/faster-whisper#336

Merged

abyesilyurt pushed a commit to abyesilyurt/whisper that referenced this pull request Nov 13, 2023

Improve timestamp heuristics. (openai#1461)

c2fb145

* Improve timestamp heuristics. * Track pauses with last_speech_timestamp

ryanheise deleted the improve-timestamps branch November 18, 2023 03:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve timestamp heuristics. #1461

Improve timestamp heuristics. #1461

ryanheise commented Jun 21, 2023 •

edited

Loading

ryanheise commented Jun 23, 2023

hoonlight commented Aug 3, 2023 •

edited

Loading

ryanheise commented Aug 3, 2023

hoonlight commented Aug 3, 2023 •

edited

Loading

Improve timestamp heuristics. #1461

Improve timestamp heuristics. #1461

Conversation

ryanheise commented Jun 21, 2023 • edited Loading

ryanheise commented Jun 23, 2023

hoonlight commented Aug 3, 2023 • edited Loading

ryanheise commented Aug 3, 2023

hoonlight commented Aug 3, 2023 • edited Loading

ryanheise commented Jun 21, 2023 •

edited

Loading

hoonlight commented Aug 3, 2023 •

edited

Loading

hoonlight commented Aug 3, 2023 •

edited

Loading