Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve timestamp heuristics. #1461

Merged
merged 2 commits into from
Jun 29, 2023
Merged

Conversation

ryanheise
Copy link
Contributor

@ryanheise ryanheise commented Jun 21, 2023

Further improvements to the timestamp heuristics.

Prompted by: #1378 (reply in thread)

In that example, the first two words of a segment were significantly stretched. There is an existing heuristic to handle this, but only for the first two words of a "window". Therefore, this PR moves the timestamp heuristics out of find_alignment (where the segment boundaries are unavailable), and moves it out one level to add_word_timestamps where the segment boundaries are known. Since some of the heuristics were already here, this effectively moves all of the timestamp heuristics into the same place.

The same heuristic was also made more robust by only applying it if it looks like there was a pause between this segment and the last. If the two segments are tightly packed together, the timestamps are left as is.

I also continued work from one of the heuristics in #1114 . In particular, a heuristic tries to pick the segment end timestamp over the word end timestamp if the word timestamp looks far off where it should be. This PR adds a mirror heuristic for picking between the "start" timestamp from the segment vs the word.

I would note that there is a magic number of 0.5 in the code:

            # prefer the segment-level start timestamp if the first word is too long.
            if (
                segment["start"] < words[0]["end"]
                and segment["start"] - 0.5 > words[0]["start"]
            ):
                words[0]["start"] = max(
                    0, min(words[0]["end"] - median_duration, segment["start"])
                )
            else:
                segment["start"] = words[0]["start"]

The 0.5 number comes from the fact that segment timestamps can sometimes be unreliable and snap to the nearest integer, and half of the time they will snap one way, and the other half the other way. Thus if a segment start timestamp snaps to the right, it could overshoot and cause the subtitle to come in late, if we let this heuristic get too close. The 0.5 creates a buffer to prevent that. In the end, we don't know whether or not we can trust the segment timestamp, but outside of that unsafe 0.5 zone, we can at least consider the segment start timestamp to be MORE reliable than the word timestamp if the word timestamp was too far in the past, indicating probable poor alignment, which is known to happen. Similar logic applies to the other heuristic for the segment end timestamp.

@ryanheise ryanheise marked this pull request as draft June 21, 2023 15:26
@ryanheise ryanheise marked this pull request as ready for review June 22, 2023 11:08
@ryanheise
Copy link
Contributor Author

GitHub workflow seems to be stuck on these two:

whisper-test (3.8, 1.10.2) Expected — Waiting for status to be reported
Required
whisper-test (3.9, 1.10.2) Expected — Waiting for status to be reported
Required 

@hoonlight
Copy link

hoonlight commented Aug 3, 2023

Hi, thanks for the good changes. The timestamps are much improved.

One thing, what do you think about setting a default value when len(word_durations) < 3 or median_duration value is too large, in my case 0.5 ~ 0.75 worked fine, I'm curious to know your opinion.

Also, when the gap between the first and second word was too large, setting the start of the second word to segment start instead of the original timestamp improved it, though I'm not sure why.

@ryanheise
Copy link
Contributor Author

The heuristic could well be sensitive to different median word durations, although I don't currently have a good intuition on how. In the case of anchoring the second word to the segment start, so that the first word would be before the segment start, that would imply that the segment start was wrong (it should have been earlier). One other way to handle this case (if it's important that a segment start never comes in late) is to add a bit of padding to the start and end of every segment.

@hoonlight
Copy link

hoonlight commented Aug 3, 2023

Thanks for the feedback. I understand what you mean.

I'm not very good at English, so I made some mistakes. Let me try to explain it again.

After applying the heuristic, if the gap between the first and second word was too long (in some cases more than a few tens of seconds), I ignored the timestamp of the first word and reset the start of the second word to the start of the segment, like this

if words[1]["start"] - words[0]["end"] > max_duration * 2: 
    segment["start"] = words[1]["start"]

However, this is too crude a workaround, so perhaps segmenting the sentence based on the distance between words would be worth a try.

Anyway, thanks for the good PR, I've made a lot of improvements thanks to your ideas.

abyesilyurt pushed a commit to abyesilyurt/whisper that referenced this pull request Nov 13, 2023
* Improve timestamp heuristics.

* Track pauses with last_speech_timestamp
@ryanheise ryanheise deleted the improve-timestamps branch November 18, 2023 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants