
Timestamps precision in milliseconds? #303

Closed
mirix opened this issue Jun 15, 2023 · 3 comments

mirix commented Jun 15, 2023

Hello,

I am using the sample code provided:

```python
from faster_whisper import WhisperModel

model_size = 'large-v2'
model = WhisperModel(model_size, device='cpu', compute_type='int8')

segments, info = model.transcribe('Michael, Jim, Dwight epic scene [qHrN5Mf5sgo].mp3', beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print('[%.2fs -> %.2fs] %s' % (segment.start, segment.end, segment.text))
```

And the timestamp precision seems to be one second:

```
Detected language 'en' with probability 0.988083
[0.00s -> 7.00s]  Here's what's going to happen. I am going to have to fix you, manage you to, on a more
[7.00s -> 13.00s]  personal scale, a more micro form of management. Jim, what is that called?
[13.00s -> 14.00s]  Micro Jimin.
[14.00s -> 19.00s]  Boom. Yes. Now Jim is going to be the client. Dwight, you're going to have to sell to him
[19.00s -> 24.00s]  without being aggressive, hostile, or difficult. Let's go.
[24.00s -> 28.00s]  All right, fine. Ring, ring.
[28.00s -> 29.00s]  Hello?
```

Would it be possible to report milliseconds?

Another, unrelated, question: if I wished to perform an analysis per segment (say, gender, sentiment, or emotion), how should I use the segment object?
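For reference, one way to do per-segment analysis is to slice the audio for each segment out of the waveform and feed that chunk to the downstream classifier. This is only a sketch under the assumption that the audio has already been loaded as a NumPy array with a known sample rate (e.g. via soundfile or librosa); `segment.start` and `segment.end` are the floats faster-whisper returns:

```python
import numpy as np

def slice_segment(audio: np.ndarray, sample_rate: int,
                  start: float, end: float) -> np.ndarray:
    """Return the samples belonging to one transcription segment."""
    return audio[int(start * sample_rate):int(end * sample_rate)]

# toy example: 2 seconds of silence at 16 kHz
audio = np.zeros(32000, dtype=np.float32)
chunk = slice_segment(audio, 16000, 0.5, 1.25)
print(len(chunk))  # 12000 samples (0.75 s at 16 kHz)
```

Each returned chunk can then be passed to a gender/sentiment/emotion model alongside `segment.text`.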

Furthermore, I have tried numerous approaches to speaker diarization (I could not try NeMo-based ones because I do not have an adequate GPU), and in certain scenarios all of them yield very poor speaker attribution. I am considering a brute-force approach: can anyone recommend a library I could use to compare a segment with the previous one in order to determine whether or not it is the same speaker?
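As a concrete illustration of the brute-force idea (hypothetical helper names; the per-segment embeddings would come from any speaker-embedding model, e.g. an ECAPA-TDNN): compare each segment's embedding to the previous one by cosine similarity and start a new speaker label whenever the similarity drops below a threshold:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_speakers(embeddings, threshold: float = 0.75):
    """Assign a speaker id per segment: a new id is opened whenever
    similarity to the previous segment's embedding falls below threshold."""
    labels, current = [], 0
    for i, emb in enumerate(embeddings):
        if i > 0 and cosine_similarity(embeddings[i - 1], emb) < threshold:
            current += 1
        labels.append(current)
    return labels

# toy embeddings: two clearly different "voices"
v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(label_speakers([v1, v1, v2, v2]))  # [0, 0, 1, 1]
```

Note this only detects speaker *changes*; a returning speaker gets a fresh id, so real diarization would additionally cluster the embeddings globally.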

Best,

Ed

hoonlight (Contributor) commented Jun 15, 2023

Setting `word_timestamps=True` causes segment timestamps to be reported with millisecond precision. Even if you don't use the word timestamps themselves, setting this option makes the timestamps of all segments more precise.

For diarization you can try the method implemented in the repo below.

https://github.com/JaesungHuh/SimpleDiarization

guillaumekln (Contributor) commented

Even without word timestamps, the Whisper model can predict timestamps to 10-millisecond precision, but one of the Whisper authors said that "the predicted timestamps tend to be biased towards integers" (source).


mirix commented Jun 15, 2023

> Setting word_timestamps=True causes the timestamps of segments to be displayed in milliseconds. Even if you don't use word timestamps, setting this option will make the timestamps of all segments more precise in milliseconds.
>
> For diarization you can try the method implemented in the repo below.
>
> https://github.com/JaesungHuh/SimpleDiarization

Thanks for the tips. Indeed, adding the `word_timestamps` keyword produces a precision of 10 milliseconds.

I tried the library you suggested, but it seems it does not work for more than two speakers. Or perhaps I am doing something wrong. We will see:

JaesungHuh/SimpleDiarization#1

I have tried many diarization strategies, but so far everything based on pyannote fails:

pyannote/pyannote-audio#1406
