Timestamps precision in milliseconds? #303
For diarization you can try the method implemented in the repo below.
Even without word timestamps, the Whisper model can predict timestamps with 10 millisecond precision, but one of the Whisper authors said that "the predicted timestamps tend to be biased towards integers" (source).
Thanks for the tips. Indeed, adding the word_timestamps keyword produces a precision of 10 milliseconds. I tried the library you suggested, but it seems it does not work for more than two speakers. Or perhaps I am wrong. We will see: JaesungHuh/SimpleDiarization#1 I have tried many diarization strategies, but so far everything based upon pyannote fails.
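For reference, a minimal sketch of the word-level loop, assuming the faster-whisper-style WhisperModel API discussed here; the model size, device, and audio path are placeholders:

```python
from faster_whisper import WhisperModel

# Placeholder model size, device, and audio path; adjust to your setup.
model = WhisperModel("small", device="cpu", compute_type="int8")

# word_timestamps=True attaches per-word start/end times (floats, in seconds)
# to each segment, which gives sub-second resolution.
segments, info = model.transcribe("audio.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.3f}s -> {word.end:.3f}s] {word.word}")
```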
Hello,
I am using the sample code provided, and the timestamp precision of the reported segments seems to be one second.
Would it be possible to report milliseconds?
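For context, a minimal sketch of the kind of segment-level loop in question, assuming the WhisperModel API from this repo (model size and audio path are placeholders); segment.start and segment.end are floats in seconds, so they can at least be printed with millisecond formatting, even if the predicted values tend to land on round numbers:

```python
from faster_whisper import WhisperModel

# Placeholder model size and audio path; adjust to your setup.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", beam_size=5)

for segment in segments:
    # start and end are floats (seconds); format them explicitly to milliseconds.
    print(f"[{segment.start:.3f}s -> {segment.end:.3f}s] {segment.text}")
```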
Another, unrelated question: if I wished to perform per-segment analysis (say, gender, sentiment, or emotion), how should I use the segment object?
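As an illustration only: each segment exposes at least start, end, and text, so a per-segment analysis could pass the text (or the corresponding audio slice) to whatever classifier is chosen. classify_sentiment below is a hypothetical placeholder, not a real library call:

```python
from faster_whisper import WhisperModel

def classify_sentiment(text: str) -> str:
    # Hypothetical placeholder; swap in a real gender/sentiment/emotion model.
    return "positive" if "thanks" in text.lower() else "neutral"

# Placeholder model size and audio path; adjust to your setup.
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav")

for segment in segments:
    label = classify_sentiment(segment.text)
    print(f"{segment.start:.3f}-{segment.end:.3f}s {label}: {segment.text}")
```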
Furthermore, I have tried numerous approaches for speaker diarization (I could not try NeMo-based ones because I do not have an adequate GPU), and all of them yield very bad results for speaker attribution in certain scenarios. I am considering a brute-force approach; any recommendations for a library I could use to compare a segment with the previous one in order to determine whether or not it is the same speaker?
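To make the brute-force idea concrete, one possible sketch (a suggestion, not something provided by this repo) is to run a pretrained speaker-verification model such as SpeechBrain's ECAPA-TDNN on each pair of consecutive segments; the wav file names below are placeholders for per-segment audio exports:

```python
from speechbrain.pretrained import SpeakerRecognition
# Note: in newer SpeechBrain versions this class lives in speechbrain.inference.speaker.

# Pretrained ECAPA-TDNN speaker-verification model (runs on CPU).
verification = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Placeholder paths: one wav file per transcribed segment, in order.
segment_files = ["seg_000.wav", "seg_001.wav", "seg_002.wav"]

for prev, curr in zip(segment_files, segment_files[1:]):
    # verify_files returns a similarity score and a same-speaker decision.
    score, same_speaker = verification.verify_files(prev, curr)
    print(f"{prev} vs {curr}: score={float(score):.3f} same={bool(same_speaker)}")
```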
Best,
Ed