Speaker Count #568

bitcarousel · 2021-01-13T12:44:26Z

bitcarousel
Jan 13, 2021

Is it correct that the number of speakers in an audio file is determined only by the speaker embedding model?

Following this tutorial for speaker embeddings, t-SNE shows 4 clusters, but the ground truth is 3 speakers.
https://github.com/pyannote/pyannote-audio/tree/master/tutorials/pretrained/model

Is this just bit error rate?
Does anyone know whether the number of speaker embedding clusters equals the number of speakers in an audio file?
Are there any recommendations for better clustering algorithms (k-means, etc.)?

Thanks for any answer

hbredin · 2021-01-14T07:52:57Z

hbredin
Jan 14, 2021
Maintainer

pyannote.audio does speaker diarization and not speaker counting.

Of course, one can hope that speaker diarization will be perfect and contain the correct number of speakers, but that is seldom the case (especially for a large number of speakers).

One reason is that the speaker diarization pipeline is optimized for diarization error rate and not speaker count.
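To make the point about the optimization target concrete, here is a minimal sketch of a frame-level diarization error rate. It is not pyannote's own metric implementation: it assumes one label per frame, no overlapping speech, and a brute-force speaker mapping, and the names `frame_der`, `reference` and `hypothesis` are made up for this illustration. The toy hypothesis has one spurious fourth speaker, yet its error rate stays small.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER: (missed + false alarm + confusion) / reference speech.

    Both inputs are equal-length lists with one label per frame,
    where None marks a non-speech frame (no overlapping speech).
    """
    assert len(reference) == len(hypothesis)
    pairs = list(zip(reference, hypothesis))
    total_speech = sum(r is not None for r, _ in pairs)
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})
    # Pad with None so that extra hypothesis speakers may stay unmapped.
    candidates = ref_speakers + [None] * len(hyp_speakers)

    # Brute-force search for the best one-to-one speaker mapping
    # (fine for a handful of speakers, too slow for many).
    best_correct = 0
    for perm in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total_speech


# Toy data: 3 true speakers, 4 hypothesized speakers.
reference = ["A"] * 40 + ["B"] * 30 + ["C"] * 30
hypothesis = ["s1"] * 40 + ["s2"] * 30 + ["s3"] * 27 + ["s4"] * 3

der = frame_der(reference, hypothesis)
num_hyp_speakers = len(set(hypothesis))
print(f"DER = {der:.2f}, hypothesized speakers = {num_hyp_speakers}")
```

In this toy case only 3 of 100 speech frames are attributed to the wrong speaker, so a system tuned to minimize this quantity has little incentive to remove the extra cluster, even though the speaker count is off by one.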

If speaker count is really what you are looking for (and you do not care about speaker diarization), I'd suggest you train a model to do just that. Unfortunately, this is not currently implemented in pyannote.audio. The upcoming v2.0 might change that by making it very easy to design and train models for new tasks.


bitcarousel · 2021-01-15T10:33:27Z

bitcarousel
Jan 15, 2021
Author

Thanks for the answer

However, diarization answers the question "who speaks when", and speaker counting is just the "who".
So shouldn't diarization already provide the "who", such that counting the distinct speakers gives the speaker count? Or am I completely wrong?

I understand that models trained specifically for speaker counting may do better (especially for a large number of speakers), but I think it should also be possible to solve the speaker count problem through diarization.


hbredin Jan 15, 2021
Maintainer

You are perfectly right that solving speaker diarization would also imply solving the speaker count problem -- but not the other way around. In my experience, it is almost always better to train a model directly for the final task of interest.
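As a rough illustration of the "diarization implies count" direction, the sketch below derives a speaker count from a diarization output. The `(start, end, label)` tuple format, the function name `count_speakers` and the `min_total_duration` threshold are assumptions made for this example, not part of pyannote.audio. The threshold is only a heuristic for ignoring labels with very little speech and would need tuning on real data.

```python
from collections import defaultdict


def count_speakers(segments, min_total_duration=0.0):
    """Count distinct speaker labels in a diarization output.

    segments: iterable of (start, end, label) tuples, times in seconds.
    min_total_duration: heuristic threshold (an assumption of this sketch);
        labels with less total speech than this are not counted.
    """
    totals = defaultdict(float)
    for start, end, label in segments:
        totals[label] += end - start
    return sum(1 for duration in totals.values() if duration >= min_total_duration)


# Hypothetical diarization output with one very short extra speaker "D".
segments = [
    (0.0, 10.0, "A"),
    (10.0, 18.0, "B"),
    (18.0, 30.0, "C"),
    (30.0, 30.5, "D"),
    (30.5, 40.0, "A"),
]

raw_count = count_speakers(segments)
filtered_count = count_speakers(segments, min_total_duration=1.0)
print(raw_count, filtered_count)
```

The count is only as good as the diarization it is read from, which is why a spurious cluster shows up directly as an extra speaker unless something like the duration filter removes it.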

jbluhme Mar 25, 2021

Hi!

Follow-up question on this: if one knows the total number of speakers in an audio file and only wants to solve the problem of who spoke when, is it possible to pass this information to the speaker diarization pipeline? For example, I have a file containing audio from a doctor's appointment in which only the doctor and the patient speak. Is it possible for me to somehow fix the number of clusters at 2? The result I get when applying the pipeline to my file is 3 speakers instead of 2. Thank you very much!

Joel
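For readers wondering what "fixing the number of clusters" means in practice, here is a generic sketch of agglomerative clustering that stops once a requested number of clusters remains. It is not the pyannote.audio pipeline or its API: the function names, the average-linkage and cosine-distance choices, and the 2-D toy "embeddings" are all assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_fixed_k(embeddings, num_clusters):
    """Average-linkage agglomerative clustering stopped at num_clusters."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        # Mean pairwise distance between members of clusters a and b.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in a for j in b
        )
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Find and merge the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D "embeddings": two near (1, 0), two near (0, 1), one in between.
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.1, 1.0), (0.0, 0.9), (0.4, 1.0)]

labels_two = cluster_fixed_k(embeddings, num_clusters=2)
labels_three = cluster_fixed_k(embeddings, num_clusters=3)
print(labels_two, labels_three)
```

With three clusters the in-between point ends up alone, which resembles the 3-speakers-instead-of-2 symptom described above. Forcing two clusters merges it into the nearest group.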

MarvinLvn Mar 25, 2021

Thanks Rachid for the promotion!
I'll just add that, if you're interested in exploring this approach, a tutorial on how to train, validate, and apply such models can be found here:
https://github.com/MarvinLvn/pyannote-audio/tree/voice_type_classifier/tutorials/models/multilabel_detection
