Inconsistent result when using different embedding config #1451

Closed
leohuang2013 opened this issue Aug 27, 2023 · 4 comments
@leohuang2013

leohuang2013 commented Aug 27, 2023

I used the following code to do speaker diarization:

from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                    use_auth_token="ACCESS_TOKEN_GOES_HERE")


# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

After reading the source code of pyannote/audio/core/pipeline.py, I noticed that the embedding model name is read from the config.yaml file in the local model folder, and there are two embedding models that can be configured:

  1. pyannote/embedding
  2. speechbrain/spkrec-ecapa-voxceleb

--- pyannote/embedding

=================================================
pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    embedding: pyannote/embedding
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    segmentation: pyannote/segmentation@2022.07
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 15
    threshold: 0.7153814381597874
  segmentation:
    min_duration_off: 0.5817029604921046
    threshold: 0.4442333667381752
=================================================

--- speechbrain/spkrec-ecapa-voxceleb

=================================================
pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    embedding: speechbrain/spkrec-ecapa-voxceleb
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    segmentation: pyannote/segmentation@2022.07
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 15
    threshold: 0.7153814381597874
  segmentation:
    min_duration_off: 0.5817029604921046
    threshold: 0.4442333667381752
=================================================

So I changed the embedding value in config.yaml to each of the above and tested with a wav file. I found that the second model, speechbrain/spkrec-ecapa-voxceleb, gave a more accurate result. My question is: what else can I change to make the diarization result of pyannote/embedding as accurate as speechbrain/spkrec-ecapa-voxceleb? The reason I want to use pyannote/embedding is that it is much faster than speechbrain/spkrec-ecapa-voxceleb: 6 s vs. 15 s for the audio.wav file (attached as audio.wav.zip) on my machine. A sketch of how I load the edited local config is below.
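For reference, here is roughly how I load the pipeline from the edited local config.yaml. The path is just a placeholder for where the model folder sits on my machine; if I read pyannote/audio/core/pipeline.py correctly, from_pretrained also accepts a local path to a config.yaml.

from pyannote.audio import Pipeline

# Load the pipeline from a local config.yaml whose "embedding:" line has been
# switched between pyannote/embedding and speechbrain/spkrec-ecapa-voxceleb
# (the path is a placeholder).
pipeline = Pipeline.from_pretrained("models/speaker-diarization/config.yaml")

# Run on the same test file to compare the two embedding models.
diarization = pipeline("audio.wav")
print(diarization)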

@github-actions

Thank you for your issue. You might want to check the FAQ if you haven't done so already.

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

@hbredin
Member

hbredin commented Aug 28, 2023

  • pyannote/embedding will give you worse results and that is expected because the speechbrain model is better (it might change in the future -- no ETA, though).
  • the clustering/threshold you are using is suited for speechbrain but not for pyannote/embedding: you need to optimize this threshold specifically for pyannote/embedding (see the sketch right after this list).
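A minimal sketch of what re-instantiating the pipeline with a different clustering threshold could look like. This assumes the local config.yaml already points at pyannote/embedding; the 0.60 value is only a placeholder, and the right value has to come from tuning on annotated data.

from pyannote.audio import Pipeline

# Placeholder path to a local config.yaml with embedding: pyannote/embedding
pipeline = Pipeline.from_pretrained("path/to/local/config.yaml")

# Override the hyper-parameters baked into config.yaml.
# 0.60 is a placeholder: the clustering threshold must be re-tuned once
# the embedding model is switched to pyannote/embedding.
pipeline.instantiate({
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 15,
        "threshold": 0.60,
    },
    "segmentation": {
        "min_duration_off": 0.5817029604921046,
        "threshold": 0.4442333667381752,
    },
})

diarization = pipeline("audio.wav")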

@leohuang2013
Author

  • ... it might change in the future
    Does this mean pyannote/embedding will be optimized, or that there will be a better model than speechbrain/spkrec-ecapa-voxceleb?
  • ... you need to optimize this threshold specifically for pyannote/embedding
    Thanks for the tip. Any rough idea/direction on how to adjust its value, like a range or something? Below is a rough sweep I am thinking of trying.
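The file names here are placeholders: the config path would be my local copy pointing at pyannote/embedding, and reference.rttm would be a hand-made annotation of audio.wav. I am not sure this is the recommended way to tune the threshold, but it is what I have in mind.

from pyannote.audio import Pipeline
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# Placeholders: local config.yaml with embedding: pyannote/embedding,
# and a hand-labelled reference annotation of audio.wav in RTTM format.
pipeline = Pipeline.from_pretrained("path/to/local/config.yaml")
reference = list(load_rttm("reference.rttm").values())[0]

best = None
for threshold in [0.4, 0.5, 0.6, 0.7, 0.8]:
    pipeline.instantiate({
        "clustering": {"method": "centroid",
                       "min_cluster_size": 15,
                       "threshold": threshold},
        "segmentation": {"min_duration_off": 0.5817029604921046,
                         "threshold": 0.4442333667381752},
    })
    hypothesis = pipeline("audio.wav")
    der = DiarizationErrorRate()(reference, hypothesis)
    print(f"threshold={threshold:.2f}  DER={der:.3f}")
    if best is None or der < best[1]:
        best = (threshold, der)

print("best threshold:", best)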


stale bot commented Feb 24, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 24, 2024
@stale stale bot closed this as completed Mar 30, 2024