From 43f2d8ac5203f5058ff0a8b016c385256d637452 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:10:40 +0200
Subject: [PATCH 1/8] doc: update changelog

---
 CHANGELOG.md | 43 ++++++++++++++++++-------------------------
 1 file changed, 18 insertions(+), 25 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0da78ec6c..897b318ab 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,24 +1,30 @@
 # Changelog
 
-## Version 3.0 (xxxx-xx-xx)
+## Version 3.0.0 (2023-xx-xx)
 
-### Highlights
+### Features and improvements
 
-- *"Harder"*. Fixed [major reproducibility issue](https://github.com/pyannote/pyannote-audio/issues/1370) with Ampere (A100) NVIDIA GPUs
-  In case you tried `pyannote.audio` pretrained pipelines in the past on Ampere (A100) NVIDIA GPUs
-  and were disappointed by the accuracy, please give it another try with this new version.
-- "Better".
-- "Faster".
-- "Stronger".
+ - feat(pipeline): send pipeline to device with `pipeline.to(device)`
+ - feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
+ - feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
+ - feat(pipeline): add progress hook to pipelines
+ - feat(task): add [powerset](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html) support to `SpeakerDiarization` task
+ - feat(task): add support for multi-task models
+ - feat(task): add support for label scope in speaker diarization task
+ - feat(task): add support for missing classes in multi-label segmentation task
+ - feat(model): add segmentation model based on torchaudio self-supervised representation
+ - feat(pipeline): check version compatibility at load time
+ - improve(task): load metadata as tensors rather than pyannote.core instances
+ - improve(task): improve error message on missing specifications
 
 ### Breaking changes
 
  - BREAKING(task): rename `Segmentation` task to `SpeakerDiarization`
- - BREAKING(task): remove support for variable chunk duration for segmentation tasks
  - BREAKING(pipeline): pipeline defaults to CPU (use `pipeline.to(device)`)
  - BREAKING(pipeline): remove `SpeakerSegmentation` pipeline (use `SpeakerDiarization` pipeline)
- - BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
  - BREAKING(pipeline): remove `segmentation_duration` parameter from `SpeakerDiarization` pipeline (defaults to `duration` of segmentation model)
+ - BREAKING(task): remove support for variable chunk duration for segmentation tasks
+ - BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
 - BREAKING(setup): drop support for Python 3.7
 - BREAKING(io): channels are now 0-indexed (used to be 1-indexed)
 - BREAKING(io): multi-channel audio is no longer downmixed to mono by default.
@@ -29,21 +35,8 @@
  - BREAKING(model): get rid of (flaky) `Model.introspection`
    If, for some weird reason, you wrote some custom code based on that,
    you should instead rely on `Model.example_output`.
+ - BREAKING(interactive): remove support for Prodigy recipes
 
-### Features and improvements
-
- - feat(task): add [powerset](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html) support to `SpeakerDiarization` task
- - feat(task): add support for multi-task models
- - feat(task): add support for label scope in speaker diarization task
- - feat(task): add support for missing classes in multi-label segmentation task
- - feat(model): add segmentation model based on torchaudio self-supervised representation
- - feat(pipeline): send pipeline to device with `pipeline.to(device)`
- - feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
- - feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
- - feat(pipeline): add progress hook to pipelines
- - feat(pipeline): check version compatibility at load time
- - improve(task): load metadata as tensors rather than pyannote.core instances
- - improve(task): improve error message on missing specifications
 
 ### Fixes and improvements
 
@@ -54,7 +47,7 @@
 - fix(task): fix support for "balance" option
 - improve(task): shorten and improve structure of Tensorboard tags
 
-### Dependencies
+### Dependencies update
 
 - setup: switch to torch 2.0+, torchaudio 2.0+, soundfile 0.12+, lightning 2.0+, torchmetrics 0.11+
 - setup: switch to pyannote.core 5.0+, pyannote.database 5.0+, and pyannote.pipeline 3.0+

From 6e3edcfc2b76ac9e70bcacb52d719d2619a6e87c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:11:08 +0200
Subject: [PATCH 2/8] doc: update README

---
 README.md | 99 ++++++++++++++++++++++++++-----------------------------
 1 file changed, 47 insertions(+), 52 deletions(-)

diff --git a/README.md b/README.md
index 1d314cd92..330cfa545 100644
--- a/README.md
+++ b/README.md
@@ -1,30 +1,39 @@
-> [!IMPORTANT]
-> I propose (paid) scientific [consulting services](https://herve.niderb.fr/consulting.html) to companies willing to make the most of their data and open-source speech processing toolkits (and `pyannote` in particular).
+Using the `pyannote.audio` open-source toolkit in production?
+Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
 
-# Speaker diarization with `pyannote.audio`
-`pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on [PyTorch](pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
+# `pyannote.audio` speaker diarization toolkit [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb)
+
+
+`pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](https://pytorch.org) machine learning framework, it comes with state-of-the-art [pretrained models and pipelines](https://hf.co/pyannote) that can be further fine-tuned to your own data for even better performance.

-## TL;DR [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb)
+## TL;DR
+
+1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
+2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
+3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
+4. Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
 
 ```python
-# 1. visit hf.co/pyannote/speaker-diarization and hf.co/pyannote/segmentation and accept user conditions (only if requested)
-# 2. visit hf.co/settings/tokens to create an access token (only if you had to go through 1.)
-# 3. instantiate pretrained speaker diarization pipeline
 from pyannote.audio import Pipeline
-pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
-                                    use_auth_token="ACCESS_TOKEN_GOES_HERE")
+pipeline = Pipeline.from_pretrained(
+    "pyannote/speaker-diarization-3.0",
+    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
 
-# 4. apply pretrained pipeline
+# send pipeline to GPU (when available)
+import torch
+pipeline.to(torch.device("cuda"))
+
+# apply pretrained pipeline
 diarization = pipeline("audio.wav")
 
-# 5. print the result
+# print the result
 for turn, _, speaker in diarization.itertracks(yield_label=True):
     print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
 # start=0.2s stop=1.5s speaker_0
@@ -39,16 +48,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
 - :exploding_head: state-of-the-art performance (see [Benchmark](#benchmark))
 - :snake: Python-first API
 - :zap: multi-GPU training with [pytorch-lightning](https://pytorchlightning.ai/)
 - :control_knobs: data augmentation with [torch-audiomentations](https://github.com/asteroid-team/torch-audiomentations)
-
-## Installation
-
-Only Python 3.8+ is supported.
-
-```bash
-# install from develop branch
-pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
-```
 
 ## Documentation
 
@@ -72,54 +72,49 @@
 - 2022-12-02 > ["How I reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022 speaker diarization challenges"](tutorials/adapting_pretrained_pipeline.ipynb)
 - 2022-10-23 > ["One speaker segmentation model to rule them all"](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)
 - 2021-08-05 > ["Streaming voice activity detection with pyannote.audio"](https://herve.niderb.fr/fastpages/2021/08/05/Streaming-voice-activity-detection-with-pyannote.html)
-- Miscellaneous
-  - [Training with `pyannote-audio-train` command line tool](tutorials/training_with_cli.md)
-  - [Annotating your own data with Prodigy](tutorials/prodigy.md)
-  - [Speaker verification](tutorials/speaker_verification.ipynb)
-  - Visualization and debugging
+- Videos
+  - [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
+  - [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
+  - [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min
 
 ## Benchmark
 
-Out of the box, `pyannote.audio` default speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization) is expected to be much better (and faster) in v2.x than in v1.1. Those numbers are diarization error rates (in %)
+Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.0) v3.0 is expected to be much better (and faster) than v2.x. Those numbers are diarization error rates (in %):
 
-| Dataset \ Version      | v1.1 | v2.0 | v2.1.1 (finetuned) |
-| ---------------------- | ---- | ---- | ------------------ |
-| AISHELL-4              | -    | 14.6 | 14.1 (14.5)        |
-| AliMeeting (channel 1) | -    | -    | 27.4 (23.8)        |
-| AMI (IHM)              | 29.7 | 18.2 | 18.9 (18.5)        |
-| AMI (SDM)              | -    | 29.0 | 27.1 (22.2)        |
-| CALLHOME (part2)       | -    | 30.2 | 32.4 (29.3)        |
-| DIHARD 3 (full)        | 29.2 | 21.0 | 26.9 (21.9)        |
-| VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2 (10.7)        |
-| REPERE (phase2)        | -    | 12.6 | 8.2 ( 8.3)         |
-| This American Life     | -    | -    | 20.8 (15.2)        |
+| Dataset \ Version      | v1.1 | v2.0 | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.0](https://hf.co/pyannote/speaker-diarization-3.0) | Premium |
+| ---------------------- | ---- | ---- | ------ | ------ | --------- |
+| AISHELL-4              | -    | 14.6 | 14.1   | 12.3   | 12.3      |
+| AliMeeting (channel 1) | -    | -    | 27.4   | 24.3   | 19.4      |
+| AMI (IHM)              | 29.7 | 18.2 | 18.9   | 19.0   | 16.7      |
+| AMI (SDM)              | -    | 29.0 | 27.1   | 22.2   | 20.1      |
+| AVA-AVD                | -    | -    | -      | 49.1   | 42.7      |
+| DIHARD 3 (full)        | 29.2 | 21.0 | 26.9   | 21.7   | 17.0      |
+| MSDWild                | -    | -    | -      | 24.6   | 20.4      |
+| REPERE (phase2)        | -    | 12.6 | 8.2    | 7.8    | 7.8       |
+| VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2   | 11.3   | 9.5       |
 
 ## Citations
 
 If you use `pyannote.audio` please use the following citations:
 
 ```bibtex
-@inproceedings{Bredin2020,
-  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
-  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
-  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
-  Year = {2020},
+@inproceedings{Plaquet23,
+  author={Alexis Plaquet and Hervé Bredin},
+  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+  year=2023,
+  booktitle={Proc. INTERSPEECH 2023},
 }
 ```
 
 ```bibtex
-@inproceedings{Bredin2021,
-  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
-  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
-  Booktitle = {Proc. Interspeech 2021},
-  Year = {2021},
+@inproceedings{Bredin23,
+  author={Hervé Bredin},
+  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+  year=2023,
+  booktitle={Proc. INTERSPEECH 2023},
 }
 ```
 
-## Support
-
-For commercial enquiries and scientific consulting, please contact [me](mailto:herve@niderb.fr).
-
 ## Development
 
 The commands below will setup pre-commit hooks and packages needed for developing the `pyannote.audio` library.
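The README above shows the basic recipe, but two of the PATCH 1/8 changelog entries (the pipeline progress hook and the mutable `segmentation_batch_size` and `embedding_batch_size`) are not demonstrated anywhere in this series. Below is a minimal sketch of how they are expected to combine with the TL;DR recipe; the `ProgressHook` import path and the attribute-style batch size assignment are assumptions inferred from the changelog entries rather than anything documented by these patches.

```python
# Minimal sketch only: combines features named in the PATCH 1/8 changelog.
# The ProgressHook import path below is an assumption, not documented here.
import torch
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# feat(pipeline): send pipeline to device with `pipeline.to(device)`
pipeline.to(torch.device("cuda"))

# feat(pipeline): batch sizes are now mutable (they default to 1)
pipeline.segmentation_batch_size = 32
pipeline.embedding_batch_size = 32

# feat(pipeline): progress hook reports the advancement of each step
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
```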
From ebd278e6eef25c5732424a0eab143cb8e4f814f9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:11:36 +0200
Subject: [PATCH 3/8] ci: bump version

---
 version.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/version.txt b/version.txt
index 3e3c2f1e5..4a36342fc 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-2.1.1
+3.0.0

From 67a72beb6623a727b594ed41881da839c21ca8d4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:17:24 +0200
Subject: [PATCH 4/8] setup: add missing onnxruntime requirement

---
 requirements.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/requirements.txt b/requirements.txt
index 7e71fe024..888ada78a 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,6 +3,7 @@ einops >=0.6.0
 huggingface_hub >= 0.13.0
 lightning >= 2.0.1
 omegaconf >=2.1,<3.0
+onnxruntime >= 1.16.0
 pyannote.core >= 5.0.0
 pyannote.database >= 5.0.1
 pyannote.metrics >= 3.2

From 694206d2859c612917fb702a8f40f78f31e3863d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:29:33 +0200
Subject: [PATCH 5/8] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 330cfa545..6d284af45 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-Using the `pyannote.audio` open-source toolkit in production? 
+Using the `pyannote.audio` open-source toolkit in production?
 Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).

From bd839b206dbc4701419814cca9b01053a82bb523 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:30:34 +0200
Subject: [PATCH 6/8] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6d284af45..55d714e36 100644
--- a/README.md
+++ b/README.md
@@ -79,7 +79,8 @@
 
 ## Benchmark
 
-Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.0) v3.0 is expected to be much better (and faster) than v2.x. Those numbers are diarization error rates (in %):
+Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.0) v3.0 is expected to be much better (and faster) than v2.x.
+Those numbers are diarization error rates (in %):
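One practical consequence of PATCH 4/8 above: `onnxruntime >= 1.16.0` becomes a hard requirement, so an environment upgraded in place may still carry stale pins. A quick check along the lines below can catch this early; it is an illustrative snippet (not part of the series) that relies only on the standard library and the `packaging` package.

```python
# Illustrative environment check against the pins touched in this series.
from importlib.metadata import PackageNotFoundError, version
from packaging.specifiers import SpecifierSet

pins = {
    "onnxruntime": ">=1.16.0",  # added by PATCH 4/8
    "lightning": ">=2.0.1",
    "pyannote.core": ">=5.0.0",
}
for package, spec in pins.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (required {spec})")
        continue
    status = "ok" if SpecifierSet(spec).contains(installed) else "too old"
    print(f"{package}: {installed} ({status}, required {spec})")
```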
From 36d204ad49c93e52510de7e55306711d83d6086b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:41:43 +0200
Subject: [PATCH 7/8] Update README.md

---
 README.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 55d714e36..13418df8d 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,7 @@
 Using the `pyannote.audio` open-source toolkit in production?
 Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
 
-
-# `pyannote.audio` speaker diarization toolkit [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb)
-
+# `pyannote.audio` speaker diarization toolkit
 
 `pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](https://pytorch.org) machine learning framework, it comes with state-of-the-art [pretrained models and pipelines](https://hf.co/pyannote) that can be further fine-tuned to your own data for even better performance.

From 9f53d847c46f196ef319617f72330e943545bab9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20BREDIN?=
Date: Tue, 26 Sep 2023 11:49:58 +0200
Subject: [PATCH 8/8] Update CHANGELOG.md

---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 897b318ab..c444380d7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,6 @@
 # Changelog
 
-## Version 3.0.0 (2023-xx-xx)
+## Version 3.0.0 (2023-09-26)
 
 ### Features and improvements
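Finally, the `return_embeddings` option added to the `SpeakerDiarization` pipeline in PATCH 1/8 is never illustrated in the patched documentation. A minimal sketch of the expected usage is given below; the `(diarization, embeddings)` return value, with one embedding per speaker label, is an assumption based on the changelog entry and is not confirmed anywhere in this series.

```python
# Minimal sketch of the `return_embeddings` option from the PATCH 1/8 changelog.
# The (diarization, embeddings) return value is an assumption, not documented here.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

diarization, embeddings = pipeline("audio.wav", return_embeddings=True)
for s, speaker in enumerate(diarization.labels()):
    # embeddings[s] would be the embedding of `speaker`
    print(speaker, embeddings[s].shape)
```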