SA-toolkit

SA-toolkit: Speaker speech anonymization toolkit in Python

SA-toolkit is a PyTorch-based library providing pipelines and basic building blocks for evaluating and designing speaker anonymization techniques.

Features include:

ASR training with a PyTorch Kaldi LF-MMI wrapper (for evaluation and for VC linguistic features)
VC HiFi-GAN training with on-the-fly feature caching (anonymization)
ASV training (evaluation)
WER Utility and EER/Linkability/Cllr Privacy evaluations
Clear and simplified egs directories
Unified trainer/configs
TorchScript YAAPT & TorchScript kaldi.fbank (with batch processing!)
On-the-fly-only feature extraction
100% TorchScript JIT-compatible network models

All data are formatted as Kaldi-like files: wav.scp, spk2utt, text, etc.
Kaldi is necessary for training the ASR models and for the handy run.pl/ssh.pl/data_split.. scripts, but most of the actual logic is performed in Python; you won't have to deal with Kaldi ;)
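
As a rough illustration of the Kaldi-like layout mentioned above, the sketch below writes a tiny data directory and reads it back. It is not part of satools: the helper names (read_kaldi_table, utt2spk_to_spk2utt), the utterance IDs, and the /corpus paths are hypothetical, and utt2spk is the standard Kaldi counterpart of spk2utt rather than a file named in this README. Each file is plain text with one entry per line, where the first token is the key and the rest of the line is the value.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def read_kaldi_table(path):
    """Read a Kaldi-style text table: first token is the key, the rest is the value."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        table[key] = value.strip()
    return table


def utt2spk_to_spk2utt(utt2spk):
    """Invert utt2spk (utterance -> speaker) into spk2utt (speaker -> sorted utterances)."""
    spk2utt = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utt[spk].append(utt)
    return {spk: sorted(utts) for spk, utts in sorted(spk2utt.items())}


# Hypothetical two-utterance data directory, written to a temporary folder.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "wav.scp").write_text(
    "1069-0001 /corpus/1069-0001.wav\n1069-0002 /corpus/1069-0002.wav\n"
)
(data_dir / "utt2spk").write_text("1069-0001 1069\n1069-0002 1069\n")
(data_dir / "text").write_text("1069-0001 HELLO WORLD\n1069-0002 GOOD MORNING\n")

wav_scp = read_kaldi_table(data_dir / "wav.scp")   # utterance ID -> audio path
utt2spk = read_kaldi_table(data_dir / "utt2spk")   # utterance ID -> speaker ID
text = read_kaldi_table(data_dir / "text")         # utterance ID -> transcript
spk2utt = utt2spk_to_spk2utt(utt2spk)              # speaker ID -> utterance IDs
print(spk2utt)
```

Keeping every file keyed by the same utterance IDs is what lets the different tables be joined without any extra metadata.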

Installation

The best way to install the SA-toolkit is with the install.sh script, which sets up a miniconda environment and Kaldi. Take a look at the script and adapt it to your cluster configuration, or let it do its magic.

git clone https://github.com/deep-privacy/SA-toolkit
./install.sh

Another way of installing SA-toolkit is with pip3; this sets up everything needed for inference/testing.

pip3 install 'git+https://github.com/deep-privacy/SA-toolkit.git@master#egg=satools&subdirectory=satools'

Anonymize bin

Once installed (using either method above), the anonymize binary is available in your PATH. Use it together with a config (example: here) to anonymize a Kaldi-like directory.

anonymize --config ./configs/anon_pipelines --directory ./data/XXX

Quick Torch HUB anonymization example

This locally installs satools (the required pip dependencies are: torch and torchaudio).
This version gives access to the Python/torch model for inference and testing; for training, use install.sh. You can change tag_version to any of the available model tags listed here.

import torch

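# Download the model from Torch Hub (trust_repo=True skips the interactive trust prompt)
model = torch.hub.load("deep-privacy/SA-toolkit", "anonymization", tag_version="hifigan_bn_tdnnf_wav2vec2_vq_48_v1", trust_repo=True)
# Convert a random waveform (a stand-in for real audio) to target speaker "1069"
wav_conv = model.convert(torch.rand((1, 77040)), target="1069")
# Extract ASR bottleneck (ASR-BN) features, i.e. disentangled linguistic features (best with hifigan_bn_tdnnf_wav2vec2_vq_48_v1)
asr_bn = model.get_bn(torch.rand((1, 77040)))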

VPC 2024 performances

hifigan_bn_tdnnf_600h_vq_48_v1 (VPC-B5)

---- ASV_eval^anon results ----
 dataset split gender enrollment trial     EER
   libri  test      f       anon  anon  21.146
   libri  test      m       anon  anon  21.137

---- ASR results ----
 dataset split       asr    WER
   libri   dev      anon  9.693
   libri  test      anon  9.092

hifigan_bn_tdnnf_wav2vec2_vq_48_v1 (VPC-B6)

---- ASV_eval^anon results ----
 dataset split gender enrollment trial     EER
   libri  test      f       anon  anon  33.946
   libri  test      m       anon  anon  34.729

---- ASR results ----
 dataset split       asr    WER
   libri   dev      anon  4.731
   libri  test      anon  4.369

hifigan_bn_tdnnf_wav2vec2_vq_48_v1+f0-transformation=quant_16_awgn_2 (Add F0 transformations)

---- ASR results ----
 dataset split       asr    WER
   libri  test  original  1.844
   libri  test      anon  4.814

---- ASV_eval^anon results ----
 dataset split gender enrollment trial     EER
   libri  test      f       anon  anon  42.151
   libri  test      m       anon  anon  40.755

hifigan_inception_bn_tdnnf_wav2vec2_train_600_vq_48_v1+f0-transformation=quant_16_awgn_2 (HiFi-GAN trained to match a single speaker + F0 transformations)

---- ASR results ----
 dataset split       asr    WER
   libri  test  original  1.844
   libri  test      anon  4.209

---- ASV_eval^anon results ----
 dataset split gender enrollment trial     EER
   libri  test      f       anon  anon  35.765
   libri  test      m       anon  anon  35.195

Note: The model was trained with a custom implementation of YAAPT, which yields lower speech naturalness than the original (possibly to the benefit of privacy).
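
As a reading aid for the WER columns above: word error rate is the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal, self-contained version of that definition, not the toolkit's WER utility, which may apply its own text normalization; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER in percent between two whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}%")
```

Lower is better here: a WER close to that of the original speech means the anonymization preserved the linguistic content well.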

Quick JIT anonymization example

Thanks to TorchScript, this version does not rely on any additional dependencies.

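import torch
import torchaudio

# Fetch one LibriSpeech dev-clean utterance (downloaded to /tmp on first use)
waveform, _, text_gt, speaker, chapter, utterance = torchaudio.datasets.LIBRISPEECH("/tmp", "dev-clean", download=True)[1]
# Save the original (clear) signal; LibriSpeech audio is sampled at 16 kHz
torchaudio.save(f"/tmp/clear_{speaker}-{chapter}-{utterance}.wav", waveform, 16000)

# Load the TorchScript model; replace __Exp_Path__ with your experiment directory
model = torch.jit.load("__Exp_Path__/final.jit").eval()
# Convert the utterance to target speaker "1069" and save the anonymized signal
wav_conv = model.convert(waveform, target="1069")
torchaudio.save(f"/tmp/anon_{speaker}-{chapter}-{utterance}.wav", wav_conv, 16000)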

Ensure you have the model downloaded. Check the egs/vc directory for more details.

Quick evaluation example

cd egs/anon/vctk
./local/eval.py --config configs/eval_clear  # eval privacy/utility of the signals

Ensure you have the corresponding evaluation model trained or downloaded.
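
To make the privacy metrics reported by the evaluation more concrete, here is a minimal sketch of two of them, EER and Cllr, computed from plain lists of ASV scores. It is an illustration of the standard definitions, not the code run by local/eval.py; the function names and the toy scores are assumptions. "Target" trials compare two utterances of the same speaker and "non-target" trials compare different speakers. In the anonymization setting a higher EER means the attacker's ASV system confuses speakers more often, so privacy is better.

```python
import math


def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER in percent: the operating point where the false
    acceptance rate and the false rejection rate are (closest to) equal."""
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = None, None
    # Try every observed score as a decision threshold (accept if score >= t).
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return 100.0 * eer


def cllr(target_llrs, nontarget_llrs):
    """Return the log-likelihood-ratio cost in bits, given natural-log LLRs.
    A system that always outputs LLR = 0 (no information) scores exactly 1."""
    if not target_llrs or not nontarget_llrs:
        raise ValueError("both LLR lists must be non-empty")
    target_cost = sum(math.log2(1 + math.exp(-llr)) for llr in target_llrs)
    nontarget_cost = sum(math.log2(1 + math.exp(llr)) for llr in nontarget_llrs)
    return 0.5 * (target_cost / len(target_llrs) + nontarget_cost / len(nontarget_llrs))


# Toy scores (hypothetical): one target trial scores below two non-target trials.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.5, 0.3, 0.2]
eer = equal_error_rate(targets, nontargets)
uninformative = cllr([0.0, 0.0], [0.0, 0.0])
print(f"EER: {eer:.1f}%  Cllr of an uninformative system: {uninformative:.1f}")
```

The threshold sweep is quadratic in the number of trials, which is fine for a sketch; a real evaluation over many trials would sort the scores once instead.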

Model training

Check out the READMEs of egs/asr/librispeech, egs/asv/voxceleb, and egs/vc/libritts.

Citation

This library is the result of Pierre Champion's thesis work.
If you find this library useful in academic research, please cite:

@phdthesis{champion2023,
    title={Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques},
    author={Pierre Champion},
    year={2023},
    school={Université de Lorraine - INRIA Nancy},
    type={Thesis},
}

(Also consider starring the project on GitHub.)

Acknowledgements

Idiap's pkwrap
Jik876's HiFi-GAN
A. Larcher's Sidekit
Organizers of the VoicePrivacy Challenge

License

Most of the software is distributed under Apache 2.0 License (http://www.apache.org/licenses/LICENSE-2.0); the parts distributed under other licenses are indicated by a LICENSE file in related directories.

Evaluation choices

As outlined in the thesis, selecting the appropriate target identities for voice conversion is crucial for privacy evaluation. We strongly encourage the use of any-to-one voice conversion, as it provides the strongest guarantee of unlinkable speech generation and facilitates proper training of a white-box ASV evaluation model. This approach is also easy to understand (everyone should sound like a single identity) and allows the target identity to be represented with a one-hot encoding, which is simpler than x-vectors while still highly effective for utility preservation.
Furthermore, the thesis identifies a limitation in the current utility evaluation process. We believe that the best way to assess utility properly is subjective listening, which allows accurate evaluation of any mispronunciations produced by the VC system.
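
The one-hot target representation mentioned above can be sketched in a few lines. This is only an illustration of the idea, not how the toolkit's models encode targets internally: the speaker list is hypothetical (only "1069" appears in the examples of this README), and the helper name is made up. The point of the any-to-one setup is that every source utterance is paired with the same target vector.

```python
def one_hot(target, speakers):
    """Return a one-hot vector selecting `target` among the known `speakers`."""
    if target not in speakers:
        raise ValueError(f"unknown target speaker: {target!r}")
    return [1.0 if spk == target else 0.0 for spk in speakers]


# Hypothetical set of target identities known to the voice conversion model.
speakers = ["1069", "1098", "1116"]

# Any-to-one: every source utterance, whoever spoke it, gets the same target.
source_utterances = ["spkA-utt1", "spkB-utt1", "spkC-utt1"]
targets = {utt: one_hot("1069", speakers) for utt in source_utterances}
print(targets)
```

Compared with an x-vector, such a vector carries no information beyond the identity index, which is part of why it is simple to reason about.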
