This is the official implementation of the neural network architecture presented in the paper Reshape Dimensions Network for Speaker Recognition.
Speaker Recognition NN architectures comparison (2024)
- 2024.11.13: Refactored the model code. Added the first pretrained models on the voxblink2 dataset; for more info, please refer to the evaluation page.
- 2024.07.15: Added a model builder and pretrained weights for the b0, b1, b2, b3, b5, and b6 model sizes.
We introduce Reshape Dimensions Network (ReDimNet), a novel neural network architecture for spectrogram (audio) processing, specifically for extracting utterance-level speaker representations. ReDimNet reshapes dimensionality between 2D feature maps and 1D signal representations, enabling the integration of 1D and 2D blocks within a single model. This architecture maintains the volume of channel-timestep-frequency outputs across both 1D and 2D blocks, ensuring efficient aggregation of residual feature maps. ReDimNet scales across various model sizes, from 1 to 15 million parameters and 0.5 to 20 GMACs. Our experiments show that ReDimNet achieves state-of-the-art performance in speaker recognition while reducing computational complexity and model size compared to existing systems.
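As a rough illustration of the reshaping idea, here is a minimal sketch (our own, not the repository's code), assuming a (batch, channels, frequency, time) feature-map layout:

```python
import torch

# Hypothetical illustration: a 2D feature map (B, C, F, T) and a 1D signal
# representation (B, C*F, T) contain the same number of elements, so 1D and
# 2D blocks can operate on the same tensor volume within one model.
B, C, F, T = 4, 16, 8, 100  # batch, channels, frequency bins, timesteps

x2d = torch.randn(B, C, F, T)       # input to a 2D (conv2d-style) block
x1d = x2d.reshape(B, C * F, T)      # same volume, viewed as a 1D signal
assert x1d.numel() == x2d.numel()   # channel-frequency-timestep volume preserved

x2d_back = x1d.reshape(B, C, F, T)  # reshape back for the next 2D block
assert torch.equal(x2d, x2d_back)   # the round trip is lossless
```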
Figure: ReDimNet architecture.
Requirements: PyTorch>=2.0
```python
import torch

# Load the model pretrained on vox2, without large-margin finetuning:
model = torch.hub.load('IDRnD/ReDimNet', 'ReDimNet', model_name='b2', train_type='ptn', dataset='vox2')

# Load the model pretrained on vox2, with large-margin finetuning:
model = torch.hub.load('IDRnD/ReDimNet', 'ReDimNet', model_name='b2', train_type='ft_lm', dataset='vox2')
```
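Once loaded, the model can extract utterance-level speaker embeddings. Below is a hedged usage sketch; the input contract (raw waveform, a (batch, samples) tensor at 16 kHz) and the output shape are assumptions here, so check the repository for the authoritative interface.

```python
# Assumed interface: the model takes a raw-waveform tensor and returns
# utterance-level speaker embeddings.
model.eval()
with torch.no_grad():
    waveform = torch.randn(1, 16000 * 2)  # e.g. 2 seconds of 16 kHz audio
    embedding = model(waveform)           # assumed shape: (batch, embedding_dim)
print(embedding.shape)
```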
For the full list of pretrained models, please refer to the evaluation page.
If you find our work helpful and use this code in your research, please cite:
```bibtex
@inproceedings{yakovlev24_interspeech,
  title     = {Reshape Dimensions Network for Speaker Recognition},
  author    = {Ivan Yakovlev and Rostislav Makarov and Andrei Balykin and Pavel Malov and Anton Okhotnikov and Nikita Torgashov},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {3235--3239},
  doi       = {10.21437/Interspeech.2024-2116},
}
```
For model training, we used the wespeaker pipeline. Some of the layers were ported from transformers.