An implementation of the CVPR 2021 paper: Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association
- Ubuntu 16.04
- CUDA 10.2
- Python 3.7.3
- PyTorch 1.4.0
See requirement.txt.
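To compare a local setup against the versions listed above, a minimal sketch like the following may help. The function name `describe_environment` is illustrative and not part of this repository; the sketch assumes PyTorch, if installed, is importable as `torch`.

```python
import platform


def describe_environment():
    """Collect the Python, PyTorch and CUDA versions visible to this interpreter."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        import torch  # left as None below if PyTorch is not installed
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    return info


for name, version in describe_environment().items():
    print(f"{name}: {version}")
```

The printed values can then be compared by hand with the list above.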
Download VoxCeleb and VGGFace, and unzip them to ./data.
Because of file size limits, only some of the query lists are included in ./data. The other lists used in the paper can be downloaded from Google drive or Baidu drive (passwd: rfri).
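As a quick sanity check after unpacking, the sketch below lists the top-level entries under ./data. The helper name `list_data_entries` is hypothetical, and the sketch makes no assumption about which folder names the datasets or query lists use.

```python
from pathlib import Path


def list_data_entries(data_dir="./data"):
    """Return the sorted names of the top-level entries under data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} does not exist; unzip the datasets there first")
    return sorted(entry.name for entry in root.iterdir())
```

Calling `list_data_entries()` from the repository root returns the names found, or raises `FileNotFoundError` if ./data has not been created yet.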
- Download the pretrained backbone models into ./pretrained_models.
Google drive:
Baidu drive:
SE-ResNet-50 (passwd: jy55)
Thin-ResNet-34 (passwd: tc6i)
- Train the model and update identity weights:
python3 train.py config/train_reweight.yaml
- Extract identity weights from saved model file:
python3 extract_id_weight.py config/train_reweight.yaml
- Retrain the final model:
python3 train.py config/train_main.yaml
The model and log are saved in save/vox1_train/Voice2Face/main by default.
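The three training commands above can be chained so that a failure in one step stops the run. The following is a minimal sketch, not a script shipped with this repository: `STEPS` and `run_pipeline` are illustrative names, and the sketch assumes it is started from the repository root with the pretrained backbones already in place.

```python
import subprocess

# The three commands from the steps above, in the order they are listed.
STEPS = [
    ["python3", "train.py", "config/train_reweight.yaml"],
    ["python3", "extract_id_weight.py", "config/train_reweight.yaml"],
    ["python3", "train.py", "config/train_main.yaml"],
]


def run_pipeline(steps=STEPS, dry_run=False):
    """Run each command in order and return the list of commands handled.

    With dry_run=True the commands are only printed, not executed.
    """
    handled = []
    for cmd in steps:
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on top of a failed one.
            subprocess.run(cmd, check=True)
        handled.append(cmd)
    return handled
```

Calling `run_pipeline(dry_run=True)` prints the commands without starting any training, which is a cheap way to confirm the order before a long run.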
- Download the pretrained model from Google drive or Baidu drive (passwd: 4vyf).
- Modify the configuration in config/train_main.yaml: change resume_eval to the path where the model is saved.
- Run
python3 eval.py config/train_main.yaml
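The resume_eval change can also be made programmatically. The sketch below is one possible approach under stated assumptions: it assumes resume_eval appears in the YAML file as a key on its own line, and the helper name `set_resume_eval` is hypothetical. It works on the file's text with a regular expression, so it needs no YAML library, but it discards any inline comment after the old value and does not quote paths containing characters that are special in YAML.

```python
import re


def set_resume_eval(yaml_text, model_path):
    """Return yaml_text with the value of the resume_eval key set to model_path."""
    # Group 1 keeps the indentation, the key and the colon; the rest of the line is replaced.
    pattern = re.compile(r"^([ \t]*resume_eval[ \t]*:[ \t]*).*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + model_path, yaml_text)
    if count == 0:
        raise KeyError("resume_eval not found in the config text")
    return new_text
```

To apply it, read config/train_main.yaml as text, pass the text and the path of the downloaded model to `set_resume_eval`, and write the result back to the same file.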
Expected results (%):
| | 1:2 Matching (U) | 1:2 Matching (G) | Verification (U) | Verification (G) | Retrieval |
|---|---|---|---|---|---|
| Voice-to-Face | 87.2 | 77.7 | 87.2 | 77.5 | 5.5 |
| Face-to-Voice | 86.5 | 75.3 | 87.0 | 76.1 | 5.8 |
The results may differ slightly from the above because of random factors in the training process.
If this code is helpful to you, please consider citing our paper:
@inproceedings{wen2021seeking,
title={Seeking the shape of sound: An adaptive framework for learning voice-face association},
author={Wen, Peisong and Xu, Qianqian and Jiang, Yangbangyan and Yang, Zhiyong and He, Yuan and Huang, Qingming},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2021}
}