If you discover something that can further improve the speech quality, please contribute.
| Model | UTMOS | CER | Val loss | Params |
|--------------|--------|--------|----------|--------|
| LightSpeech | 2.3098 | 0.2594 | 0.6640 | 3.37M |
| FastSpeech2 | **2.5620** | **0.2550** | 0.6374 | 25.36M |
| Ground truth | 2.4276 | 0.2917 | **0.0** | - |

MOS is estimated with UTMOS (higher is better), and CER is computed with Whisper (lower is better). "Ground truth" refers to the reconstruction of the real mel spectrograms by the vocoder `bigvgan_v2_22khz_80band_fmax8k_256x`. For vocoding the predicted spectrograms, we use `bigvgan_base_22khz_80band` because it performs better on distorted spectrograms. See also my other [repository](https://github.com/lars76/bigvgan-mirror/). For validation, 14,415 files are used (20% of the whole dataset).
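The numbers above can be reproduced along these lines. This is a minimal sketch, assuming the `tarepan/SpeechMOS` torch.hub mirror for UTMOS plus the `openai-whisper` and `jiwer` packages; the file name and reference text are placeholders, and this repository's actual evaluation script may differ:

```python
# Minimal sketch of the evaluation metrics (assumed tooling, not this repo's script).
import torch
import torchaudio
import whisper  # pip install openai-whisper
import jiwer    # pip install jiwer

# UTMOS22 "strong" predictor from the tarepan/SpeechMOS torch.hub mirror.
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                           trust_repo=True)

wave, sr = torchaudio.load("sample.wav")  # placeholder file, mono assumed
mos = predictor(wave, sr)                 # one MOS estimate per utterance
print(f"UTMOS: {mos.item():.4f}")

# Character error rate: transcribe with Whisper, compare against the reference.
asr = whisper.load_model("base")
hypothesis = asr.transcribe("sample.wav", language="zh")["text"]
reference = "按被征农地的原有用途来确定补偿"  # placeholder reference text
print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")
```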

### Audio Samples

| **Hanzi** | **Pinyin** | **IPA** |
|------------------------------------|-----------------------------------------------------------|-----------------------------------------------------|
| 按被征农地的原有用途来确定补偿 | an4 bei4 zheng1 nong2 di4 de5 yuan2 you3 yong4 tu2 lai2 que4 ding4 bu3 chang2 | an4 pei̯4 ʈʂəŋ1 nʊŋ2 ti4 tɤ5 ɥɛn2 jou̯3 jʊŋ4 tʰu2 lai̯2 tɕʰɥe4 tiŋ4 pu3 ʈʂʰaŋ2 |


#### LightSpeech

https://github.com/user-attachments/assets/b4e8bbd1-070b-405c-9c01-a941dffb1a74

#### FastSpeech2

https://github.com/user-attachments/assets/01cb62b7-f801-4584-8a65-3de647a1cc1e

#### Ground truth

https://github.com/user-attachments/assets/09a4659c-c455-47cc-9032-611c3f0cc23d

## Prediction

After downloading a model, you can generate speech from Chinese characters, pinyin, or the International Phonetic Alphabet (IPA). Only PyTorch is required; matplotlib, librosa, and g2pw are optional.
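As an illustration of the optional g2pw dependency, here is a minimal sketch of converting Chinese characters to tone-numbered pinyin with g2pw's documented `G2PWConverter` API; how this repository wires g2pw into its own text frontend may differ:

```python
# Sketch: Hanzi -> tone-numbered pinyin with the optional g2pw package.
from g2pw import G2PWConverter

converter = G2PWConverter()  # downloads the pretrained model on first use
print(converter("按被征农地的原有用途来确定补偿"))
# e.g. [['an4', 'bei4', 'zheng1', 'nong2', 'di4', 'de5', ...]]
```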
Run `CUBLAS_WORKSPACE_CONFIG=:4096:8 python train.py` to train the network.
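For context, the environment variable is needed because cuBLAS only behaves deterministically with a fixed workspace size. A minimal sketch of the corresponding PyTorch side, assuming `train.py` enables deterministic algorithms (the actual script may do more):

```python
# Sketch: the PyTorch side of deterministic training (assumption: train.py
# enables deterministic algorithms; the actual script may do more).
# cuBLAS is only deterministic with a fixed workspace size, which must be set
# via CUBLAS_WORKSPACE_CONFIG before the CUDA context is created.
import os
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)  # errors on ops without a deterministic path
```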
- **LightSpeech**: [LightSpeech](https://arxiv.org/abs/2102.04040) has demonstrated that CNN architectures can achieve similar performance to transformers with reduced computational overhead.
- **BigVGAN Vocoder**: I use [BigVGAN](https://arxiv.org/abs/2206.04658) for better vocoding quality than HiFi-GAN.
- **Pitch estimation**: Many FastSpeech implementations use DIO + StoneMask, but these perform significantly worse than neural-network-based approaches. Here I use [PENN](https://arxiv.org/pdf/2301.12258), the current state of the art (a usage sketch follows this list).
- **Objective Metrics**: Instead of looking only at the mel spectrogram loss, we employ [UTMOS](https://arxiv.org/abs/2204.02152) for MOS estimation and [Whisper](https://arxiv.org/abs/2212.04356) for Character Error Rate (CER). The best parameters are selected based on speech quality (MOS), intelligibility (CER), and validation loss. I have found that MOS alone correlates only weakly with speech quality; [this paper](https://www.arxiv.org/abs/2407.12707) comes to the same conclusion.
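As referenced above, a minimal sketch of pitch extraction with PENN, following the call signature from its README; the hop size, pitch range, and file name are placeholder assumptions, not necessarily this repository's settings:

```python
# Sketch: pitch and periodicity extraction with PENN, following its README.
# Hop size, pitch range, and file name are placeholder assumptions.
import penn
import torchaudio

audio, sample_rate = torchaudio.load("sample.wav")

pitch, periodicity = penn.from_audio(
    audio,
    sample_rate,
    hopsize=0.01,    # seconds between pitch frames
    fmin=31.0,       # Hz, lower bound of the pitch range
    fmax=1984.0,     # Hz, upper bound of the pitch range
    batch_size=2048,
    gpu=0)           # set gpu=None for CPU inference
```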
