If you discover something that can further improve the speech quality, please contribute.
| Model | UTMOS | CER | Val loss | Params |
|--------------|--------|--------|----------|--------|
| LightSpeech | 2.3098 | 0.2594 | 0.6640 | 3.37M |
| FastSpeech2 | **2.5620** | **0.2550** | 0.6374 | 25.36M |
| Ground truth | 2.4276 | 0.2917 | **0.0** | - |

MOS is estimated with UTMOS (higher is better), and CER is computed with Whisper (lower is better). "Ground truth" refers to the reconstruction of the real mel spectrograms by the vocoder `bigvgan_v2_22khz_80band_fmax8k_256x`. For vocoding the predicted spectrograms, we use `bigvgan_base_22khz_80band` because it performs better on distorted spectrograms. See also my other [repository](https://github.com/lars76/bigvgan-mirror/). For validation, 14,415 files are used (20% of the whole dataset).
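The numbers above can be reproduced along these lines. This is a minimal sketch, assuming the `tarepan/SpeechMOS` torch.hub mirror for UTMOS plus the `openai-whisper` and `jiwer` packages; the file name and reference text are placeholders, and this repository's actual evaluation script may differ:

```python
# Minimal sketch of the evaluation metrics (assumed tooling, not this repo's script).
import torch
import torchaudio
import whisper  # pip install openai-whisper
import jiwer    # pip install jiwer

# UTMOS22 "strong" predictor from the tarepan/SpeechMOS torch.hub mirror.
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                           trust_repo=True)

wave, sr = torchaudio.load("sample.wav")  # placeholder file, mono assumed
mos = predictor(wave, sr)                 # one MOS estimate per utterance
print(f"UTMOS: {mos.item():.4f}")

# Character error rate: transcribe with Whisper, compare against the reference.
asr = whisper.load_model("base")
hypothesis = asr.transcribe("sample.wav", language="zh")["text"]
reference = "按被征农地的原有用途来确定补偿"  # placeholder reference text
print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")
```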

### Audio Samples

| **Hanzi** | **Pinyin** | **IPA** |
|------------------------------------|-----------------------------------------------------------|-----------------------------------------------------|
| 按被征农地的原有用途来确定补偿 | an4 bei4 zheng1 nong2 di4 de5 yuan2 you3 yong4 tu2 lai2 que4 ding4 bu3 chang2 | an4 pei̯4 ʈʂəŋ1 nʊŋ2 ti4 tɤ5 ɥɛn2 jou̯3 jʊŋ4 tʰu2 lai̯2 tɕʰɥe4 tiŋ4 pu3 ʈʂʰaŋ2 |


#### LightSpeech

https://github.com/user-attachments/assets/b4e8bbd1-070b-405c-9c01-a941dffb1a74

#### FastSpeech2

https://github.com/user-attachments/assets/01cb62b7-f801-4584-8a65-3de647a1cc1e

#### Ground truth

https://github.com/user-attachments/assets/09a4659c-c455-47cc-9032-611c3f0cc23d

## Prediction

After downloading a model, you can generate speech from Chinese characters, pinyin, or the International Phonetic Alphabet (IPA). Only PyTorch is required; matplotlib, librosa, and g2pw are optional.
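As an illustration of the optional g2pw dependency, here is a minimal sketch of converting Chinese characters to tone-numbered pinyin with g2pw's documented `G2PWConverter` API; how this repository wires g2pw into its own text frontend may differ:

```python
# Sketch: Hanzi -> tone-numbered pinyin with the optional g2pw package.
from g2pw import G2PWConverter

converter = G2PWConverter()  # downloads the pretrained model on first use
print(converter("按被征农地的原有用途来确定补偿"))
# e.g. [['an4', 'bei4', 'zheng1', 'nong2', 'di4', 'de5', ...]]
```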
Run `CUBLAS_WORKSPACE_CONFIG=:4096:8 python train.py` to train the network.
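For context, the environment variable is needed because cuBLAS only behaves deterministically with a fixed workspace size. A minimal sketch of the corresponding PyTorch side, assuming `train.py` enables deterministic algorithms (the actual script may do more):

```python
# Sketch: the PyTorch side of deterministic training (assumption: train.py
# enables deterministic algorithms; the actual script may do more).
# cuBLAS is only deterministic with a fixed workspace size, which must be set
# via CUBLAS_WORKSPACE_CONFIG before the CUDA context is created.
import os
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)  # errors on ops without a deterministic path
```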
- **LightSpeech**: [LightSpeech](https://arxiv.org/abs/2102.04040) has demonstrated that CNN architectures can achieve similar performance to transformers with reduced computational overhead.
- **BigVGAN Vocoder**: I use [BigVGAN](https://arxiv.org/abs/2206.04658) for better vocoding quality than HiFi-GAN.
- **Pitch estimation**: Many FastSpeech implementations use DIO + StoneMask, but these perform significantly worse than neural-network-based approaches. Here I use [PENN](https://arxiv.org/pdf/2301.12258), the current state of the art (a usage sketch follows this list).
- **Objective Metrics**: Instead of looking only at the mel spectrogram loss, we employ [UTMOS](https://arxiv.org/abs/2204.02152) for MOS estimation and [Whisper](https://arxiv.org/abs/2212.04356) for Character Error Rate (CER). The best parameters are selected based on speech quality (MOS), intelligibility (CER), and validation loss. I have found that MOS alone correlates only weakly with speech quality; [this paper](https://www.arxiv.org/abs/2407.12707) comes to the same conclusion.
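As referenced above, a minimal sketch of pitch extraction with PENN, following the call signature from its README; the hop size, pitch range, and file name are placeholder assumptions, not necessarily this repository's settings:

```python
# Sketch: pitch and periodicity extraction with PENN, following its README.
# Hop size, pitch range, and file name are placeholder assumptions.
import penn
import torchaudio

audio, sample_rate = torchaudio.load("sample.wav")

pitch, periodicity = penn.from_audio(
    audio,
    sample_rate,
    hopsize=0.01,    # seconds between pitch frames
    fmin=31.0,       # Hz, lower bound of the pitch range
    fmax=1984.0,     # Hz, upper bound of the pitch range
    batch_size=2048,
    gpu=0)           # set gpu=None for CPU inference
```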
