Support for other languages #30

yaguangtang · 2019-07-02T05:06:39Z

Available languages

Chinese (Mandarin): #811
German: #571*
Swedish: #257*

* Requires Tensorflow 1.x (harder to set up).

Requested languages (not available yet)

Arabic: #871
Czech: #655
English: #388 (UK accent), #429 (Indian accent)
French: #854
Hindi: #525
Italian: #697
Polish: #815
Portuguese: #531
Russian: #707
Spanish: #789
Turkish: #761
Ukrainian: #492

CorentinJ · 2019-07-02T22:12:40Z

You'll need to retrain with your own datasets to get another language running (and it's a lot of work). The speaker encoder is somewhat able to work on a few other languages than English because VoxCeleb is not purely English, but since the synthesizer/vocoder have been trained purely on English data, any voice that is not in English - and even, that does not have a proper English accent - will be cloned very poorly.

yaguangtang · 2019-07-03T02:16:16Z

Thanks for explaintation, I have big interest of adding other languages support, and would like to contribute.

CorentinJ · 2019-07-03T06:32:23Z

You'll need a good dataset (at least ~300 hours, high quality and transcripts) in the language of your choice, do you have that?

tail95 · 2019-07-04T01:59:19Z

I wanna train another language. How many speakers do I need in the Encoder? or can I use the English speaker embeddings to my language?

CorentinJ · 2019-07-04T07:57:53Z

From here:

A particularity of the SV2TTS framework is that all models can be trained
separately and on distinct datasets. For the encoder, one seeks to have a model
that is robust to noise and able to capture the many characteristics of the human
voice. Therefore, a large corpus of many different speakers would be preferable to
train the encoder, without any strong requirement on the noise level of the audios.
Additionally, the encoder is trained with the GE2E loss which requires no labels other
than the speaker identity. (...) For the datasets of the synthesizer and the vocoder,
transcripts are required and the quality of the generated audio can only be as good
as that of the data. Higher quality and annotated datasets are thus required, which
often means they are smaller in size.

You'll need two datasets:

The first one should be a large dataset of untranscribed audio that can be noisy. Think thousands of speakers and thousands of hours. You can get away with a smaller one if you finetune the pretrained speaker encoder. Put maybe 1e-5 as learning rate. I'd recommend 500 speakers at the very least for finetuning. A good source for datasets of other languages is M-AILABS.

The second one needs audio transcripts and high quality audio. Here, finetuning won't be as effective as for the encoder, but you can get away with less data (300-500 hours). You will likely not have the alignments for that dataset, so you'll have to adapt the preprocessing procedure of the synthesizer to not split audio on silences. See the code and you'll understand what I mean.

Don't start training the encoder if you don't have a dataset for the synthesizer/vocoder, you won't be able to do anything then.

HumanG33k · 2019-07-05T15:50:10Z

You'll need a good dataset (at least ~300 hours, high quality and transcripts) in the language of your choice, do you have that?

Maybe it can be hacked by using audio book and they pdf2text version. The difficult come i guess from the level of expression on data sources. Maybe with some movies but sometimes subtitles are really poor. Firefox work on dataset to if i remember well

zbloss · 2019-07-17T03:28:25Z

You'll need a good dataset (at least ~300 hours, high quality and transcripts) in the language of your choice, do you have that?

Maybe it can be hacked by using audio book and they pdf2text version. The difficult come i guess from the level of expression on data sources. Maybe with some movies but sometimes subtitles are really poor. Firefox work on dataset to if i remember well

This is something that I have been slowly piecing together. I have been gathering audiobooks and their text versions that are in the public domain (Project Gutenberg & LibriVox Recordings). My goal as of now is to develop a solid package that can gather an audiofile and corresponding book, performing necessary cleaning and such.

Currently this project lives on my C:, but if there's interest in collaboration I'd gladly throw it here on GitHub.

JasonWei512 · 2019-07-19T03:06:07Z

How many speakers are needed for synthesizer/vocoder training?

CorentinJ · 2019-07-19T10:59:42Z

You'd want hundreds of speakers at least. In fact, LibriSpeech-clean makes for 460 speakers and it's still not enough.

boltomli · 2019-07-19T13:27:51Z

There's an open 12-hour Chinese female voice set from databaker that I tried with tacotron https://github.com/boltomli/tacotron/blob/zh/TRAINING_DATA.md#data-baker-data. Hope that I can gather more Chinese speakers to have a try on voice cloning. I'll update if I have some progress.

CorentinJ · 2019-07-19T13:29:25Z

That's not nearly enough to learn about the variations in speakers. Especially not on a hard language such as Chinese.

JasonWei512 · 2019-07-20T07:23:59Z

@boltomli Take a look at this dataset (1505 hours, 6408 speakers, recorded on smartphones):
https://www.datatang.com/webfront/opensource.html
Samples.zip
Not sure if the quality is good enough for encoder training.

CorentinJ · 2019-07-20T09:03:13Z

You actually want the encoder dataset not to always be of good quality, because that makes the encoder robust. It's different for the synthesizer/vocoder, because the quality is the output you will have (at best)

HumanG33k · 2019-07-24T08:03:20Z

You'd want hundreds of speakers at least. In fact, LibriSpeech-clean makes for 460 speakers and it's still not enough.

Can not be hack to by create new speakers with ai like it is done for picture ?

Liujingxiu23 · 2019-07-31T09:08:52Z

How about training the encoder/speaker_verification using English multi-speaker data-sets, but training the synthesizer using Chinese database, suppose both the data are enough for each individual model separately.

CorentinJ · 2019-08-01T11:10:09Z

You can do that, but I would then add the synthesizer dataset in the speaker encoder dataset. In SV2TTS, they use disjoint datasets between the encoder and the synthesizer, but I think it's simply to demonstrate that the speaker encoder generalizes well (the paper is presented as a transfer learning paper over a voice cloning paper after all).

There's no guarantee the speaker encoder works well on different languages than it was trained on. Considering the difficulty of generating good Chinese speech, you might want to do your best at finding really good datasets rather than hack your way around everything.

Liujingxiu23 · 2019-08-02T01:30:49Z

@CorentinJ Thank you for your reply，may be I should find some Chinese data-sets for ASR to train the speaker verification model.

magneter · 2019-08-03T16:06:47Z

@Liujingxiu23 Have you trained a Chinese model?And could you share your model about the Chinese clone results?

Liujingxiu23 · 2019-08-05T01:28:21Z

@magneter I have not trained the Chinese model, I don't have enough data to train the speaker verification model, I am trying to collect suitable data now

xw1324832579 · 2019-08-07T08:00:11Z

You'd want hundreds of speakers at least. In fact, LibriSpeech-clean makes for 460 speakers and it's still not enough.

@CorentinJ Hello, ignoring speakers out of training dataset, if I only want to assure the quality and similarity of wav synthesized with speakers in the training dataset(librispeech-clean), how much time (at least) for one speaker do I need for training, maybe 20 minutes or less?

CorentinJ · 2019-08-07T10:32:35Z

maybe 20 minutes or less?

Wouldn't that be wonderful. You'll still need a good week or so. A few hours if you use the pretrained model. Although at this point what you're doing is no longer voice cloning, so you're not really in the right repo for that.

shawwn · 2019-08-10T23:33:00Z

This is something that I have been slowly piecing together. I have been gathering audiobooks and their text versions that are in the public domain (Project Gutenberg & LibriVox Recordings). My goal as of now is to develop a solid package that can gather an audiofile and corresponding book, performing necessary cleaning and such.

Currently this project lives on my C:, but if there's interest in collaboration I'd gladly throw it here on GitHub.

@zbloss I'm very interested. Would you be able to upload your entire dataset somewhere? Or if it's difficult to upload, is there some way I could acquire it from you directly?

Thanks!

WendongGan · 2019-08-16T02:34:23Z

@CorentinJ @yaguangtang @tail95 @zbloss @HumanG33k I am finetuning the encoder model by Chhinese data of 3100 persons. I want to know how to judge whether the train of finetune is OK. In Figure0, The blue line is based on 2100 persons , the yellow line is based on 3100 persons which is trained now.
Figure0:

Figure1:(finetune 920k , from 1565k to 1610k steps, based on 2100 persons)

Figure2:(finetune 45k from 1565k to 1610k steps, based on 3100 persons)

I also what to know how mang steps is OK , in general. Because, I only know to train the synthesizer model and vocoder mode oneby one to judge the effect. But it will cost very long time. How about my EER or Loss ? Look forward your reply!

CorentinJ · 2019-08-16T09:29:39Z

If your speakers are cleanly separated in the space (like they are in the pictures), you should be good to go! I'd be interested to compare with the same plots but before any training step was made, to see how the model does on Chinese data.

carlsLobato · 2021-04-20T22:39:30Z

I will be trying to do the same with spanish. Wish me luck. Any suggestions about compute power?

did you get around to train the model. I found these datasets in spanish (and many other languages) https://commonvoice.mozilla.org/es/datasets

abelab1982 · 2021-05-02T19:59:42Z

Any progress on Spanish dataset training?
Algún avance en la formación del conjunto de datos español?

Same here! let me know if any news or any help for Spanish

carlsLobato · 2021-05-04T14:37:42Z

Any progress on Spanish dataset training?
Algún avance en la formación del conjunto de datos español?

Same here! let me know if any news or any help for Spanish

Hey, I ended up using tacotron2 implementation by NVIDIA. If you train it in spanish, it speaks spanish; so I guess it will
work just as good in any language https://github.com/NVIDIA/tacotron2

andreafiandro · 2021-05-28T06:29:01Z

Hello,
I tried to train the model for the italian languages but I still have some issues.
The steps I followed are:

Preprocessing of the dataset http://www.openslr.org/94/
Training of the synthetizer
Using the synthetizer to generate the input data for the vocoder
Train of the vocoder

After a long training (especially for the vocoder) the output generated by means of the toolbox is really poor (it can't "speak" italian).

Did I do something wrong or I missed some steps?

Thank you in advance

ghost · 2021-06-06T07:57:54Z

@andreafiandro Check the attention graphs from your synthesizer model training. You should get diagonal lines that look like this if attention has been learned. (This is required for inference to work) https://github.com/Rayhane-mamah/Tacotron-2/wiki/Spectrogram-Feature-prediction-network#tacotron-2-attention

If it does not look like that, you'll need additional training for the synthesizer, check the preprocessing for problems, and/or clean your dataset.

VitoCostanzo · 2021-06-15T11:22:27Z

Hello,
I tried to train the model for the italian languages but I still have some issues.
The steps I followed are:

Preprocessing of the dataset http://www.openslr.org/94/

Training of the synthetizer

Using the synthetizer to generate the input data for the vocoder

Train of the vocoder

After a long training (especially for the vocoder) the output generated by means of the toolbox is really poor (it can't "speak" italian).

Did I do something wrong or I missed some steps?

Thank you in advance

@andreafiandro please, can you share your file trained for italian language? (pretrained.pt of synthetizer)

andreafiandro · 2021-06-21T11:30:02Z

@andreafiandro Check the attention graphs from your synthesizer model training. You should get diagonal lines that look like this if attention has been learned. (This is required for inference to work) https://github.com/Rayhane-mamah/Tacotron-2/wiki/Spectrogram-Feature-prediction-network#tacotron-2-attention

If it does not look like that, you'll need additional training for the synthesizer, check the preprocessing for problems, and/or clean your dataset.

Thank you, I have something really different from expected diagonal line:

Probably I made some mistake in the data preprocessing or the dataset is too poor. I will try again, checking the results using the plots.

Do I need to edit some configuration file in order to the list of character of my language or I can follow the same training step described here?

@VitoCostanzo I can share the file if you want but it isn't working for the moment

ghost · 2021-06-21T21:33:19Z

Do I need to edit some configuration file in order to the list of character of my language or I can follow the same training step described here?

@andreafiandro - "Considerations - languages other than English" in #431 (comment)

tiomaldy · 2021-06-24T20:50:53Z

Hello i am trying to train the system in spanish
The first thing i need is train the encoder ,what i need to change in the code or what are the step by step for make the training someone can help me ?

selcuk-cofe · 2021-07-17T08:00:41Z

how to train for turkish ?

babysor · 2021-08-07T04:08:58Z

Thank you for sharing the zhrtvc pretrained models @windht ! It will not be as obvious in the future, so for anyone else who wants to try, the models work flawlessly with this commit: https://github.com/KuangDD/zhrtvc/tree/932d6e334c54513b949fea2923e577daf292b44e

What I like about zhrtvc:

Display alignments for synthesized spectrograms

Option to preprocess wavs for making the speaker embedding.

Auto-save generated wavs (though I prefer our solution in Export and replay generated wav #402)

Melgan is integrated but it doesn't work well with the default synthesizer model, so I ended up using Griffin-Lim most of the time for testing. WaveRNN quality is not that good either so it might be an issue on my end.

I'm trying to come up with ideas for this repo to support other languages without having to edit files.

All links to KuangDD's projects now are no longer accessible. I'm currently working on latest fork of this repo to support mandarin and if anyone want to use as reference, please be free to folk and train: https://github.com/babysor/Realtime-Voice-Clone-Chinese

ghost · 2021-10-08T23:23:51Z

The original issue has been edited to provide visibility of community-developed voice cloning models in other languages. I'll also use it to keep track of requests.

Hiraokii · 2021-10-16T16:39:18Z

From here:

A particularity of the SV2TTS framework is that all models can be trained
separately and on distinct datasets. For the encoder, one seeks to have a model
that is robust to noise and able to capture the many characteristics of the human
voice. Therefore, a large corpus of many different speakers would be preferable to
train the encoder, without any strong requirement on the noise level of the audios.
Additionally, the encoder is trained with the GE2E loss which requires no labels other
than the speaker identity. (...) For the datasets of the synthesizer and the vocoder,
transcripts are required and the quality of the generated audio can only be as good
as that of the data. Higher quality and annotated datasets are thus required, which
often means they are smaller in size.

You'll need two datasets:

The first one should be a large dataset of untranscribed audio that can be noisy. Think thousands of speakers and thousands of hours. You can get away with a smaller one if you finetune the pretrained speaker encoder. Put maybe 1e-5 as learning rate. I'd recommend 500 speakers at the very least for finetuning. A good source for datasets of other languages is M-AILABS.

The second one needs audio transcripts and high quality audio. Here, finetuning won't be as effective as for the encoder, but you can get away with less data (300-500 hours). You will likely not have the alignments for that dataset, so you'll have to adapt the preprocessing procedure of the synthesizer to not split audio on silences. See the code and you'll understand what I mean.

Don't start training the encoder if you don't have a dataset for the synthesizer/vocoder, you won't be able to do anything then.

this can be done with some audiobooks?

rphad23 · 2021-11-20T04:17:10Z

when will french be done?

ugurpekunsal · 2021-12-25T22:33:55Z

how to train for turkish ?

Have you had any luck with training Turkish?

@selcuk-cofe

neonsecret · 2022-04-15T10:51:43Z

I've made a custom fork https://github.com/neonsecret/Real-Time-Voice-Cloning-Multilang
It now supports training a bilingual en+ru model, and it's easy to add new languages based on my fork

Abdelrahman-Shahda · 2022-04-29T12:23:47Z

@CorentinJ I am planning to use your pre-trained modules to generate English audio but in my case I want my source audio to be Spanish so I should only worry about training the encoder right? And If I wanted to add emotions to the generated voice does the vocoder supports this?

neonsecret · 2022-04-29T12:35:47Z

@Abdelrahman-Shahda
no you should train only the synthetizer and edit the symbols.py file, see this #941

Abdelrahman-Shahda · 2022-04-29T12:50:21Z

@neonsecret Okay great. For the emotion part should I keep extracting the embedding each time rather than once for a single user(I don't know if this will cause the encoder embeddings to vary based on the emotions)

neonsecret · 2022-04-29T12:59:24Z

@Abdelrahman-Shahda I think you should just train as normal, if your emotional audio has exclamation signs in transcript (like "hello!" or "hello!!") you should be fine.

pauortegariera · 2023-03-17T15:07:14Z

Hi everyone , i would like to know how much training time does every module requires using GPU (approx.).

keshawnhsieh · 2023-03-28T03:25:44Z

I upload the latest pretained model on steps 183W

Where did you put your pretrained model on? Seems not see any links on your forked repo? @iwater

keshawnhsieh · 2023-03-29T09:05:25Z

@CorentinJ @yaguangtang @tail95 @zbloss @HumanG33k I am finetuning the encoder model by Chhinese data of 3100 persons. I want to know how to judge whether the train of finetune is OK. In Figure0, The blue line is based on 2100 persons , the yellow line is based on 3100 persons which is trained now. Figure0:

Figure1:(finetune 920k , from 1565k to 1610k steps, based on 2100 persons)

Figure2:(finetune 45k from 1565k to 1610k steps, based on 3100 persons)

I also what to know how mang steps is OK , in general. Because, I only know to train the synthesizer model and vocoder mode oneby one to judge the effect. But it will cost very long time. How about my EER or Loss ? Look forward your reply!

Could you plz share the Chinese encoder model with me? @UESTCgan

CorentinJ mentioned this issue Aug 11, 2019

what dataset form shoud be train my chineese modle?[data-form] #84

Closed

Saturate mentioned this issue Aug 26, 2019

How to synthesize voice in other languages?? #96

Closed

CorentinJ mentioned this issue Aug 27, 2019

what languages are supported #99

Closed

CorentinJ changed the title ~~Does it support other languages except English?~~ Support for other languages Aug 30, 2019

CorentinJ pinned this issue Aug 30, 2019

CorentinJ mentioned this issue Sep 7, 2019

Can this project support Chinese? #123

Closed

Wrongtown mentioned this issue Jul 6, 2021

Works in Spanish? #789

Closed

Bebaam mentioned this issue Sep 30, 2021

make the model work in french #854

Closed

ghost mentioned this issue Oct 11, 2021

Here is a model in Mandarin/Chinese 中文模型 #811

Closed

ghost pinned this issue Oct 11, 2021

Bebaam mentioned this issue Dec 9, 2021

Train Synthetizer in Spanish #941

Closed

Repository owner deleted a comment from ALIXGUSTAF May 9, 2022

Support for other languages #30

Support for other languages #30

Comments

yaguangtang commented Jul 2, 2019 • edited by ghost Loading

Available languages

Requested languages (not available yet)

CorentinJ commented Jul 2, 2019 • edited Loading

yaguangtang commented Jul 3, 2019

CorentinJ commented Jul 3, 2019

tail95 commented Jul 4, 2019 • edited Loading

CorentinJ commented Jul 4, 2019

HumanG33k commented Jul 5, 2019

zbloss commented Jul 17, 2019

JasonWei512 commented Jul 19, 2019 • edited Loading

CorentinJ commented Jul 19, 2019

boltomli commented Jul 19, 2019

CorentinJ commented Jul 19, 2019

JasonWei512 commented Jul 20, 2019 • edited Loading

CorentinJ commented Jul 20, 2019 • edited Loading

HumanG33k commented Jul 24, 2019

Liujingxiu23 commented Jul 31, 2019

CorentinJ commented Aug 1, 2019

Liujingxiu23 commented Aug 2, 2019

magneter commented Aug 3, 2019

Liujingxiu23 commented Aug 5, 2019

xw1324832579 commented Aug 7, 2019

CorentinJ commented Aug 7, 2019

shawwn commented Aug 10, 2019

WendongGan commented Aug 16, 2019

CorentinJ commented Aug 16, 2019

carlsLobato commented Apr 20, 2021

abelab1982 commented May 2, 2021

carlsLobato commented May 4, 2021

andreafiandro commented May 28, 2021

ghost commented Jun 6, 2021 • edited by ghost Loading

VitoCostanzo commented Jun 15, 2021

andreafiandro commented Jun 21, 2021

ghost commented Jun 21, 2021

tiomaldy commented Jun 24, 2021

selcuk-cofe commented Jul 17, 2021

babysor commented Aug 7, 2021

ghost commented Oct 8, 2021

Hiraokii commented Oct 16, 2021

rphad23 commented Nov 20, 2021

ugurpekunsal commented Dec 25, 2021 • edited Loading

neonsecret commented Apr 15, 2022

Abdelrahman-Shahda commented Apr 29, 2022

neonsecret commented Apr 29, 2022

Abdelrahman-Shahda commented Apr 29, 2022

neonsecret commented Apr 29, 2022

pauortegariera commented Mar 17, 2023

keshawnhsieh commented Mar 28, 2023

keshawnhsieh commented Mar 29, 2023

yaguangtang commented Jul 2, 2019 •

edited by ghost

Loading

CorentinJ commented Jul 2, 2019 •

edited

Loading

tail95 commented Jul 4, 2019 •

edited

Loading

JasonWei512 commented Jul 19, 2019 •

edited

Loading

JasonWei512 commented Jul 20, 2019 •

edited

Loading

CorentinJ commented Jul 20, 2019 •

edited

Loading

ghost commented Jun 6, 2021 •

edited by ghost

Loading

ugurpekunsal commented Dec 25, 2021 •

edited

Loading