Use gruut for phonemization #523

Closed

synesthesiam
Contributor

Re-enable phoneme-based models using gruut (documentation, license).

Supported languages:

  • Czech
  • German
  • English
  • Spanish
  • Farsi/Persian
  • French
  • Italian
  • Dutch
  • Russian
  • Swedish

All supported languages have a pronunciation lexicon and a pre-trained grapheme-to-phoneme model for guessing pronunciations. English and French also have pre-trained part-of-speech taggers, which are used to resolve ambiguous pronunciations (English) and to add liaisons (French).

Mismatched phonemes

gruut and eSpeak differ slightly in the IPA they produce, so a GRUUT_PHONEME_MAP was added to text/__init__.py to "fix" phonemes so that existing pre-trained TTS models sound right. Another option is to re-train these models with gruut's IPA set, which is derived from Wikipedia's language-specific phonology pages.

Text cleaner interference

An important TODO is to re-examine the need for some of the text cleaners. gruut supports regex replacements, abbreviation expansion, and numbers/currency to words. See the English tokenizer for some examples.

So far, the biggest text cleaner problem has been found with the pre-trained French TTS model (tts_models/fr/mai/tacotron2-DDC). The default phoneme_cleaners remove hyphens (-), which gruut uses to resolve specific French pronunciations such as "est-ce" and "est-ce-que". Without hyphens, gruut strings together the phonemes for "est", "ce", and "que", which sounds wrong.
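As a rough illustration of the hyphen problem, here is a minimal sketch of a punctuation-stripping step that can optionally keep some characters. The function name `strip_punctuation`, the `keep` parameter, and the punctuation set are assumptions made for this example; this is not the actual phoneme_cleaners code.

```python
import re

# Assumed punctuation set, for illustration only; the real cleaners may differ.
PUNCTUATION = "!'(),-.:;?"


def strip_punctuation(text: str, keep: str = "") -> str:
    """Remove punctuation characters from `text`, except those listed in `keep`."""
    to_remove = "".join(c for c in PUNCTUATION if c not in keep)
    if not to_remove:
        return text
    # re.escape makes characters such as "-" safe inside a character class.
    pattern = "[" + re.escape(to_remove) + "]"
    return re.sub(pattern, "", text)


# Default behaviour: the hyphen is removed, so "est-ce" collapses into one token.
print(strip_punctuation("est-ce que"))
# Keeping the hyphen preserves the form that gruut needs to see.
print(strip_punctuation("est-ce que", keep="-"))
```

Keeping hyphens for phonemizers that rely on them is one possible fix; another is to skip the cleaner entirely and let gruut do its own text normalization.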

@@ -25,6 +26,33 @@
# Regular expression matching punctuations, ignoring empty space
PHONEME_PUNCTUATION_PATTERN = r"[" + _punctuations.replace(" ", "") + "]+"

# language -> source phoneme -> dest phoneme
Member

Why do we need this additional mapping?

Contributor Author

gruut and eSpeak don't agree completely on the phoneme inventories for each language. Some differences are cosmetic, like the use of /ɡ/ (0x261) vs. /g/ (0x67) -- both are acceptable IPA.

I made choices with some languages that may cause problems in multi-lang models. For example, my Dutch phonemes list uses /ɹ/ and /w/ when it should really be using /r/ and /ʋ/ (I know better now). These are easy to fix with a static map, and I plan to correct them in later versions of gruut.
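A static map of this kind could be applied roughly as follows. This is only a sketch: `EXAMPLE_PHONEME_MAP` and `apply_phoneme_map` are hypothetical names, and the only entries are the Dutch ones mentioned above, not the real contents of GRUUT_PHONEME_MAP.

```python
from typing import Dict, List

# Illustrative map only: language -> source phoneme -> destination phoneme.
# The Dutch entries come from the example in this comment; the actual
# GRUUT_PHONEME_MAP in the pull request may contain different entries.
EXAMPLE_PHONEME_MAP: Dict[str, Dict[str, str]] = {
    "nl": {"ɹ": "r", "w": "ʋ"},
}


def apply_phoneme_map(
    phonemes: List[str],
    lang: str,
    phoneme_map: Dict[str, Dict[str, str]] = EXAMPLE_PHONEME_MAP,
) -> List[str]:
    """Replace each phoneme that has an entry for `lang`; leave the rest unchanged."""
    lang_map = phoneme_map.get(lang, {})
    return [lang_map.get(p, p) for p in phonemes]


print(apply_phoneme_map(["w", "aː", "t", "ə", "ɹ"], "nl"))
```

Languages without an entry pass through unchanged, so the map only has to list the known mismatches.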

Other differences are more problematic, and I'm not sure exactly what to do. Consider the sentence "responsible bee city". eSpeak phonemizes it like this:

$ espeak-ng -v en-us -qx --ipa --sep=' ' 'responsible bee city'
 ɹ ᵻ s p ˈɑː n s ᵻ b əl  b ˈiː  s ˈɪ ɾ i

Each "e" sound is a little different: /ᵻ/, /iː/, or /i/

gruut is much more simplistic:

$ bin/gruut en tokenize 'responsible bee city' | bin/gruut en phonemize | jq -r .pronunciation_text
ɹ i s p ˈɑ n s ɪ b ə l b ˈi s ˈɪ t i

Now they're all just /i/ because gruut was designed to operate on phonemes rather than phones. I assumed that the realization of the phoneme (short, long, etc.) would be learned by the machine learning model (which seems to have worked at least in Larynx).
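To make the difference concrete, the two outputs above can be compared symbol by symbol. The sketch below only does set arithmetic on the whitespace-separated symbols copied from the two commands; the helper name `symbols` is made up for this example.

```python
# Outputs copied from the espeak-ng and gruut commands above.
espeak_output = "ɹ ᵻ s p ˈɑː n s ᵻ b əl  b ˈiː  s ˈɪ ɾ i"
gruut_output = "ɹ i s p ˈɑ n s ɪ b ə l b ˈi s ˈɪ t i"


def symbols(phoneme_text: str) -> set:
    """Return the set of whitespace-separated symbols in a phoneme string."""
    return set(phoneme_text.split())


# Symbols that appear in one output but not in the other.
only_espeak = symbols(espeak_output) - symbols(gruut_output)
only_gruut = symbols(gruut_output) - symbols(espeak_output)

print("only in eSpeak:", sorted(only_espeak))
print("only in gruut: ", sorted(only_gruut))
```

A model trained on one inventory has never seen the symbols that occur only in the other, which is why a mapping or re-training is needed for the existing pre-trained models.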

Some options:

  • Fix cosmetic/multi-lang phonemes with a built-in static phoneme map
  • Allow existing eSpeak-based models to include a custom phoneme map in their configs
  • Re-train available models with gruut phoneme sets


@synesthesiam : I don't understand why you say gruut is much more simplistic for the sentence responsible bee city ? The generated phonemes are defined this way in your /en-us/lexicon.txt :

bee _ b ˈi
bee's _ b ˈi z

city _ s ˈɪ t i
city's _ s ˈɪ t i z

responsibility _ ɹ i s p ˌɑ n s ɪ b ˈɪ l ə t i
responsible _ ɹ i s p ˈɑ n s ɪ b ə l
responsibly _ ɹ ɪ s p ˈɑ n s ɪ b l i

If you change the pronunciations in the English lexicon and create a new SQLite lexicon.db and a new g2p.fst model, you get the same results as espeak-ng.
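The lookup described here can be sketched in a few lines. The parser below assumes each lexicon line has the form `word <placeholder> phoneme phoneme ...`, as in the excerpt above, and ignores the second column; `parse_lexicon` and `phonemize` are hypothetical helpers, not gruut functions.

```python
from typing import Dict, List

# Excerpt copied from the lexicon lines quoted above.
LEXICON_TEXT = """\
bee _ b ˈi
bee's _ b ˈi z
city _ s ˈɪ t i
responsible _ ɹ i s p ˈɑ n s ɪ b ə l
"""


def parse_lexicon(text: str) -> Dict[str, List[str]]:
    """Parse 'word <placeholder> phonemes...' lines into a word -> phonemes dict."""
    lexicon: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The second column ("_" in the excerpt) is ignored in this sketch.
        word, _second_column, *phonemes = line.split()
        lexicon[word] = phonemes
    return lexicon


def phonemize(sentence: str, lexicon: Dict[str, List[str]]) -> str:
    """Look up each word; unknown words raise KeyError (gruut would guess via its g2p model)."""
    phonemes: List[str] = []
    for word in sentence.lower().split():
        phonemes.extend(lexicon[word])
    return " ".join(phonemes)


print(phonemize("responsible bee city", parse_lexicon(LEXICON_TEXT)))
```

Because the output is just a concatenation of lexicon entries, editing the entries is enough to change what the phonemizer produces for known words.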

Here is the result for my Luxembourgish gruut model with the Luxembourgish "The North Wind and the Sun" sentence:

mbarnig@mbarnig-MS-7B22:~/gruut$ bin/gruut lb tokenize 'An der Zäit hunn sech den Nordwand an d’Sonn gestridden, wie vun hinnen zwee wuel méi staark wier, wéi e Wanderer, deen an ee waarme Mantel agepak war, iwwert de Wee koum. ' | bin/gruut lb phonemize | jq -r .pronunciation_text

ɑ n d ɐ ts æːɪ t h u n z ə ɕ d ə n n ɔ ʀ d v ɑ n t ɑ n  g ə ʃ t ʀ i d ə n | v iə f u n h i n ə n ts w eː v uə l m ɜɪ ʃ t aː ʀ k v iː ɐ | v ɜɪ ə v ɑ n d ə ʀ ɐ | d eː n ɑ n eː v aː ʀ m ə m ɑ n t ə l  v aː ʀ | i v ɐ t d ə v eː k əʊ m ‖

Contributor Author

@mbarnig, you're correct -- this is not a technical limitation of gruut itself. It's just that the phoneme set I selected for U.S. English is smaller than what eSpeak has.

This has got me thinking of a new idea, though. Perhaps I should store pronunciations in their original (larger) phoneme set, and then allow the user to select their desired phoneme set. This would solve one problem I've had with needing to convert between various ASR phoneme sets.

Another facet of this idea that I'd like to explore more is "accented speech", where you approximate the phoneme set of one language in another. I have a toy version of this already in gruut/larynx using manually created maps between a few languages. It would be cool to see if it could be partially automated, so you could for example use an English TTS voice to speak a French sentence.

Member

@erogol erogol Jun 6, 2021

@synesthesiam I agree that the model would adjust to a simpler phoneme representation, so I am fine with using a simple /i/ in that example. Also, if we go a level deeper, we need to generate phonemes contextually, which is a really hard task. It also makes it harder for the model to correct mistakes of the G2P interface when it encounters certain edge cases. In terms of accessibility and adding new languages, I'd also prefer the simplest way possible, assuming that the model will adjust given a good TTS dataset.

What I don't understand is this: IPA representations should be universal, and if that is the case (I might be wrong), why do we need an explicit static mapping like the one below for the languages? Is it just for compatibility with your previous models, or am I missing something here?

From an ML perspective, what is important for 🐸TTS is that gruut generates the same phonemes every time for the same sounds. This is actually what should happen in G2P for any language, right? That's what I mean by saying IPA is universal. If gruut can provide this consistently, then that is enough for 🐸TTS.

Hope it makes sense.


@synesthesiam: I tested your idea with the Luxembourgish sentence
An der Zäit hunn sech den Nordwand an d’Sonn gestridden.
The correct IPA phonemization is
|ɑ|n| |d|ɐ| |ʦ|æ:ɪ|t| |h|u|n| |z|e|ɕ| |d|ə|n| |n|o|ʀ|t|v|ɑ|n|t| |ɑ|n| |d|z|o|n |g|ə|ʃ|t|ʀ|i|d|ə|n|.
The German espeak-ng phonemization for this sentence is
|a|n| |d|ɛ|ɾ| |ts|ɛː|ˈɪ|t| |h|ˈʊ|n| |z|ˈɛ|ç| |d|eː|n |n|ˈɔ|ɾ|d|v|a|n|t| |a|n| |d|ˈeː|z|ˈɔ|n| |ɡ|ə|ʃ|t|ɾ|ˈɪ|d|ə|n|.
The speech synthesized with the German model is not really intelligible.
I changed three phonemes in the Luxembourgish phoneme sequence, as follows:
|ɑ|n| |d|ɐ| |ts|aɪ|t| |h|u|n| |z|e|ç| |d|ə|n| |n|o|ʀ|t|v|ɑ|n|t| |ɑ|n| |d|z|o|n |g|ə|ʃ|t|ʀ|i|d|ə|n|.
Now Thorsten (@thorstenMueller) speaks Luxembourgish, with a German accent.
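The three substitutions can be written down as a small map and applied to the pipe-delimited notation used above. This is a sketch only: `LB_TO_DE` and `remap_pipe_string` are hypothetical names, and the map contains just the three changes visible between the two Luxembourgish sequences (ʦ to ts, æ:ɪ to aɪ, ɕ to ç).

```python
from typing import Dict

# The three substitutions visible between the two sequences above.
LB_TO_DE: Dict[str, str] = {"ʦ": "ts", "æ:ɪ": "aɪ", "ɕ": "ç"}


def remap_pipe_string(pipe_string: str, mapping: Dict[str, str]) -> str:
    """Replace mapped symbols in a '|'-delimited phoneme string, keeping the layout."""
    return "|".join(mapping.get(sym, sym) for sym in pipe_string.split("|"))


# First words of the sentence, copied from the correct phonemization above.
source = "|ɑ|n| |d|ɐ| |ʦ|æ:ɪ|t| |h|u|n| |z|e|ɕ|"
print(remap_pipe_string(source, LB_TO_DE))
```

Splitting on the delimiter rather than on characters matters here, because æ:ɪ has to be replaced as one unit.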

[Image: coqui-tts-server-lb]

@erogol, @synesthesiam: I think we have a different understanding and experience of IPA. Perhaps we should investigate in depth how to build the multilingual phonemizer to get a clear view.

Contributor Author

@erogol I agree that the most important aspect is that the same phonemes are produced in the same context.

I'm not an expert in IPA, but I would say it's "universal" in the sense that it's a formal notation for how humans vocalize. But the level of detail varies wildly, spanning from perfectly capturing how one person said a specific utterance up to high-level phonemes that all speakers of a language share. Depending on the task, you'll likely choose a different level of detail.

Compared to espeak, gruut currently has "lower resolution" phonemes. That matters when trying to use gruut with pre-trained 🐸 TTS models that were using espeak's phones/phonemes, but my question is: does 🐸 TTS need the "higher resolution" or will newly-trained models simply figure it out themselves (like they do for intonation)?

Member

@synesthesiam yep my background is also ML thus I know phonemes as much as I know them from dictionaries :)

Therefore, I believe it is better to delegate all these complexities to the model and the dataset to allow more people to train models.

So far, I see that if the dataset is good then the model works with rough phonemes and even sometimes it adjusts some phoneme errors.

In the end, for the model, it is important to see the same sequence of phoneme characters for the same sound consistently. It does not care if the phonemization is correct or not.

However, the downside is that then the model would not produce consistent results for raw phoneme inputs.

But I guess the famous 80-20 rule applies here. First we need a consistent G2P interface that is easy to use. Then we let people try it and refine the process by collecting feedback. The feedback part is also important, since different languages have different needs.

It is probably not the right place but I post it anyways since we are all here. I think the imperatives of G2P in 🐸TTS right now are

  1. Text normalization (numbers, acronyms, titles, etc.)
  2. Consistent G2P
  3. Easy to use so that people with no or little knowledge can still train models.
  4. Cover as many langs as possible.

Would you agree with these?

Contributor Author

@erogol I would definitely agree with those 4 points.

However, the downside is that then the model would not produce consistent results for raw phoneme inputs.

gruut might be able to help a bit here. If you take a look at the phonemes.txt file for U.S. English, the values after each example word are alternative forms for the phoneme. gruut is capable of splitting/normalizing a raw phoneme string, diphthongs and all.

By the way, my background is not in ML (nor even computational linguistics), so please feel free to correct me. In fact, a good deal of my ML knowledge has come from hacking on your code 🙂

Contributor Author

Now Thorsten (@thorstenMueller) speaks Luxembourgish, with a German accent.

This is great, @mbarnig! I would think this technique will work quite well for related language families. I also expect voices trained on phoneme-rich languages to be more versatile.

Knowing exactly how to alter the mismatched phonemes is the hard part (for me), so finding multi-lingual people is key!

@erogol
Member

erogol commented Jun 3, 2021

@synesthesiam can you add some tests?

We used to have a set of test cases for the prev phoneme solution.

Maybe you can pull out from the history tests/test_text_processing.py

@synesthesiam
Contributor Author

Ok, I'll wait to add tests until we resolve the phoneme map discussion above 🙂

@mbarnig

mbarnig commented Jun 5, 2021

I look with great interest at the present pull-request to combine gruut and Coqui-TTS. I would like to add my 2 cents concerning the discussion about phoneme mapping.

I suggest to separate the phoneme generation/mapping (in gruut) from the speech generation/training (in Coqui-TTS), as it is done in the interface between gruut and larynx.

In the Coqui-TTS discussion thread How to feed in phonemes ... @Mu-Y proposed to replace the lines 71 and 72 of the phoneme_to_sequence function in TTS/tts/utils/text/__init__.py with the code to_phonemes=text. I tried this modification and it works great to enter your own IPA phonemes for inference and training with Coqui-TTS.

One of my problems in creating a Luxembourgish voice is the existence of specific Luxembourgish diphthongs like æːɪ Z[äi]t or æːʊ R[au]m, with 3 symbols for one IPA phoneme, which can't be handled by espeak-ng. My lb-language-model created in gruut supports them as expected.

This is one of the reasons why I think that Coqui-TTS should fully comply with IPA in the future.

To ensure compatibility between gruut and already released Larynx and Coqui-TTS models, I suggest a temporary mapping solution, controlled with a flag --use_phoneme_map True. I think the mapping should be done outside of Coqui-TTS, for example with a separate tool, or best, inside gruut.

@erogol
Member

erogol commented Jun 6, 2021

@mbarnig

I suggest to separate the phoneme generation/mapping (in gruut) from the speech generation/training (in Coqui-TTS), as it is done in the interface between gruut and larynx.

"Separate" means providing phonemes explicitly to 🐸TTS and not dealing with G2P in 🐸TTS at all, is this right? If that is the case, how would you like to create a generic G2P interface? Or do you suggest releasing G2P with each released TTS model?

I am trying to understand the consequences of this design choice. In the end, we want to let people train models without being experts. Even I am not an expert in that part of TTS :).

In the Coqui-TTS discussion thread How to feed in phonemes ... @Mu-Y proposed to replace the lines 71 and 72 of the phoneme_to_sequence function in TTS/tts/utils/text/__init__.py with the code to_phonemes=text. I tried this modification and it works great to enter your own IPA phonemes for inference and training with Coqui-TTS.

This also suggests optionally inputting phonemes instead of graphemes to 🐸TTS endpoints right?

This is an interesting idea and I am not sure if NN models are capable of capturing the exact sound of the phonemes. My thinking is that the models learn the phoneme sequences and they go word by word. So I don't think they can process phonemes individually.

One of my problems in creating a Luxembourgish voice is the existence of specific Luxembourgish diphthongs like æːɪ Z[äi]t or æːʊ R[au]m, with 3 symbols for one IPA phoneme, which can't be handled by espeak-ng. My lb-language-model created in gruut supports them as expected.

In our current setup, it should not be a problem, since even multi-character phonemes are processed character by character. Meaning, if your phoneme is [au], then it is processed as [, a, u, ]. So the model essentially learns to correlate the correct sound with sequences of IPA characters.

To ensure compatibility between gruut and already released Larynx and Coqui-TTS models, I suggest a temporary mapping solution, controlled with a flag --use_phoneme_map True. I think the mapping should be done outside of Coqui-TTS, for example with a separate tool, or best, inside gruut.

This is again for just inference right ?

@synesthesiam
Contributor Author

For training with 🐸 TTS, you can directly generate the .npy files for each utterance to bypass the text to phoneme process. Larynx does this explicitly -- the training CSV input is literally id|P1 P2 P3... where each P is a phoneme index. Larynx models have no knowledge of their phoneme sets besides how many there are.
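As a rough sketch of the `id|P1 P2 P3...` format described above, the helpers below write and read one training line. The function names and the tiny phoneme-to-index table are assumptions made for this example; they are not taken from Larynx.

```python
from typing import Dict, List, Tuple

# Hypothetical phoneme -> index table; a real model defines its own.
PHONEME_TO_ID: Dict[str, int] = {"b": 1, "ˈi": 2, "s": 3, "ˈɪ": 4, "t": 5, "i": 6}


def to_training_line(utt_id: str, phonemes: List[str],
                     phoneme_to_id: Dict[str, int]) -> str:
    """Format one utterance as 'id|P1 P2 P3...' where each P is a phoneme index."""
    ids = [phoneme_to_id[p] for p in phonemes]
    return f"{utt_id}|{' '.join(str(i) for i in ids)}"


def parse_training_line(line: str) -> Tuple[str, List[int]]:
    """Split an 'id|P1 P2 P3...' line back into the id and the list of indexes."""
    utt_id, ids_text = line.strip().split("|", maxsplit=1)
    return utt_id, [int(i) for i in ids_text.split()]


line = to_training_line("utt_0001", ["b", "ˈi"], PHONEME_TO_ID)
print(line)
print(parse_training_line(line))
```

Since only integers reach the model, the phoneme set can be swapped without touching the training code, as long as the table is kept alongside the dataset.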

If I understand @mbarnig correctly, the suggestion is to allow direct integer array input for training/inference in 🐸 TTS. In the [au] example, it may be important that this is represented as a single symbol to the model if, for example, [a] and [u] don't both separately occur. I hypothesize that this is more important for smaller datasets (e.g. under-resourced languages), but I don't have any data to back that up.

I do think it's critical that there be a "default" text to phoneme system built in to 🐸 TTS (gruut or otherwise), since most users are going to be showing up with text.

@mbarnig

mbarnig commented Jun 6, 2021

@erogol

In the end we try to let people train models without being experts.

I appreciate the motto Freeing Speech and the goals providing open speech tech for everyone and supporting low-resource language communities of Coqui.ai. I think this is a great objective and I fully adhere to it.

I know that it's possible to train an ML-TTS-model from raw text without phoneme conversion, but the required dataset is huge and the training time is very high. To make TTS-technology available to everyone we need to work with dataset-sizes of a few hours and training-times of a few days on a Desktop-PC with a standard NVIDIA card, or on Google Colab with GPU runtime.

For these reasons we need to train the models with phonemes and to agree on rules how to transform characters (graphemes) into a set of phonemes.

If my understanding is correct we have currently two different methods to transform characters (graphemes) into a set of phonemes used by Coqui-TTS and Rhasspy-gruut/larynx.

Coqui-TTS uses a list of phonemes defined in TTS/tts/utils/text/symbols.py or in the configuration file, like

# Phonemes definition (All IPA characters)
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
_suprasegmentals = "ˈˌːˑ"
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
_diacrilics = "ɚ˞ɫ"
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics

Each phoneme symbol generated by espeak-ng is looked-up in the coqui-TTS phoneme list and converted into a phoneme-index for training or inference.

[Image: coqui-phoneme-index]

Rhasspy-gruut/larynx uses a map of phonemes for each language, with the related index as input for training and inference. Here is an example for Luxembourgish:

.... 
# specific luxembourgish monophthongs
1 ɑ k[a]pp
2 i m[i]dd
3 e m[é]ck
4 æ h[e]ll
5 o spr[o]ch
6 u g[u]tt

# specific luxembourgish diphtongs
7 iə h[ie]n
8 ɜɪ fr[éi]
9 æːɪ z[äi]t
10 ɑɪ l[ei]t
11 uə b[ue]dem
12 əʊ sch[ou]l
13 æːʊ r[au]m
14 ɑʊ [au]to
....

The phoneme æːɪ in the word Zäit is converted into three indexes [28, 110, 63] by Coqui-TTS and into one index [9] by Rhasspy-gruut/larynx. This results in lower training times and better voices for Rhasspy-gruut/larynx models.
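The difference between the two encodings can be sketched as follows. The inventory reuses the indexes from the Luxembourgish excerpt above; the character table is built arbitrarily for this example, so it will not reproduce the [28, 110, 63] values, and both function names are hypothetical.

```python
from typing import Dict, List

# Phoneme -> index, taken from the Luxembourgish excerpt above.
LB_INVENTORY: Dict[str, int] = {
    "ɑ": 1, "i": 2, "e": 3, "æ": 4, "o": 5, "u": 6,
    "iə": 7, "ɜɪ": 8, "æːɪ": 9, "ɑɪ": 10, "uə": 11, "əʊ": 12, "æːʊ": 13, "ɑʊ": 14,
}

# Arbitrary character -> index table (assumption; not the Coqui-TTS symbol list).
CHAR_TABLE: Dict[str, int] = {
    c: i for i, c in enumerate(sorted(set("".join(LB_INVENTORY))))
}


def encode_chars(text: str, char_table: Dict[str, int]) -> List[int]:
    """Character-level encoding: one index per Unicode character."""
    return [char_table[c] for c in text]


def encode_phonemes(text: str, inventory: Dict[str, int]) -> List[int]:
    """Phoneme-level encoding: greedy longest match against the inventory."""
    max_len = max(len(symbol) for symbol in inventory)
    ids: List[int] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            symbol = text[i:i + length]
            if symbol in inventory:
                ids.append(inventory[symbol])
                i += length
                break
        else:
            raise ValueError(f"Unknown symbol at position {i}: {text[i]!r}")
    return ids


print(encode_chars("æːɪ", CHAR_TABLE))       # three indexes
print(encode_phonemes("æːɪ", LB_INVENTORY))  # one index
```

With the character-level encoding, the model has to learn that the three symbols together form one sound; with the phoneme-level encoding, that grouping is given to it directly.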

Therefore @synesthesiam is wrong when he thinks that his model has "lower resolution" phonemes. My opinion is that they are "higher resolution".

"Separate" means providing phonemes explicitly to Coqui-TTS and not dealing with G2P in Coqui-TTS at all.

Yes, this is what I mean, but I don't understand what you consider as a generic G2P interface of a Coqui-TTS model. I see only 3 files in a model archive :

  • config.json
  • model_file.pth.tar
  • scale_stats.npy

I am not sure if NN models are capable of capturing the exact sound of the phonemes.

I am sure, this is what is done by Rhasspy-gruut/larynx.

Meaning, if it is your phoneme [au] then it processes as [, a, u, ]

The model does learn this, but only with great effort.

This is again for just inference right ?

My proposal for a temporary solution is just for inference to use the existing models.

I will create a post at my website to compare a few Coqui-TTS and Rhasspy-gruut/larynx voices and to provide more details about the phonemes.

@erogol
Member

erogol commented Jun 6, 2021

I totally see your point. They are all valid. And thx for taking the time.

I consider a generic G2P as an API that does conversion on the fly so it doesn't mean an additional file.

I am strongly against generating more artifacts if there is a comparable on the fly solution. More files means more problems.

That being said, I don't see any reason not to implement both options together: an easy-to-use programmatic API that abstracts all the details away, and a good external mapping API that takes a mapping in a certain format and uses it for the model. But I favor finishing the programmatic API first.

As a side note...

Yes, learning from small datasets is a problem, but I don't think learning from scratch is a technically feasible and scalable solution, given all the details involved, of which we discussed only a small fraction here.

We aim to solve the data issue first with multi-speaker models and then with multilingual models. People can then train new voices just by fine-tuning these. We believe that this also helps you skip all the preliminary complexities.

BTW I think we need to move this discussion to the discussions. Let's continue there...

@EmElleE

EmElleE commented Jun 9, 2021

@synesthesiam
Would this be ready for training? I want to test this out and pretrain on GlowTTS

synesthesiam deleted the add-gruut branch June 9, 2021 17:56
synesthesiam restored the add-gruut branch June 9, 2021 17:56
synesthesiam mentioned this pull request Jun 9, 2021
@synesthesiam
Contributor Author

Sorry, I inadvertently closed this pull request after rebasing my work on upstream/dev and failing to push a commit (not my own) that GitHub decided was under the "workflow" scope.

Long story short, re-created here: #561

@synesthesiam
Contributor Author

@EmElleE Yes, this should be ready for training! I'm starting on a Tacotron2 Dutch model based on the rdh dataset.

@EmElleE

EmElleE commented Jun 10, 2021

@EmElleE Yes, this should be ready for training! I'm starting on a Tacotron2 Dutch model based on the rdh dataset.

Great gonna try glow tts =)
