TTSDiff scan for errors #198

kalle07 · 2024-05-01T09:31:55Z

TTS Generator - TTSDiff now scans generated text and TTS for errors.

What does that mean?

And if your answer is something totally different, maybe here is an idea:

I know it may not be possible to check this, but it is becoming more important.
If I have a text answer of, say, 20 words, the generated sound file can't be longer than approx. 20 sec.
Or can you check the sound file for silence of more than 3 sec?
Or internally, I'm sure you could play the sound file back (via whisper_stt) and compare it to the written words (but then it needs a lot of time), maybe for some fine-tuning or learning?

Do you know what I mean?

erew123 · 2024-05-01T10:24:24Z

Maintainer

Hi @kalle07

What TTSDiff does is use the Whisper model to check whether the original text you input matches the generated TTS wav files. Anything that doesn't match up is flagged within the interface. The documentation on this is here: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-analyzing-generated-tts-for-errors
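To give a rough idea of the comparison step, here is a minimal sketch, not the actual TTSDiff code. It assumes you already have a transcript string (for example from running Whisper on the wav) and compares it word by word with the original text. The names `normalize` and `find_mismatches` are made up for this illustration.

```python
import difflib
import re


def normalize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_mismatches(original_text, transcript):
    """Return (original_words, transcribed_words) pairs that differ.

    `transcript` is assumed to come from a speech-to-text model such as
    Whisper; this function only does the text comparison.
    """
    a = normalize(original_text)
    b = normalize(transcript)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Words replaced, dropped, or added in the transcript.
            mismatches.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return mismatches
```

An empty list means the transcript matches the input text after normalization. Anything else is a candidate for flagging, although homophones and number formatting would likely cause false positives in a comparison this simple.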

RE: so if I have a text answer, let it be 20 words, the generated sound file can't be longer than approx. 20 sec.
If you are talking about individual generations, I'm not sure how you would set this globally to check each individual wav, as they will all be different in length.

RE: or can you check the sound file for silence of more than 3 sec.
It's potentially possible with some kind of advanced analytics, but you are unlikely to have any long silence in the audio, as that's not how the AI model generates TTS unless you purposefully instruct it to with lots of punctuation to create gaps (though these are pre-filtered out anyway).
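For reference, a basic silence check could be sketched like this with only the Python standard library. This is an assumption-heavy illustration, not something AllTalk does: it only handles 16-bit PCM wav files, and the function name, the RMS `threshold`, and the 50 ms window are arbitrary choices made for the example.

```python
import array
import math
import sys
import wave


def longest_silence_seconds(path, threshold=500, window_ms=50):
    """Return the longest run of quiet audio in a 16-bit PCM wav, in seconds.

    A window counts as silent when its RMS amplitude is below `threshold`
    (samples range from -32768 to 32767). Both defaults are guesses.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # wav data is little-endian

    # Window size in samples, counting all interleaved channels.
    window = max(1, rate * window_ms // 1000) * channels
    longest = current = 0
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms < threshold:
            current += len(chunk)
            longest = max(longest, current)
        else:
            current = 0
    return longest / (rate * channels)
```

A caller could then flag any file where `longest_silence_seconds(path) > 3.0`. Quiet but distorted or whispered audio would not be caught by an amplitude check like this.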

Does this cover what you asked?

Thanks

kalle07 May 1, 2024
Author

Hey,
RE RE: so if I have a text answer, let it be 20 words, the generated sound file can't be longer than approx. 20 sec.
RE: If you are talking about individual generations, I'm not sure how you would set this globally to check each individual wav, as they will all be different in length.

I meant that you could count the words of the output every time and estimate whether something went wrong:
audio length vs. word count.
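As a rough sketch of what I mean (only an illustration, with made-up function names): the limits of 1 to 5 words per second are assumptions. The lower one comes from my "20 words, max. approx. 20 sec" example, and the upper one is just a guess for speech that is implausibly fast or cut off.

```python
import wave


def wav_duration_seconds(path):
    """Return the playing time of a wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_looks_wrong(text, duration_s, min_wps=1.0, max_wps=5.0):
    """Flag audio whose length is implausible for the number of words.

    `min_wps` and `max_wps` (words per second) are assumed limits and
    would need tuning per voice and language.
    """
    words = len(text.split())
    if words == 0:
        # No text at all: any audio is suspicious.
        return duration_s > 0
    wps = words / duration_s if duration_s > 0 else float("inf")
    return not (min_wps <= wps <= max_wps)
```

For each generation you would call `duration_looks_wrong(text, wav_duration_seconds(path))` and mark the file for review if it returns `True`.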

Sometimes there is a bit of "silence, distortion, whispering" in the audio. Are such things hard to filter out?

And by the Whisper model I meant going one step further. OK, you get the answer as audio and send it through Whisper (as a first step, maybe via the microphone, recording it in real time). It generates the text, and you can verify whether the answer's output text is the same as the text Whisper generated from the audio.
I know that's a RAW idea, since it needs a lot of time because it runs in real time, but maybe you have some more thoughts about it ;)
