From monotone march to expressive symphony, AI-powered voices whisper possibilities: audiobooks with the author's soul, stories narrated in forgotten tongues, and connections beyond the veil. This project covers everything you need to get started with Text-to-Speech AI, exploring its technical underpinnings, its recent advancements, and its diverse applications across various industries. We will examine the ethical considerations surrounding voice cloning and the future potential of this technology to reshape how we interact with information and create content.
In the heart of the digital soundscape lies the fascinating technology of Text-to-Voice AI, where written words seamlessly transform into spoken expressions. While the final output may sound seamless, there's an intricate interplay of components working behind the scenes. Let's break down this technological symphony:
- Text Pre-Processing
- Text to Phoneme Conversion
- Prosody Prediction
- Speech Synthesis
- Post Processing
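To make the first two stages above a little more concrete, here is a minimal sketch in Python. It is a toy illustration only, not how any particular TTS system works: the lookup tables, the function names (`preprocess`, `to_phonemes`, `front_end`) and the letter-by-letter fallback are all assumptions made for this example. Real systems use large pronunciation lexicons or learned models, and the later stages (prosody prediction, speech synthesis, post-processing) are left out.

```python
import re

# Toy lookup tables (assumptions for illustration; real lexicons are far larger).
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
PHONEMES = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def preprocess(text: str) -> str:
    """Stage 1: lowercase, expand known abbreviations, spell out digits."""
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)
        # Replace each digit with its word, e.g. "42" -> "four two".
        token = re.sub(r"[0-9]", lambda m: f" {DIGITS[int(m.group())]} ", token)
        words.append(token)
    # Collapse the extra spaces introduced by the digit expansion.
    return " ".join(" ".join(words).split())


def to_phonemes(word: str) -> list[str]:
    """Stage 2: look the word up; fall back to spelling it letter by letter."""
    return PHONEMES.get(word, [ch.upper() for ch in word if ch.isalpha()])


def front_end(text: str) -> list[list[str]]:
    """Run both stages and return one phoneme list per word."""
    return [to_phonemes(word) for word in preprocess(text).split()]


if __name__ == "__main__":
    print(preprocess("Dr. Smith lives at 42 Baker St."))
    print(front_end("Hello world"))
```

The output of a front end like this is what the later stages would consume: prosody prediction decides pitch, timing and stress for the phoneme sequence, and the synthesizer turns the result into a waveform.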
For centuries, the quest to make machines speak sounded like robots stuck on repeat. From bellows and reeds to early digital squawks, text-to-speech was more sci-fi nightmare than technological marvel. But then came the AI symphony. Deep learning algorithms, trained on vast libraries of human voices, now generate speech so nuanced and expressive it rivals the spoken word. This newfound eloquence unlocks a treasure trove of possibilities: from empowering the visually impaired to narrating audiobooks with the author's touch, AI voices are shaping how we consume, create, and even grieve. As ethical frameworks guide its development, this technological symphony promises to reshape communication, amplify diverse voices, and weave a richer tapestry of human connection.
TTS technology, as discussed earlier, is not new. However, with advances in AI, the generated output has become much more natural, blurring the line between actual speech and generated speech.
There are countless tools for trying out Text-to-Speech, both open-source and commercial. Here are some of the most widely used (Bark and HierSpeech++ are open source; PlayHT and ElevenLabs are commercial services):
- Bark: Text-Prompted Generative Audio Model
- PlayHT: AI Voice Generator
- HierSpeech++: The official implementation of HierSpeech++
- ElevenLabs: Text to Speech & AI Voice Generator
Let's start with the easiest way to use voice cloning and TTS: PlayHT.
Visit PlayHT and create a free account. The service allows you to clone a single voice for free and generate speech from text.
PlayHT allows you to generate voices from the existing voices or clone a new voice. To use the existing voices, click on the name of the voice above the text input, and you can search and select any voice you like. They have amazing voices that you can try out to narrate blocks of text you provide.
The real fun is using your own voice or a voice you want to clone. The tool allows you to do just that. Click on "Voice Cloning" and follow the simple steps provided.
Click on "Instant" to create a clone from a "30 Sec" audio recording.
Then click on "Create New Model" and select the "PlayHT 2.0" model. Now when you click the name of the voice as before you will be able to select your newly cloned voice.
Then add your text and click "Generate Speech" or hit the Play button.
Bark is Suno's text-to-audio model that's capable of generating highly realistic speech from text. Bark goes beyond the basics, effortlessly generating natural-sounding, multilingual speech. But it doesn't stop there – it can create all sorts of audio, from music and background noise to simple sound effects. Bark even adds a human touch with nonverbal cues like laughter, sighs, and crying.
To get started, click the link to visit the Google Colab notebook.
The interface is pretty straightforward: hit the play button beside each cell (the greyed areas that contain code). You can try out various voices and languages. For a list of supported voices, check out Bark's voice prompt library.
An interesting feature of Bark is its ability to incorporate non-speech sounds such as laughter, sighs, and music (although music is not great currently). Some known markers you can put in the text prompt:
- [laughter]
- [laughs]
- [sighs]
- [music]
- [gasps]
- [clears throat]
- — or ... for hesitations
- ♪ for song lyrics
- CAPITALIZATION for emphasis of a word
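If you would rather run Bark from a script than from the notebook, the markers above can be used like this. This is a minimal sketch that assumes the `bark` package and `scipy` are installed. The `emphasize` and `speak` helpers are hypothetical names made up for this example, and the `v2/en_speaker_6` voice preset and output file name are just example choices. The Bark imports sit inside `speak` so the prompt helper can be tried without downloading the models.

```python
def emphasize(text: str, words: list[str]) -> str:
    """Upper-case the chosen words, since Bark reads capitalization as emphasis."""
    targets = {w.lower() for w in words}
    return " ".join(
        w.upper() if w.strip(".,!?").lower() in targets else w
        for w in text.split()
    )


def speak(prompt: str, out_path: str = "bark_generation.wav",
          voice: str = "v2/en_speaker_6") -> str:
    """Generate speech for the prompt with Bark and save it as a WAV file."""
    # Imported lazily: loading Bark downloads and loads large model files.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    audio = generate_audio(prompt, history_prompt=voice)
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path


if __name__ == "__main__":
    prompt = emphasize(
        "Well ... [sighs] I really did not expect that [laughs]", ["really"]
    )
    print(prompt)
    speak(prompt)
```

Markers in square brackets are passed through untouched; only the words you list are capitalized. How strongly Bark reacts to each marker varies between runs and voices, so expect to experiment.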
Two caveats about Bark: although it supports voice cloning, it does not provide this feature out of the box, and the length of audio you can generate in one go is limited. To address these issues, check out the two projects below.
- Adobe Podcast: Clean up the generated voices and make them even more realistic.
- Mp3Cut: Online MP3 Cutter to cut out a piece of music.
- Convertio: Easy tool to convert files online.
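On the audio length limitation mentioned above, one common workaround is to split long text into sentence-sized chunks, generate each chunk separately, and join the results with a short pause in between. The sketch below shows that idea. It assumes `bark` and `numpy` are installed. The function names, the 200-character chunk size and the 0.25 second pause are assumptions chosen for illustration, not values taken from Bark's documentation.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def generate_long(text: str, voice: str = "v2/en_speaker_6",
                  pause_seconds: float = 0.25):
    """Generate audio chunk by chunk and concatenate it into one array."""
    # Imported lazily: loading Bark downloads and loads large model files.
    import numpy as np
    from bark import SAMPLE_RATE, generate_audio, preload_models

    chunks = split_into_chunks(text)
    if not chunks:
        raise ValueError("text is empty")

    preload_models()
    silence = np.zeros(int(pause_seconds * SAMPLE_RATE), dtype=np.float32)
    pieces = []
    for chunk in chunks:
        pieces += [generate_audio(chunk, history_prompt=voice), silence]
    return SAMPLE_RATE, np.concatenate(pieces)
```

Passing the same voice preset to every chunk helps keep the speaker consistent across the joined audio, although some variation between chunks is still likely.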
Questions? Feedback? Requests? Discord: Samej2023