DoyenTalker uses deep learning to generate personalized avatar videos that speak user-provided text in a specified voice. The system uses Coqui TTS for text-to-speech generation, combined with face rendering and animation techniques, to create a video in which the chosen avatar articulates the speech.
- Text-to-Speech (TTS): Converts a user-provided text message into speech using the Coqui TTS engine.
- Avatar-based Animation: Creates a video where a user-selected avatar speaks the generated speech.
- Customizable Voice: Users can specify a voice sample to have the avatar speak in that voice.
- Multilingual Support: Supports multiple languages for speech synthesis (English, Spanish, French, German, and more).
- Face Rendering: Incorporates pose and eye-blink reference videos to enhance facial expression realism.
- Batch Processing: Supports the generation of videos in batches, useful for processing long texts by splitting them into smaller chunks.
- Face Enhancer (Optional): Uses face enhancement models such as GFP-GAN or RestoreFormer to improve the quality of the generated avatar’s face (see the usage sketch after this list).
- Background Enhancer (Optional): Uses Real-ESRGAN to enhance background visuals in the generated video.
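For reference, face enhancement with GFPGAN typically looks like the minimal sketch below when applied to a single rendered frame. The model path and file names are assumptions (the GFPGAN weights must be downloaded separately), and this project may wire the enhancer differently; a Real-ESRGAN upsampler can be passed as `bg_upsampler` to also enhance the background.

```python
import cv2
from gfpgan import GFPGANer

# Model path and file names are assumptions; download GFPGANv1.4.pth separately.
restorer = GFPGANer(
    model_path="gfpgan/weights/GFPGANv1.4.pth",
    upscale=1,                # keep the original resolution, only restore the face
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,        # plug in a Real-ESRGAN upsampler here to enhance the background too
)

frame = cv2.imread("results/frame_0001.png")
_, _, restored = restorer.enhance(frame, has_aligned=False, only_center_face=False, paste_back=True)
cv2.imwrite("results/frame_0001_enhanced.png", restored)
```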
- Input Text : The user provides a text message that they want the avatar to speak. The text is split into manageable chunks if it exceeds a certain length, ensuring efficient processing.
- Avatar Image: An avatar image is selected, which will be used as the visual representation of the character that will speak the text. The system processes this image to prepare it for animation.
- Voice Sample: A voice sample is provided by the user. This voice will be used to generate the speech for the text message. The user can choose from a variety of languages and voice options supported by Coqui TTS, such as English, Spanish, French, German, and others.
- Speech Generation (Coqui TTS): Using Coqui TTS, the system generates speech from the input text in the specified voice. If the text has been chunked, the speech is split across multiple audio files (see the chunked-synthesis sketch after this list).
- Face Rendering and Animation: The avatar’s face is animated to match the generated speech. The system processes the avatar image using 3DMM (3D Morphable Model) extraction techniques to capture facial expressions. It also integrates reference videos for eye-blinking and head movements to ensure natural-looking animations.
- Video Generation: Finally, the audio and animated avatar are combined into a video. The video can be rendered with custom poses and facial expressions, and optionally enhanced with the face and background enhancers described above (a muxing sketch follows this list).
- Output Video: The result is a video where the avatar accurately speaks the input text in the user-specified voice.
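The repository's own chunking and synthesis code may differ, but the chunked Coqui TTS step can be pictured roughly as below. The model name, paths, and output layout are assumptions for illustration; the real pipeline may pin a different Coqui model.

```python
import re
from TTS.api import TTS  # Coqui TTS

def chunk_text(message: str, max_chars: int = 250) -> list[str]:
    """Split a long message into sentence-aligned chunks of at most ~max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", message.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Model name and file paths below are illustrative assumptions.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
with open("test_message.txt", encoding="utf-8") as f:
    message = f.read()

for i, chunk in enumerate(chunk_text(message)):
    tts.tts_to_file(
        text=chunk,
        speaker_wav="backend/assets/voice/ab_voice.mp3",  # user-provided voice sample to clone
        language="en",
        file_path=f"results/speech_{i:03d}.wav",
    )
```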
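Similarly, the final assembly step can be pictured as concatenating the per-chunk speech and muxing it with the rendered (silent) animation. The file names below are hypothetical and ffmpeg is assumed to be on the PATH; the actual pipeline handles this internally.

```python
import subprocess
from pathlib import Path

# Hypothetical intermediate files; the real pipeline names its outputs differently.
audio_chunks = sorted(Path("results").glob("speech_*.wav"))
concat_list = Path("results/audio_list.txt")
concat_list.write_text("".join(f"file '{p.resolve()}'\n" for p in audio_chunks))

# 1) Concatenate the per-chunk speech into one track.
subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", str(concat_list), "results/full_speech.wav"],
    check=True,
)

# 2) Mux the speech with the silent rendered animation into the final video.
subprocess.run(
    ["ffmpeg", "-y", "-i", "results/animation.mp4", "-i", "results/full_speech.wav",
     "-c:v", "copy", "-c:a", "aac", "-shortest", "results/final_video.mp4"],
    check=True,
)
```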
Follow these steps after cloning the repository:
```bash
pip install uv
uv venv
.venv\Scripts\activate
uv pip install -r requirements.txt
```
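The activation command shown is for Windows; on Linux or macOS use `source .venv/bin/activate` instead. Then run the pipeline, pointing the arguments at your own files: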
```bash
python main.py --message_file "/content/drive/MyDrive/voice_cloning_data/test_message.txt" --voice "/content/DoyenTalker/backend/assets/voice/ab_voice.mp3" --lang en --avatar_image "/content/DoyenTalker/backend/assets/avatar/male10.jpeg"
```
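The flags map onto the workflow above: `--message_file` is the text to speak, `--voice` is the reference voice sample, `--lang` selects the synthesis language, and `--avatar_image` is the face to animate. The example uses Google Colab/Drive paths; replace them with local paths when running elsewhere.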