ViDubb is an advanced AI-powered video dubbing solution focused on delivering high-quality, efficient dubbing for multilingual content. By utilizing cutting-edge voice cloning technology, ViDubb generates realistic voiceovers in multiple languages with exceptional accuracy. The system ensures tight lip synchronization, matching the voiceovers to the actors' movements and providing a seamless viewing experience. This approach not only enhances the natural flow of dialogue but also preserves the authenticity of the original video. ViDubb streamlines the dubbing process, enabling faster turnaround times while maintaining top-tier audio and visual quality for global audiences.
| Original Video in French | ViDubb With Background in English | ViDubb Without Background in English |
|---|---|---|
| (demo video) | (demo video) | (demo video) |

| Original Video | ViDubb with LipSync |
|---|---|
| (demo video) | (demo video) |
ViDubb is an advanced AI-powered video dubbing solution designed to deliver high-quality, efficient dubbing for multilingual content. By integrating cutting-edge voice cloning technology and dynamic lip-sync synchronization, ViDubb ensures that voiceovers are perfectly aligned with the original video’s dialogue and actor movements, even when multiple speakers are involved, providing a seamless viewing experience across languages.
Leveraging state-of-the-art AI, ViDubb sets new standards in dubbing accuracy and naturalness, making it ideal for global content localization, film, media, and educational purposes. The tool enables content creators and businesses to quickly adapt their videos for international audiences while maintaining top-tier audio and visual quality.
- Download Direct Video from YouTube: Allows users to download videos directly from YouTube for immediate dubbing and localization, saving time and simplifying the workflow.
- Multi-Language Support: Offers dubbing in a variety of languages, ensuring broad global accessibility.
- AI Voice Cloning: Creates realistic, high-quality voiceovers that capture the tone and emotion of the original content.
- Dynamic Lip-Sync Technology: Ensures perfect synchronization with video visuals, even when multiple speakers are involved, enhancing realism and interactivity.
- Background Sound Preservation: Retains original background sounds to maintain the authenticity of the video.
- Efficient Dubbing Process: Streamlines the video dubbing workflow, enabling faster and more cost-effective localization.
- Sentence Tokenization: Breaks down content into manageable segments for better translation and synchronization.
- Speaker Diarization: Identifies and separates speakers in the audio, ensuring accurate voice assignment for each speaker during dubbing.
- Web Interface Support: Provides an intuitive web interface for easy upload, management, and control of dubbing projects.
- CPU and GPU Compatibility: Works seamlessly on both CPU and GPU systems, optimizing performance based on available resources.
Our mission is to provide an efficient and high-quality AI-driven dubbing solution that empowers content creators to expand their global reach, bringing videos to audiences in multiple languages with perfect synchronization and immersive quality.
- Implement sentence summarization.
- Improve the dynamic lip-sync technology for videos with many speakers.
- Deploy ViDubb on a Hugging Face Space.
ViDubb is an AI-powered video dubbing project that involves voice cloning, multilingual capabilities, lip-syncing, and background sound preservation. Follow the steps below to set up and run ViDubb.
Before starting, ensure you have Anaconda installed on your system. Anaconda is used to manage Python environments and dependencies.
- Remove any existing environment (if necessary):

  ```bash
  conda remove -n vidubbtest --all
  ```

- Create a new conda environment with Python 3.10.14 and IPython:

  ```bash
  conda create -n "vidubbtest" python=3.10.14 ipython
  ```

- Activate the environment:

  ```bash
  conda activate vidubbtest
  ```

- Clone the ViDubb repository from GitHub:

  ```bash
  git clone https://github.com/medahmedkrichen/ViDubb.git
  ```

- Navigate to the ViDubb directory:

  ```bash
  cd ViDubb
  ```

- Set up the `.env` file with your Hugging Face API and Groq API tokens:
  - Create a `.env` file in the `ViDubb` directory.
  - Add the following lines:

    ```bash
    HF_TOKEN="your_huggingface_token"
    Groq_TOKEN="your_groq_token"
    ```
> [!NOTE]
> You can obtain your `HF_TOKEN` from Hugging Face; it is required for speaker separation. Make sure to request access to pyannote/speaker-diarization-3.1.
>
> You can obtain your `Groq_TOKEN` from GroqCloud to use the free API model "llama3-70b" for translation instead of the standard model (optional).
> [!TIP]
> "llama3-70b" is effective for translating languages in the Latin family, but it is less effective for languages such as Arabic or Mandarin. If you choose not to use it, leave the Groq field empty.
- Install FFmpeg (for audio/video processing):

  ```bash
  sudo apt-get install ffmpeg
  ```

- Install the Python dependencies from the `requirements.txt` file:

  ```bash
  pip install -r requirements.txt
  ```

- Install PyTorch with CUDA support for GPU acceleration:

  ```bash
  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  ```

- Check that CUDA is available. Open a Python shell and run:

  ```python
  import torch
  print(torch.cuda.is_available())  # should print True if a usable GPU is detected
  ```

- Download the Wav2Lip model:

  ```bash
  wget 'https://iiitaphyd-my.sharepoint.com/personal/radrabha_m_research_iiit_ac_in/_layouts/15/download.aspx?share=EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA' -O 'Wav2Lip/wav2lip_gan.pth'
  ```

- Download the face detection model:

  ```bash
  wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "Wav2Lip/face_detection/detection/sfd/s3fd.pth"
  ```

- Run the inference script to process a video:

  ```bash
  python inference.py --yt_url "https://www.youtube.com/shorts/ULptP9egQ6Q" --source_language "en" --target_language "fr" --LipSync True --Bg_sound True
  ```
This command will:

- `--yt_url`: download the video from YouTube (use `--video_url` instead if you want to work with a local file).
- `--LipSync True`: perform lip-sync on the dubbed video.
- `--source_language "en"`: transcribe from English.
- `--target_language "fr"`: translate and dub into French.
- `--Bg_sound True`: preserve the background sounds of the original video, if wanted.
- Output a dubbed video with lip-syncing in the results directory.
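If you need to dub several local files in a row, one option is a small wrapper script around `inference.py`. This is a hedged sketch that reuses only the CLI flags shown in the help output below; the file paths are placeholders.

```python
# Hypothetical batch wrapper around inference.py, using only the documented CLI flags.
# The video paths are placeholders.
import subprocess
from pathlib import Path

videos = ["clips/interview.mp4", "clips/lecture.mp4"]  # placeholder local files

for video in videos:
    subprocess.run(
        [
            "python", "inference.py",
            "--video_url", video,          # local file instead of --yt_url
            "--source_language", "en",
            "--target_language", "fr",
            "--LipSync", "True",
            "--Bg_sound", "False",
        ],
        check=True,  # stop on the first failure
    )
    print(f"Finished dubbing {Path(video).name}")
```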
More options:

```
usage: inference.py [-h] (--yt_url YT_URL | --video_url VIDEO_URL)
                    --source_language SOURCE_LANGUAGE --target_language
                    TARGET_LANGUAGE [--whisper_model WHISPER_MODEL]
                    [--LipSync LIPSYNC] [--Bg_sound BG_SOUND]

Choose between YouTube or video URL

options:
  -h, --help            show this help message and exit
  --yt_url YT_URL       YouTube single video URL
  --video_url VIDEO_URL
                        Single video URL
  --source_language SOURCE_LANGUAGE
                        Video source language
  --target_language TARGET_LANGUAGE
                        Video target language
  --whisper_model WHISPER_MODEL
                        Choose the Whisper model based on your device
                        requirements
  --LipSync LIPSYNC     Lip synchronization of the result audio to the
                        synthesized video
  --Bg_sound BG_SOUND   Keep the background sound of the original video,
                        though it might be slightly noisy
```
> [!TIP]
> `--Bg_sound True` can introduce extra noise in videos whose original background sound is quiet.
>
> `--LipSync True` takes more time and memory.
- Start the web application:

  ```bash
  python app.py
  ```

- Access the app by opening a browser and going to:

  ```
  http://localhost:7860/
  ```
By following these steps, you should be able to set up and run ViDubb for video dubbing with AI-powered voice and lip synchronization.
The provided code implements a robust video dubbing pipeline, leveraging various machine learning and audio/video processing techniques. Here's a detailed breakdown of the key features and their underlying technical implementations:
- Speaker Diarization
  - Technical Implementation: Employs the `pyannote.audio` library, a state-of-the-art speaker diarization toolkit. It segments the audio into speaker turns, allowing for accurate identification of who is speaking at any given time.
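As an illustration, typical `pyannote.audio` 3.1 usage looks like the sketch below, using the checkpoint mentioned in the setup notes; ViDubb's internal call may be configured differently.

```python
# Sketch: speaker diarization with pyannote.audio 3.1.
# Requires an HF_TOKEN with access to pyannote/speaker-diarization-3.1.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_huggingface_token",
)

diarization = pipeline("audio.wav")

# Print who speaks when; these turns drive speaker-specific dubbing.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s -> {turn.end:.2f}s")
```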
- Lip-Sync (Optional)
  - Technical Implementation:
    - Frame Extraction: Uses OpenCV to extract frames from the video based on the speaker diarization results.
    - Face Detection: Leverages the Haar Cascade classifier or a more advanced deep learning-based face detector to locate faces within each frame.
    - Face Alignment and Normalization: Prepares the detected faces for further processing by aligning them to a standard template.
    - Lip-Sync Model: Employs a pre-trained lip-sync model, such as Wav2Lip, to generate realistic lip movements based on the input audio and extracted facial features.
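A minimal OpenCV sketch of the frame-extraction and Haar-cascade face-detection steps. Note that Wav2Lip normally relies on the S3FD detector downloaded during setup; this only illustrates the classical approach named above, with a placeholder video path.

```python
# Sketch: extract frames with OpenCV and detect faces with a Haar cascade.
import cv2

cap = cv2.VideoCapture("results/input_video.mp4")  # placeholder path
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

frame_faces = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    frame_faces.append(faces)  # one (x, y, w, h) box per detected face

cap.release()
print(f"Processed {len(frame_faces)} frames")
```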
- Text Transcription
  - Technical Implementation: Leverages the Whisper model, a robust speech-to-text model, to transcribe the audio content of the video into text. This provides a textual representation of the audio, which is crucial for subsequent text-based processing.
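A minimal sketch with the open-source `whisper` package; the model size is chosen via `--whisper_model` in the CLI, and "base" here is just an example.

```python
# Sketch: transcribe the extracted audio with the open-source Whisper package.
import whisper

model = whisper.load_model("base")  # pick a size that matches your hardware
result = model.transcribe("audio.wav", language="en")

print(result["text"])               # full transcript
for seg in result["segments"]:      # timestamped segments used for syncing
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}: {seg["text"]}')
```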
- Sentence Segmentation
  - Technical Implementation: Utilizes the NLTK library's sentence tokenization capabilities to divide the transcribed text into meaningful sentences. This segmentation is essential for accurate translation and emotion analysis.
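For example, with NLTK's Punkt tokenizer:

```python
# Sketch: split a transcript into sentences with NLTK's Punkt tokenizer.
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer data
from nltk.tokenize import sent_tokenize

transcript = "Welcome to the channel. Today we talk about AI dubbing. Let's start!"
sentences = sent_tokenize(transcript)
print(sentences)
# ['Welcome to the channel.', 'Today we talk about AI dubbing.', "Let's start!"]
```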
- Text Translation
  - Technical Implementation:
    - Direct Translation: Leverages the MarianMT model, a powerful machine translation system, to translate the segmented sentences from the source language to the target language.
    - Context-Aware Translation (Optional): Employs a large language model (LLM) like "llama3-70b-8192" to provide context-aware translations, improving the quality and accuracy of the translations.
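A sketch of the direct MarianMT path using Hugging Face `transformers`. The English-to-French checkpoint is an example; the pair would follow the source and target languages you pass on the CLI.

```python
# Sketch: translate segmented sentences with a MarianMT checkpoint.
# "Helsinki-NLP/opus-mt-en-fr" is an example English -> French pair.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = ["Welcome to the channel.", "Today we talk about AI dubbing."]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**batch)
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(translations)
```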
- Emotion Analysis (Optional)
  - Technical Implementation: Leverages a pre-trained emotion recognition model, such as the one provided by SpeechBrain, to analyze the emotions expressed in the audio segments. The model classifies emotions into categories like anger, happiness, sadness, and neutral.
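The SpeechBrain wav2vec2 IEMOCAP recipe is a common choice for this step; the sketch below follows that model card. The import path varies between SpeechBrain versions, and the exact checkpoint ViDubb uses is an assumption here.

```python
# Sketch: classify the emotion of an audio segment with a SpeechBrain model.
# Older SpeechBrain versions expose foreign_class under speechbrain.pretrained.interfaces.
from speechbrain.inference.interfaces import foreign_class

classifier = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier",
)

out_prob, score, index, text_lab = classifier.classify_file("segment.wav")
print(text_lab)  # e.g. ['ang'], ['hap'], ['sad'] or ['neu']
```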
- Audio Synthesis
  - Technical Implementation: Employs a text-to-speech (TTS) system, such as the one provided by the TTS library, to synthesize audio from the translated text. The TTS system can be further customized to match the speaker's voice and emotion.
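A sketch with Coqui TTS and its multilingual XTTS v2 model, which clones a voice from a short reference clip. The specific model and file names here are assumptions for illustration.

```python
# Sketch: synthesize a translated sentence with a cloned voice using Coqui TTS.
# XTTS v2 is an assumed model choice; file names are placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Bienvenue sur la chaîne.",
    speaker_wav="speaker_0_reference.wav",  # short clip of the original speaker
    language="fr",
    file_path="speaker_0_segment_0.wav",
)
```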
- Audio and Video Synchronization
  - Technical Implementation: Leverages FFmpeg to synchronize the generated audio with the original video, ensuring that the lip movements align with the spoken words.
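For illustration, muxing the dubbed audio onto the original video track can be done with a plain FFmpeg call like the one below; the flags are standard FFmpeg, but the exact command ViDubb builds may differ.

```python
# Sketch: replace the video's audio track with the generated dub using FFmpeg.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "original_video.mp4",    # video source
        "-i", "dubbed_audio.wav",      # generated dub
        "-map", "0:v", "-map", "1:a",  # video from input 0, audio from input 1
        "-c:v", "copy",                # do not re-encode the video stream
        "-shortest",                   # stop at the shorter of the two streams
        "results/output_dubbed.mp4",
    ],
    check=True,
)
```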
- Audio and Video Mixing
  - Technical Implementation: Employs libraries like PyDub to mix the original video with the newly generated audio, creating the final dubbed video.
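A PyDub sketch of overlaying the synthesized speech on the preserved background track; the file names are placeholders.

```python
# Sketch: overlay the synthesized speech on the preserved background track.
from pydub import AudioSegment

background = AudioSegment.from_file("background_sound.wav")
speech = AudioSegment.from_file("dubbed_speech.wav")

mixed = background.overlay(speech)  # speech starts at 0 ms by default
mixed.export("final_audio.wav", format="wav")
```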
By combining these techniques and leveraging the power of machine learning, the code effectively addresses the challenges of video dubbing, delivering high-quality results.
- Linly Dubbing
- Wav2Lip
- freeCodeCamp
- HuggingFace video-dubbing
- Kaggle free Notebook
- Colab free Notebook
- All open source models :)
> [!CAUTION]
> When using this tool, please comply with relevant laws, including copyright, data protection, and privacy laws. Do not use this tool without permission from the original author and/or rights holder.
ViDubb follows the Apache License 2.0. When using this tool, please comply with relevant laws, including copyright, data protection, and privacy laws. Do not use this tool without permission from the original author and/or rights holder.