JoyHallo: Digital human model for Mandarin

📖 Introduction

In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. Notably, JoyHallo maintains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities.

🎬 Videos-Mandarin-Woman

demo1.mp4

🎬 Videos-Mandarin-Man

demo2.mp4

🎬 Videos-English

demo3.mp4

🧳 Framework

⚙️ Installation

System requirements:

Tested on Ubuntu 20.04, Cuda 11.3
Tested GPUs: A100

Create environment:

# 1. Create base environment
conda create -n joyhallo python=3.10 -y
conda activate joyhallo

# 2. Install requirements
pip install -r requirements.txt

# 3. Install ffmpeg
sudo apt-get update  
sudo apt-get install ffmpeg -y

🎒 Prepare model checkpoints

1. Download base checkpoints

Use the following command to download the base weights:

git lfs install
git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models

2. Download chinese-wav2vec2-base model

Use the following command to download the chinese-wav2vec2-base model:

cd pretrained_models
git lfs install
git clone https://huggingface.co/TencentGameMate/chinese-wav2vec2-base

3. Download JoyHallo model

For convenience, we have uploaded the model weights to Hugging Face.

Model	Dataset	Hugging Face
JoyHallo	jdh-hallo	JoyHallo

4. pretrained_models contents

The final pretrained_models directory should look like this:

./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- hallo/
|   `-- net.pth
|-- joyhallo/
|   `-- net.pth
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   `-- unet/
|       |-- config.json
|       `-- diffusion_pytorch_model.safetensors
|-- wav2vec/
|   `-- wav2vec2-base-960h/
|       |-- config.json
|       |-- feature_extractor_config.json
|       |-- model.safetensors
|       |-- preprocessor_config.json
|       |-- special_tokens_map.json
|       |-- tokenizer_config.json
|       `-- vocab.json
`-- chinese-wav2vec2-base/
    |-- chinese-wav2vec2-base-fairseq-ckpt.pt
    |-- config.json
    |-- preprocessor_config.json
    `-- pytorch_model.bin

🚧 Data requirements

Image:

Cropped to square shape.
Face should be facing forward and occupy 50%-70% of the image.

Audio:

Use wav format.
Mandarin, English or mixed, with clear audio and suitable background music.

Notes: These requirements apply to both training and inference processes.

🚀 Inference

1. Inference with command line

Use the following command to perform inference:

sh joyhallo-infer.sh

Modify the parameters in configs/inference/inference.yaml to specify the audio and image files you want to use, as well as switch between models. The inference results will be saved in opts/joyhallo. The parameters in inference.yaml are explained as follows:

audio_ckpt_dir: Path to the model weights.
ref_img_path: Path to the reference images.
audio_path: Path to the reference audios.
output_dir: Output directory.
exp_name: Output file folder name.

2. Inference with web demo

Use the following command to start web demo:

sh joyhallo-app.sh

The demo will be create at http://127.0.0.1:7860.

⚓️ Train or fine-tune JoyHallo

You have two options when training or fine-tuning the model: start from Stage 1 or only train Stage 2 .

1. Use the following command to start training from Stage 1

sh joyhallo-alltrain.sh

This will automatically start training both stages (including Stage 1 and Stage 2), and you can adjust the training parameters by referring to configs/train/stage1_alltrain.yaml and configs/train/stage2_alltrain.yaml.

2. Use the following command to train only Stage 2

sh joyhallo-train.sh

This will start training from Stage 2 , and you can adjust the training parameters by referring to configs/train/stage2.yaml.

🎓 Prepare training data

1. Prepare the data in the following directory structure, ensuring that the data meets the requirements mentioned earlier

joyhallo/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   |-- 0003.mp4
|   `-- 0004.mp4

2. Use the following command to process the dataset

python -m scripts.data_preprocess --input_dir joyhallo/videos --step 1
python -m scripts.data_preprocess --input_dir joyhallo/videos --step 2

💻 Comparison

1. Accuracy comparison in Mandarin

Model	IQA $\uparrow$	VQA $\uparrow$	Sync-C $\uparrow$	Sync-D $\downarrow$	Smooth $\uparrow$	Subject $\uparrow$	Background $\uparrow$
Hallo	0.7865	0.8563	5.7420	13.8140	0.9924	0.9855	0.9651
JoyHallo	0.7781	0.8566	6.1596	14.2053	0.9925	0.9864	0.9627

Notes: The evaluation metrics used here are from the following repositories, and the results are for reference purposes only:

IQA and VQA: Q-Align
Sync-C and Sync-D: Syncnet
Smooth, Subject, and Background: VBench

2. Accuracy comparison in English

Model	IQA $\uparrow$	VQA $\uparrow$	Sync-C $\uparrow$	Sync-D $\downarrow$	Smooth $\uparrow$	Subject $\uparrow$	Background $\uparrow$
Hallo	0.7779	0.8471	4.4093	13.2340	0.9921	0.9814	0.9649
JoyHallo	0.7779	0.8537	4.7658	13.3617	0.9922	0.9838	0.9622

3. Inference efficiency comparison

	JoyHallo	Hallo	Improvement
GPU Memory (512*512, step 40)	19049m	19547m	2.5%
Inference Speed (16 frames)	24s	28s	14.3%

📝 Citations

If you find our work helpful, please consider citing us:

@misc{JoyHallo2024,
  title={JoyHallo: Digital human model for Mandarin},
  author={Sheng Shi and Xuyang Cao and Jun Zhao and Guoxin Wang},
  year={2024},
  url={https://github.com/jdh-algo/JoyHallo}
}

🤝 Acknowledgments

We would like to thank the contributors to the Hallo, wav2vec 2.0, Chinese-wav2vec2, Q-Align, Syncnet, VBench, and Moore-AnimateAnyone repositories, for their open research and extraordinary work.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
assets		assets
configs		configs
data		data
examples		examples
joyhallo		joyhallo
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README-zh.md		README-zh.md
README.md		README.md
accelerate_config.yaml		accelerate_config.yaml
joyhallo-alltrain.sh		joyhallo-alltrain.sh
joyhallo-app.sh		joyhallo-app.sh
joyhallo-infer.sh		joyhallo-infer.sh
joyhallo-train.sh		joyhallo-train.sh
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JoyHallo: Digital human model for Mandarin

📖 Introduction

🎬 Videos-Mandarin-Woman

🎬 Videos-Mandarin-Man

🎬 Videos-English

🧳 Framework

⚙️ Installation

🎒 Prepare model checkpoints

1. Download base checkpoints

2. Download chinese-wav2vec2-base model

3. Download JoyHallo model

4. pretrained_models contents

🚧 Data requirements

🚀 Inference

1. Inference with command line

2. Inference with web demo

⚓️ Train or fine-tune JoyHallo

1. Use the following command to start training from Stage 1

2. Use the following command to train only Stage 2

🎓 Prepare training data

1. Prepare the data in the following directory structure, ensuring that the data meets the requirements mentioned earlier

2. Use the following command to process the dataset

💻 Comparison

1. Accuracy comparison in Mandarin

2. Accuracy comparison in English

3. Inference efficiency comparison

📝 Citations

🤝 Acknowledgments

About

Releases

Packages

Languages

License

jdh-algo/JoyHallo

Folders and files

Latest commit

History

Repository files navigation

JoyHallo: Digital human model for Mandarin

📖 Introduction

🎬 Videos-Mandarin-Woman

🎬 Videos-Mandarin-Man

🎬 Videos-English

🧳 Framework

⚙️ Installation

🎒 Prepare model checkpoints

1. Download base checkpoints

2. Download chinese-wav2vec2-base model

3. Download JoyHallo model

4. pretrained_models contents

🚧 Data requirements

🚀 Inference

1. Inference with command line

2. Inference with web demo

⚓️ Train or fine-tune JoyHallo

1. Use the following command to start training from Stage 1

2. Use the following command to train only Stage 2

🎓 Prepare training data

1. Prepare the data in the following directory structure, ensuring that the data meets the requirements mentioned earlier

2. Use the following command to process the dataset

💻 Comparison

1. Accuracy comparison in Mandarin

2. Accuracy comparison in English

3. Inference efficiency comparison

📝 Citations

🤝 Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages