ReactFace: Online Multiple Appropriate Facial Reaction Generation in Dyadic Interactions

Project Page · Paper (TVCG) · Paper (arXiv) · Code

Demo videos: generated_sample1.mp4 · generated_sample2.mp4 · generated_sample3.mp4

📢 News

  • Our paper has been accepted by IEEE Transactions on Visualization and Computer Graphics (TVCG)! 🎉🎉 (Oct/2024)

📋 Table of Contents

  • 🛠️ Installation
  • 👨‍🏫 Getting Started
  • 🖊️ Citation
  • 🤝 Acknowledgements

🛠️ Installation

Prerequisites

  • Python 3.8+ (the environment below uses 3.9)
  • PyTorch 2.0.1 (pinned by the CUDA 11.8 / PyTorch3D wheels installed below)
  • CUDA 11.8+

Setup Environment

Create and activate conda environment

conda create -n react python=3.9
conda activate react

Install PyTorch

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

Install PyTorch3D

pip install --no-index --no-cache-dir pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py39_cu118_pyt201/download.html

Install other dependencies

pip install -r requirements.txt
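
Optionally, a quick sanity check (illustrative; the expected version string assumes the CUDA 11.8 wheels above) confirms that PyTorch sees the GPU and that PyTorch3D imports cleanly:

# Optional sanity check for the environment set up above.
import torch
import pytorch3d  # fails here if the wheel did not match the torch/CUDA versions

print(torch.__version__)            # expected: 2.0.1+cu118
print(torch.cuda.is_available())    # True if a CUDA 11.8-capable driver is visible
print(pytorch3d.__version__)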

👨‍🏫 Getting Started

1. Data Preparation

Download and Setup Dataset

The REACT 2023/2024 Multimodal Challenge Dataset is compiled from public dyadic-interaction datasets, including NoXI and RECOLA (see the directory layout below).

Apply for data access through the REACT 2023/2024 challenge organizers.

Data organization (data/) follows this structure:

data/partition/modality/site/chat_index/person_index/clip_index/actual_data_files

Example data structure:

data
├── test
├── val
├── train
   ├── Video_files
       ├── NoXI
           ├── 010_2016-03-25_Paris
               ├── Expert_video
               ├── Novice_video
                   ├── 1
                       ├── 1.png
                       ├── ....
                       ├── 751.png
                   ├── ....
           ├── ....
       ├── RECOLA
   ├── Audio_files
       ├── NoXI
       ├── RECOLA
           ├── group-1
               ├── P25 
               ├── P26
                   ├── 1.wav
                   ├── ....
           ├── group-2
           ├── group-3
   ├── Emotion
       ├── NoXI
       ├── RECOLA
           ├── group-1
               ├── P25 
               ├── P26
                   ├── 1.csv
                   ├── ....
           ├── group-2
           ├── group-3
   ├── 3D_FV_files
       ├── NoXI
       ├── RECOLA
           ├── group-1
               ├── P25 
               ├── P26
                   ├── 1.npy
                   ├── ....
           ├── group-2
           ├── group-3

Important details:

  • Task: Predict one role's reaction ('Expert' or 'Novice', 'P25' or 'P26') to the other
  • 3D_FV_files contain 3DMM coefficients (expression: 52 dim, angle: 3 dim, translation: 3 dim); a minimal loading sketch follows this list
  • Video specifications:
    • Frame rate: 25 fps
    • Resolution: 256x256
    • Clip length: 751 frames (~30s)
    • Audio sampling rate: 44,100 Hz
  • CSV files listing the training/validation/test splits are available at 'data/train.csv', 'data/val.csv', and 'data/test.csv'
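
To illustrate the layout and the 3DMM coefficient format described above, the sketch below builds clip paths following the data/partition/modality/site/chat_index/person_index/clip_index pattern and splits a loaded coefficient array into its expression, angle, and translation parts. The helper, the example values, and the assumed column order are for illustration only and are not part of the repository's code:

import os
import numpy as np

# Illustrative only: build paths following
# data/partition/modality/site/chat_index/person_index/clip_index/...
def clip_path(root, partition, modality, site, chat, person, clip):
    return os.path.join(root, partition, modality, site, chat, person, str(clip))

frames_dir = clip_path("data", "train", "Video_files",
                       "NoXI", "010_2016-03-25_Paris", "Expert_video", 1)   # contains 1.png ... 751.png
tdmm_file = clip_path("data", "train", "3D_FV_files",
                      "RECOLA", "group-1", "P25", 1) + ".npy"               # .../P25/1.npy

# Assumed column order of the 58-dim 3DMM coefficients: expression, angle, translation.
coeffs = np.load(tdmm_file)          # shape: (751, 58) for a ~30 s clip
expression  = coeffs[:, :52]         # 52-dim expression
angle       = coeffs[:, 52:55]       # 3-dim head rotation
translation = coeffs[:, 55:58]       # 3-dim head translation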
Download Additional Resources
  1. Listener Reaction Neighbors
    • Download the appropriate listener reaction neighbors dataset from here
    • Place the downloaded files in the dataset root folder
  2. Ground Truth 3DMMs
    • Download the ground truth 3DMMs (test set) for speaker-listener evaluation from here
    • Place the downloaded files in the metric/gt folder

2. External Tool Preparation

Required Models and Tools

We use 3DMM coefficients for 3D listener/speaker representation and 3D-to-2D frame rendering.

  1. 3DMM Model Setup

  2. PIRender Setup

    • We use PIRender for 3D-to-2D rendering
    • Download our retrained checkpoint (cur_model_fold.pth)
    • Place in external/PIRender/

3. Training

Training Options

Training with rendering during validation:

python train.py \
  --batch-size 8 \
  --window-size 64 \
  --momentum 0.1 \
  --gpu-ids 0 \
  -lr 0.00002 \
  -e 200 \
  -j 4 \
  --sm-p 10 \
  --kl-p 0.00001 \
  --div-p 100 \
  --rendering \
  --outdir results/train-reactface

Training without rendering during validation (faster):

python train.py \
  --batch-size 8 \
  --window-size 64 \
  --momentum 0.1 \
  --gpu-ids 0 \
  -lr 0.00002 \
  -e 200 \
  -j 4 \
  --sm-p 10 \
  --kl-p 0.00001 \
  --div-p 100 \
  --outdir results/train-reactface
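
The --sm-p, --kl-p and --div-p flags above appear to weight smoothness, KL-divergence and diversity loss terms (an assumption based on the flag names; the repository's loss code is authoritative). A conceptual sketch of such a weighted sum, using the same weights as the commands above:

import torch

# Conceptual sketch only, NOT the repository's training code.
# Assumes --sm-p, --kl-p and --div-p weight smoothness, KL and diversity terms.
def weighted_loss(rec, smooth, kl, div, sm_p=10.0, kl_p=1e-5, div_p=100.0):
    return rec + sm_p * smooth + kl_p * kl + div_p * div

# Dummy scalar losses for illustration.
print(weighted_loss(torch.tensor(1.0), torch.tensor(0.05),
                    torch.tensor(4.0), torch.tensor(0.01)))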

4. Evaluation

Generate Results

To generate listener reactions using a trained ReactFace model, run:

python evaluate.py \
  --split test \
  --batch-size 16 \
  --window-size 8 \
  --momentum 0.9 \
  --gpu-ids 0 \
  -j 4 \
  --rendering \
  --outdir results/eval \
  --resume results/train-reactface/best_checkpoint.pth
Metric-based Evaluations

Our evaluation methodology is based on established research on multiple appropriate facial reaction generation:

Paper 1 · Paper 2 · Paper 3

Metrics Overview

Diversity Metrics
  • FRDvs: Measures diversity across speaker behavior conditions
  • FRVar: Evaluates diversity within a single generated facial reaction sequence
  • FRDiv: Assesses diversity of different generated listener reactions to the same speaker behavior (an illustrative computation sketch follows this overview)
Quality Metrics
  • FRRea: Uses Fréchet Video Distance (FVD) to evaluate realism of generated video sequences
  • FRCorr: Measures appropriateness by correlating each generated facial reaction with its most similar real facial reaction
  • FRSyn: Evaluates synchronization between generated listener reactions and varying speaker sequences
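
As a rough illustration of what the diversity metrics above capture, the sketch below computes a per-sequence temporal variance (FRVar-like) and a mean pairwise distance between alternative generations for the same speaker (FRDiv-like) over 3DMM coefficient arrays. The exact definitions and normalizations in the referenced papers differ; this is not the evaluation code in metric/:

import numpy as np

# Illustrative only: rough analogues of FRVar and FRDiv on 3DMM coefficients.
# preds has shape (num_samples, frames, dims): alternative generated reactions
# to the SAME speaker behavior.
def frvar_like(preds):
    """Mean temporal variance within each generated sequence."""
    return float(preds.var(axis=1).mean())

def frdiv_like(preds):
    """Mean pairwise squared L2 distance between alternative generations."""
    n = preds.shape[0]
    dists = [np.mean((preds[i] - preds[j]) ** 2)
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

preds = np.random.randn(10, 751, 58)   # 10 samples, 751 frames, 58-dim 3DMM
print(frvar_like(preds), frdiv_like(preds))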

Running Evaluation

Execute the following command to compute all metrics:

python evaluate_metric.py \
  --split test \
  --gt-speaker-3dmm-path ./metric/gt/tdmm_speaker.npy \
  --gt-listener-3dmm-path ./metric/gt/tdmm_listener.npy \
  --gn-listener-3dmm-path ./results/eval/test/coeffs/tdmm_10x.npy

Assessing realism with FVD:

  • Download the pretrained I3D model (rgb_imagenet.pt) from the linked library
  • Place the model in the folder metric/FVD/pytorch_i3d_model/models
  • Execute the following command to compute the FVD metric (a sketch of the underlying Fréchet distance follows the command):
python metric/FVD/fvd_eval.py \
  --source-dir /path/to/ground-truth/listener/videos \
  --target-dir /path/to/generated/listener/videos \
  --model-path metric/FVD/pytorch_i3d_model/models/rgb_imagenet.pt \
  --num-videos 100 \
  --frame-size 224 \
  --max-frames 750
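
For reference, FVD is the Fréchet distance between Gaussians fitted to I3D features of real and generated videos; the script above handles feature extraction, and the sketch below only shows the distance itself on precomputed feature matrices (shapes are illustrative):

import numpy as np
from scipy.linalg import sqrtm

# Fréchet distance between Gaussians fitted to two feature sets
# (rows = videos, columns = I3D feature dimensions). Shapes are illustrative.
def frechet_distance(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

real = np.random.randn(100, 400)   # e.g. 100 real videos, 400-dim features
gen  = np.random.randn(100, 400)   # 100 generated videos
print(frechet_distance(real, gen))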

5. Customized Inference

Generate Dyadic Reaction with Custom Speaker Video

Execute the following command to generate a listener's reaction to your speaker video:

python dyadic_reaction_inference.py \
    --speaker-video /path/to/your_video.mp4 \
    --speaker-audio /path/to/your_audio.wav \
    --listener-portrait /path/to/your_portrait.png \
    --window-size 8 \
    --momentum 0.9 \
    --output-dir results/customized_inference \
    --checkpoint results/train-reactface/best_checkpoint.pth

Required Inputs:

  • speaker-video: Path to the input speaker video file (MP4 format)
  • speaker-audio: Path to the speaker's audio file (WAV format)
  • listener-portrait: Path to the portrait photo of your custom listener (PNG format)

Optional Parameters:

  • window-size: Size of the temporal window (default: 8)
  • momentum: controls how quickly the generated reaction evolves over time (default: 0.9); one plausible reading is sketched after this list
  • output-dir: Directory for saving generated results
  • checkpoint: Path to the trained model checkpoint
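
One plausible reading of window-size and momentum (an assumption for illustration, not the repository's implementation): the reaction is generated window by window, and a momentum term blends each new window with the previous one, so higher momentum yields smoother, slower-changing reactions:

import numpy as np

# Illustrative only: windowed online generation with momentum blending.
# generate_window is a stand-in for the model and is NOT part of this repo.
def generate_window(speaker_chunk, rng):
    return rng.standard_normal((speaker_chunk.shape[0], 58))   # fake 3DMM frames

def online_generate(speaker_feats, window_size=8, momentum=0.9, seed=0):
    rng = np.random.default_rng(seed)
    outputs, prev = [], None
    for start in range(0, speaker_feats.shape[0], window_size):
        chunk = speaker_feats[start:start + window_size]
        new = generate_window(chunk, rng)
        if prev is not None:
            # Higher momentum -> the output changes more slowly between windows.
            new = momentum * prev[-1] + (1 - momentum) * new
        outputs.append(new)
        prev = new
    return np.concatenate(outputs, axis=0)

listener_coeffs = online_generate(np.zeros((751, 128)))   # dummy speaker features
print(listener_coeffs.shape)                              # (751, 58)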

🖊️ Citation

If this work helps in your research, please cite the following papers:

@article{10756784,
  author={Luo, Cheng and Song, Siyang and Xie, Weicheng and Spitale, Micol and Ge, Zongyuan and Shen, Linlin and Gunes, Hatice},
  journal={IEEE Transactions on Visualization and Computer Graphics}, 
  title={ReactFace: Online Multiple Appropriate Facial Reaction Generation in Dyadic Interactions}, 
  year={2024},
  volume={},
  number={},
  pages={1-18},
}


@article{luo2023reactface,
  title={Reactface: Multiple appropriate facial reaction generation in dyadic interactions},
  author={Luo, Cheng and Song, Siyang and Xie, Weicheng and Spitale, Micol and Shen, Linlin and Gunes, Hatice},
  journal={arXiv preprint arXiv:2305.15748},
  year={2023}
}

🤝 Acknowledgements

Thanks to the open-source projects this work builds on, including PIRender, PyTorch3D, and the PyTorch I3D implementation used for FVD.