Junyu Xie¹, Tengda Han¹, Max Bain¹, Arsha Nagrani¹, Gül Varol¹ ², Weidi Xie¹ ³, Andrew Zisserman¹

¹ Visual Geometry Group, Department of Engineering Science, University of Oxford
² LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
³ CMIC, Shanghai Jiao Tong University
- Basic Dependencies: `pytorch=2.0.0`, `Pillow`, `pandas`, `decord`, `opencv`, `moviepy=1.0.3`, `transformers=4.37.2`, `accelerate==0.26.1`
- VideoLLaMA2: After installation, modify the `sys.path.append("/path/to/VideoLLaMA2")` in `stage1/main.py` and `stage1/utils.py`. Please download the VideoLLaMA2-7B checkpoint here.
- Set up the cache model path (for LLaMA3, etc.) by modifying `os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"` in `stage1/main.py` and `stage2/main.py`. A combined sketch of these edits is shown below.
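The two edits above amount to the following lines; this is a minimal sketch with placeholder paths, to be replaced with your local VideoLLaMA2 checkout and cache directory:

```python
import os
import sys

# In stage1/main.py and stage1/utils.py: make the VideoLLaMA2 repo importable.
sys.path.append("/path/to/VideoLLaMA2")

# In stage1/main.py and stage2/main.py: point the Hugging Face cache
# (used for LLaMA3, etc.) at a writable location, before loading any models.
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"
```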
In this work, we evaluate our model on CMD-AD, MAD-Eval, and TV-AD.
- CMD-AD can be downloaded here.
- MAD-Eval can be downloaded here.
- TV-AD adopts a subset of TV-QA as visual sources (3 fps), and can be downloaded here. Each folder containing .jpg video frames needs to be converted to a .tar file, which can be done with the code provided in `tools/compress_subdir.py` (a rough Python equivalent is also sketched after this list). For example,

```sh
# --root_dir: downloaded raw (.jpg folders) files from TVQA
# --save_dir: destination for the compressed .tar files
python tools/compress_subdir.py \
    --root_dir="resources/example_file_structures/tvad_raw/" \
    --save_dir="resources/example_file_structures/tvad/"
```
- All annotations can be found in `resources/annotations`.
- The AutoAD-Zero predictions can be downloaded here.
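For reference, a rough Python equivalent of the conversion performed by `tools/compress_subdir.py` might look as follows; the function name and exact behaviour here are illustrative assumptions, not the script's actual implementation:

```python
import tarfile
from pathlib import Path

def compress_subdirs(root_dir: str, save_dir: str) -> None:
    """Pack each subdirectory of root_dir (a folder of .jpg frames) into its own .tar."""
    root, save = Path(root_dir), Path(save_dir)
    save.mkdir(parents=True, exist_ok=True)
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        with tarfile.open(save / f"{sub.name}.tar", "w") as tar:
            tar.add(sub, arcname=sub.name)  # keep the folder name inside the archive

compress_subdirs("resources/example_file_structures/tvad_raw/",
                 "resources/example_file_structures/tvad/")
```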
```sh
# e.g. --dataset="cmdad"
# e.g. --anno_path="resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
# e.g. --charbank_path="resources/charbanks/cmdad_charbank.json"
python stage1/main.py \
    --dataset={dataset} \
    --video_dir={video_dir} \
    --anno_path={anno_path} \
    --charbank_path={charbank_path} \
    --model_path={videollama2_ckpt_path} \
    --output_dir={output_dir}
```
- `--dataset`: choices are `cmdad`, `madeval`, and `tvad`.
- `--video_dir`: directory of the video datasets; example file structures can be found in `resources/example_file_structures` (files are empty, for reference only).
- `--anno_path`: path to the AD annotations (with predicted face IDs and bounding boxes), available in `resources/annotations`.
- `--charbank_path`: path to the external character banks, available in `resources/charbanks`.
- `--model_path`: path to the VideoLLaMA2 checkpoint.
- `--output_dir`: directory in which to save the output CSV.
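Putting the flags together, a stage-1 run for CMD-AD could be launched as below; the video directory, checkpoint path, and output directory are placeholders for your local setup:

```python
import subprocess

subprocess.run([
    "python", "stage1/main.py",
    "--dataset=cmdad",
    "--video_dir=/path/to/cmdad_videos",                                   # placeholder
    "--anno_path=resources/annotations/cmdad_anno_with_face_0.2_0.4.csv",
    "--charbank_path=resources/charbanks/cmdad_charbank.json",
    "--model_path=/path/to/VideoLLaMA2-7B",                                # placeholder
    "--output_dir=outputs/stage1_cmdad",                                   # placeholder
], check=True)
```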
```sh
# e.g. --dataset="cmdad"
python stage2/main.py \
    --dataset={dataset} \
    --pred_path={stage1_result_path}
```
- `--dataset`: choices are `cmdad`, `madeval`, and `tvad`.
- `--pred_path`: path to the CSV file saved by stage 1.
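A matching stage-2 run, assuming stage 1 saved its CSV under the output directory used above; the exact filename depends on stage 1's naming, so adjust `stage1_csv` accordingly:

```python
import subprocess

stage1_csv = "outputs/stage1_cmdad/result.csv"  # replace with the CSV actually saved by stage 1

subprocess.run([
    "python", "stage2/main.py",
    "--dataset=cmdad",
    f"--pred_path={stage1_csv}",
], check=True)
```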
If you find this repository helpful, please consider citing our work:
```bibtex
@InProceedings{xie2024autoad0,
  title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G\"ul Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}
```
- VideoLLaMA2: https://github.com/DAMO-NLP-SG/VideoLLaMA2
- LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct