Zhuoyan Luo*, Yicheng Xiao*, Yong Liu*, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
Tsinghua University Intelligent Interaction Group
- Jan. 1, 2024: We release the code for the ICCV 2023 Workshop: The 5th Large-scale Video Object Segmentation Challenge.
- Oct. 29, 2023: Code is released now.
- Sep. 22, 2023: Our paper is accepted by NeurIPS 2023!
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations.
(a) and (b) are segmentation results of our SOC and ReferFormer. For more details, please refer to the paper.
- install pytorch
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
- install other dependencies
pip install h5py opencv-python protobuf av einops ruamel.yaml timm joblib pandas matplotlib cython scipy
- install transformers numpy
pip install transformers==4.24.0
pip install numpy==1.23.5
- install pycocotools
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
- build up MultiScaleDeformableAttention
cd ./models/ops
python setup.py build install
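A quick way to confirm the build succeeded; this assumes the extension is registered under the usual module name MultiScaleDeformableAttention (as in Deformable-DETR-based repositories):

# if this import succeeds, the deformable attention CUDA op was built and installed correctly
python -c "import MultiScaleDeformableAttention; print('MultiScaleDeformableAttention: OK')"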
The overall data preparation is laid out as follows. We put rvosdata under /mnt/data_16TB/lzy23/rvosdata; please change it to xxx/rvosdata according to your own path.
rvosdata
├── a2d_sentences/
│   ├── Release/
│   │   ├── videoset.csv (videos metadata file)
│   │   └── CLIPS320/
│   │       └── *.mp4 (video files)
│   └── text_annotations/
│       ├── a2d_annotation.txt (actual text annotations)
│       ├── a2d_missed_videos.txt
│       └── a2d_annotation_with_instances/
│           └── */ (video folders)
│               └── *.h5 (annotations files)
├── refer_youtube_vos/
│   ├── train/
│   │   ├── JPEGImages/
│   │   │   └── */ (video folders)
│   │   │       └── *.jpg (frame image files)
│   │   └── Annotations/
│   │       └── */ (video folders)
│   │           └── *.png (mask annotation files)
│   ├── valid/
│   │   └── JPEGImages/
│   │       └── */ (video folders)
│   │           └── *.jpg (frame image files)
│   └── meta_expressions/
│       ├── train/
│       │   └── meta_expressions.json (text annotations)
│       └── valid/
│           └── meta_expressions.json (text annotations)
└── coco/
    ├── train2014/
    ├── refcoco/
    │   ├── instances_refcoco_train.json
    │   └── instances_refcoco_val.json
    ├── refcoco+/
    │   ├── instances_refcoco+_train.json
    │   └── instances_refcoco+_val.json
    └── refcocog/
        ├── instances_refcocog_train.json
        └── instances_refcocog_val.json
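One way to lay the data out without copying it is to symlink the downloaded datasets into the rvosdata root; all paths below are placeholders and should point at wherever you actually downloaded the raw data:

# placeholder paths throughout -- adjust to your own storage layout
RVOS_ROOT=/your/path/rvosdata
mkdir -p "$RVOS_ROOT"
ln -s /downloads/a2d_sentences      "$RVOS_ROOT/a2d_sentences"
ln -s /downloads/refer_youtube_vos  "$RVOS_ROOT/refer_youtube_vos"
ln -s /downloads/coco               "$RVOS_ROOT/coco"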
We create a folder for storing all pretrained models under /mnt/data_16TB/lzy23/pretrained; please change it to xxx/pretrained according to your own path.
pretrained
├── pretrained_swin_transformer
└── pretrained_roberta
- For the pretrained_swin_transformer folder, download Video-Swin-Base.
- For the pretrained_roberta folder, download config.json, pytorch_model.bin, tokenizer.json, and vocab.json from Hugging Face (roberta-base).
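One way to fetch the roberta-base files and verify that they load locally; the target directory is a placeholder, and any other download method works just as well:

# download the four files listed above into the pretrained_roberta folder (placeholder path)
cd /your/path/pretrained/pretrained_roberta
for f in config.json pytorch_model.bin tokenizer.json vocab.json; do wget "https://huggingface.co/roberta-base/resolve/main/$f"; done
# optional check: the local copy should load without contacting the hub
python -c "from transformers import RobertaTokenizerFast, RobertaModel; RobertaTokenizerFast.from_pretrained('.'); RobertaModel.from_pretrained('.'); print('roberta-base: OK')"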
The checkpoints are as follows:
Setting | Backbone | Checkpoint |
---|---|---|
a2d_from_scratch | Video-Swin-T | Model |
a2d_with_pretrain | Video-Swin-T | Model |
a2d_with_pretrain | Video-Swin-B | Model |
ytb_from_scratch | Video-Swin-T | Model |
ytb_with_pretrain | Video-Swin-T | Model |
ytb_with_pretrain | Video-Swin-B | Model |
ytb_joint_train | Video-Swin-T | Model |
ytb_joint_train | Video-Swin-B | Model |
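After downloading, an optional sanity check that a checkpoint deserializes cleanly; the filename below is only an example, not the actual release name:

# quick integrity check on a downloaded checkpoint (illustrative filename)
python -c "import torch; ckpt = torch.load('soc_video_swin_tiny.pth', map_location='cpu'); print(list(ckpt.keys())[:10] if isinstance(ckpt, dict) else type(ckpt))"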
We put all outputs under a single directory. Specifically, we set /mnt/data_16TB/lzy23/SOC as the output directory; please change it to xxx/SOC.
We only use Video-Swin-T as the backbone to train and evaluate in this setting.
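A small sketch for setting up the output directory and locating every hard-coded path you still need to edit; the destination path is a placeholder:

# create the output directory referenced by the configs and scripts (placeholder path)
mkdir -p /your/path/SOC
# list every file that still contains the authors' absolute path, so you can edit it
grep -rln "/mnt/data_16TB/lzy23" ./configs ./scripts ./datasets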
- A2D: Run the script ./scripts/train_a2d.sh and make sure to change the path /mnt/data_16TB/lzy23 to your own path (the same applies to the following steps).

bash ./scripts/train_a2d.sh

The key parameters are as follows; change ./configs/a2d_sentences.yaml accordingly:

lr | backbone_lr | bs | GPU_num | Epoch | lr_drop |
---|---|---|---|---|---|
5e-5 | 5e-6 | 2 | 2 | 40 | 15 (0.2) |
- Ref-Youtube-VOS: Run the script ./scripts/train_ytb.sh.

bash ./scripts/train_ytb.sh

The main parameters are as follows; please change ./configs/refer_youtube_vos.yaml according to this setting:

lr | backbone_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
---|---|---|---|---|---|---|---|
1e-4 | 1e-5 | 1 | 65 | 8 | true | 20 (0.1) | 30 |

Change the dataset_path in ./datasets/refer_youtube_vos/refer_youtube_vos_dataset.py to your own path, e.g. with the one-liner below.
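This assumes the authors' rvosdata path appears literally in that file; verify the match with grep before running:

# rewrite the hard-coded rvosdata root in the Ref-Youtube-VOS dataset file (placeholder replacement path)
sed -i 's#/mnt/data_16TB/lzy23/rvosdata#/your/path/rvosdata#g' ./datasets/refer_youtube_vos/refer_youtube_vos_dataset.py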
We pretrain and finetune on the A2D-Sentences and Ref-Youtube-VOS datasets using Video-Swin-Tiny and Video-Swin-Base. Following previous work, we first pretrain on the RefCOCO dataset and then finetune.
- Pretrain

The following are the key parameters for pretraining. When pretraining, please specify the corresponding backbone (Video-Swin-T or Video-Swin-B):

lr | backbone_lr | text_encoder_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
---|---|---|---|---|---|---|---|---|
1e-4 | 1e-5 | 5e-6 | 8 | 1 | 8 | False | 15, 20 (0.1) | 30 |
- Ref-Youtube-VOS

We finetune the pretrained weights using the following key parameters:

lr | backbone_lr | text_encoder_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
---|---|---|---|---|---|---|---|---|
1e-4 | 1e-5 | 5e-6 | 8 | 1 | 8 | False | 10 (0.1) | 25 |
- A2D-Sentences

We finetune the pretrained weights on A2D-Sentences using the following key parameters:

lr | backbone_lr | text_encoder_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
---|---|---|---|---|---|---|---|---|
3e-5 | 3e-6 | 1e-6 | 1 | 1 | 8 | true | - | 20 |
We only perform joint training on the Ref-Youtube-VOS dataset with Video-Swin-Tiny and Video-Swin-Base.
- Ref-Youtube-VOS: Run the script ./scripts/train_joint.sh. Remember to change the path and the backbone name before running.

The main parameters (Tiny and Base) are as follows:

lr | backbone_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
---|---|---|---|---|---|---|---|
1e-4 | 1e-5 | 1 | 1 | 8 | true | 20 (0.1) | 30 |
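All multi-GPU settings above assume the number of visible devices matches GPU_num in the corresponding yaml; a quick check before launching:

# confirm the visible GPU count matches GPU_num in the yaml
python -c "import torch; print(torch.cuda.device_count())"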
- A2D-Sentences: Run the script ./scripts/eval_a2d.sh and remember to specify the checkpoint_path in the config file.
- JHMDB-Sentences: Please refer to Link to prepare the dataset and specify the checkpoint path in the yaml file. Following the previous setting, we directly use the checkpoint trained on A2D-Sentences for testing.
- Ref-Youtube-VOS

bash ./scripts/infer_ref_ytb.sh

Remember to specify the checkpoint_path and the video backbone name (see the submission sketch after this list).
- Ref-DAVIS2017: Please refer to Link to prepare the DAVIS dataset. We provide infer_davis.sh for evaluation. Remember to specify the checkpoint_path and the video backbone name.
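Note that masks predicted on the Ref-Youtube-VOS valid split are scored on the official challenge server rather than locally; a typical submission is a zip of the predicted Annotations folder. The output directory below is a placeholder for wherever infer_ref_ytb.sh writes its results:

# zip the predicted masks for upload to the Ref-Youtube-VOS evaluation server (placeholder output path)
cd /your/path/SOC/ref_ytb_inference_output
zip -q -r submission.zip Annotations/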
We provide an interface for inference:
bash ./scripts/demo_video.sh
Code in this repository is built upon several public repositories. Thanks for the wonderful works ReferFormer and MTTR.
If you find this work useful for your research, please cite:
@inproceedings{SOC,
author = {Zhuoyan Luo and
Yicheng Xiao and
Yong Liu and
Shuyan Li and
Yitong Wang and
Yansong Tang and
Xiu Li and
Yujiu Yang},
title = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
Segmentation},
booktitle = {NeurIPS},
year = {2023},
}