This is the official implementation of the ACL 2023 paper "VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions"
- release dialogues
- release features (ResNet, RCNN)
- release test data (2024.07.17)
- release metadata (genres, keywords, storyline, characters: name, avatar)
- release frames
Supported languages: English, Simplified Chinese (简体中文)
- Downloads
Storage: Train (196M); Valid (11.6M); Test (24M)
Links: BaiduNetDisk or GoogleDrive
- Statistics
| Split | Clips | Dialogues | Scenes/Clip | Topics/Clip |
| --- | --- | --- | --- | --- |
| Train | 172,041 | 4,319,381 | 2.42 | 3.68 |
| Val | 9,753 | 250,311 | 2.64 | 4.29 |
| Test | 9,779 | 250,436 | 2.56 | 4.12 |
- Format
{
  "dialogs": [
    {
      "clip_id": "Friends_S01E01_clip_000",
      "dialog": ["hi", ...],
      "scene": [1, 1, 1, 1, 1, 1, 2, 2, ...],
      "session": [1, 1, 1, 2, 2, 2, 3, 3, ...]
    },
    ...
  ]
}
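To read this file in Python, here is a minimal sketch using only the standard library (the path inputs/full/train.json is an assumption, matching the preprocessing step below; adjust it to wherever you placed the downloaded split):

import json

# Load the dialogue annotations for one split (path is an assumption).
with open("inputs/full/train.json", "r", encoding="utf-8") as f:
    data = json.load(f)

for clip in data["dialogs"]:
    utterances = clip["dialog"]    # one string per utterance
    scenes = clip["scene"]         # scene label per utterance
    sessions = clip["session"]     # topic/session label per utterance
    # Scene boundaries are positions where consecutive labels differ.
    boundaries = [i for i in range(1, len(scenes)) if scenes[i] != scenes[i - 1]]
    print(clip["clip_id"], len(utterances), boundaries)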
- Downloads
Storage: RCNN (246.2G), ResNet (109G)
Links: BaiduNetDisk
- Format
File Structure:
# [name of TV show]_S[season]E[episode]_clip_[clip id].npy
├── Friends_S01E01
│   ├── Friends_S01E01_clip_000.npy
│   ├── Friends_S01E01_clip_001.npy
│   └── ...
├── ...
ResNet:
# numpy.load("Friends_S01E01_clip_000.npy")
(num_of_frames * 1000)  # array, one ResNet feature vector per frame
RCNN:
# numpy.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
{
"feature": (9 * num_of_frames * 2048) # array(float32), feature top 9 objects
"size": (num_of_frames * 2) # list(int), size of original frame
"box": (9 * num_of_frames * 4) # array(float32), bbox
"obj_id": (9 * num_of_frames) # list(int), object id
"obj_conf": (9 * num_of_frames) # array(float32), object conference
"obj_num": (num_of_frames) # list(int), number of objects/frame
}
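A minimal sketch of loading both feature types for one clip, following the shapes above (the resnet/ and rcnn/ directory prefixes are assumptions; the keys and numpy.load calls come from the format description):

import numpy as np

# Directory prefixes below are assumptions; point them at the extracted archives.
resnet_feat = np.load("resnet/Friends_S01E01/Friends_S01E01_clip_000.npy")
num_frames = resnet_feat.shape[0]          # one 1000-d vector per sampled frame

rcnn = np.load("rcnn/Friends_S01E01/Friends_S01E01_clip_000.npy", allow_pickle=True).item()
obj_feat = rcnn["feature"]                 # features of the top 9 objects per frame
boxes = rcnn["box"]                        # bounding boxes of the same objects
obj_conf = rcnn["obj_conf"]                # detector confidence per object
obj_num = rcnn["obj_num"]                  # number of detected objects per frame
print(num_frames, obj_feat.shape, boxes.shape, obj_conf.shape, len(obj_num))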
- Feature Extraction Tools
Please refer to OpenViDial_extract_features.
pip install -r requirements.txt
- Preprocess
Move train.json, valid.json, and test.json to the inputs/full directory.
Run the following script to convert the original data into the binary format used by our baselines (see our paper for details):
cd inputs/full
python preprocess.py
- Train
python train_seg.py \
--video 1 \
--exp_set EXP_LOG \
--train_batch_size 4
- Infer
python generate_seg.py \
--ckptid SAVED_CKPT_ID \
--gpuid 0 \
--exp_set EXP_LOG \
--video 1
- Train
python train_seg.py \
--video 0 \
--exp_set EXP_LOG \
--train_batch_size 4
- Infer
python generate_seg.py \
--ckptid SAVED_CKPT_ID \
--gpuid 0 \
--exp_set EXP_LOG \
--video 0
To use coco_caption for evaluation, run the following script to generate the reference file:
cd inputs/full
python coco_caption_reformat.py
For evaluation details, please refer to https://github.com/tylin/coco-caption
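If you prefer to compute the metrics from Python, here is a sketch using the pip-installable pycocoevalcap package (a packaged distribution of the repo above; using it here is an assumption about your setup, and the toy reference/hypothesis dicts stand in for the files produced by coco_caption_reformat.py and your model):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both scorers expect dicts mapping an example id to a list of sentences.
refs = {"0": ["how are you doing today"]}   # ground-truth responses (toy example)
hyps = {"0": ["how are you today"]}         # generated responses (toy example)

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)   # BLEU-1..4
cider_score, _ = Cider().compute_score(refs, hyps)
print(bleu_scores, cider_score)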
- Train
python train_gen.py \
--train_batch_size 4 \
--model bart \
--exp_set EXP_LOG \
--video 1 \
--fea_type resnet
- Infer
python generate.py \
--ckptid SAVED_CKPT_ID \
--gpuid 0 \
--exp_set EXP_LOG \
--video 1 \
--sess 1 \
--batch_size 4
@misc{wang2023vstar,
title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions},
author={Yuxuan Wang and Zilong Zheng and Xueliang Zhao and Jinpeng Li and Yueqian Wang and Dongyan Zhao},
year={2023},
eprint={2305.18756},
archivePrefix={arXiv},
primaryClass={cs.CV}
}