SEED-Bench is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both image and video understanding.
Qwen-VL and Qwen-VL-Chat achieve state-of-the-art results on this benchmark.
Qwen-VL and Qwen-VL-Chat were not trained on any video data or video tasks, but they can understand some videos in a zero-shot way. For the video question-answering task, we uniformly sample four frames from each video. These frames are treated as separate images and stitched into the context. For example:
{
"question_id": "v0",
"prompt": "<img>video_imgs_4/v0_0.jpg</img>\n<img>video_imgs_4/v0_1.jpg</img>\n<img>video_imgs_4/v0_2.jpg</img>\n<img>video_imgs_4/v0_3.jpg</img>\nQuestion: Can you identify the action taking place in the video?\nOptions: A. pretending to take something out of something\nB. pretending to take something from somewhere\nC. feigning to insert something into something\nD. simulating putting something onto something\nAnswer:"
}
The JSON line above can be used as input to eval_mm/seed_bench/eval.py, which outputs the following result:
{"question_id": "v0", "prediction": "B"}
Please see eval_mm/seed_bench/eval.py for more inference details.
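
The frame sampling and prompt stitching described above can be sketched as follows. This is a minimal illustration, not the code in eval.py or trans.py: the helper names `uniform_frame_indices` and `build_video_prompt` are hypothetical, and the midpoint-of-segment sampling rule is an assumption, so the actual scripts may pick different frames.

```python
import json


def uniform_frame_indices(num_frames, num_samples=4):
    """Pick the midpoint frame of each of `num_samples` equal-length segments.

    Assumption: this is one common way to sample uniformly; the rule used by
    trans.py may differ.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def build_video_prompt(frame_paths, question, options):
    """Stitch frames, question and lettered options into one prompt string."""
    images = "".join(f"<img>{path}</img>\n" for path in frame_paths)
    letters = "ABCD"
    choices = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return f"{images}Question: {question}\nOptions: {choices}\nAnswer:"


# Build one JSON line in the same shape as the example above.
frames = [f"video_imgs_4/v0_{i}.jpg" for i in range(4)]
record = {
    "question_id": "v0",
    "prompt": build_video_prompt(
        frames,
        "Can you identify the action taking place in the video?",
        [
            "pretending to take something out of something",
            "pretending to take something from somewhere",
            "feigning to insert something into something",
            "simulating putting something onto something",
        ],
    ),
}
print(json.dumps(record))
```

Each frame becomes its own `<img>` tag followed by a newline, so the model sees the four frames in temporal order before the question.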
- Download all images and videos by following the instructions. Then replace the root paths in eval_mm/seed_bench/trans.py with your own paths.
# path of SEED-Bench.json, download from https://huggingface.co/datasets/AILab-CVC/SEED-Bench/blob/main/SEED-Bench.json
seed_bench_input_path = 'SEED-Bench.json'
# root directory of evaluation dimension 1-9, following https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md
cc3m_dir = "/YOUR_PATH_TO/seed_bench_image"
# root directory of evaluation dimension 10
dimension10_dir = "/YOUR_PATH_TO/SSV2/videos"
# root directory of evaluation dimension 11
dimension11_dir = "/YOUR_PATH_TO/EPIC-KITCHENS/3h91syskeag572hl6tvuovwv4d/videos/test"
# root directory of evaluation dimension 12
dimension12_dir = "/YOUR_PATH_TO/BreakfastII_15fps_qvga_sync"
- Generate the Qwen-VL input files in JSONL format.
cd eval_mm/seed_bench/
python trans.py
This script outputs two JSONL files and one directory. image_input.jsonl is the input file for image evaluation, and video_input_4.jsonl is the input file for video evaluation with 4 frames per video. The directory video_imgs_4 contains the 4 frames extracted from each video. We provide our image_input.jsonl and video_input_4.jsonl here for reference.
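
Before launching evaluation, it can help to sanity-check the generated files. The sketch below is not part of the repository: `check_input_lines` is a hypothetical helper that only relies on the `question_id` and `prompt` fields shown in the example JSON line, and reports records whose prompt does not contain the expected number of `<img>` tags.

```python
import json


def check_input_lines(lines, expected_images):
    """Return question_ids whose prompt lacks exactly `expected_images` <img> tags."""
    bad = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["prompt"].count("<img>") != expected_images:
            bad.append(record["question_id"])
    return bad


# In-memory sample; with real data, pass an open file handle instead.
sample = [
    json.dumps({"question_id": "v0", "prompt": "<img>a</img>\n" * 4 + "Question: ?"}),
    json.dumps({"question_id": "v1", "prompt": "<img>a</img>\n" * 3 + "Question: ?"}),
]
print(check_input_lines(sample, expected_images=4))
```

For the real file, `check_input_lines(open("video_input_4.jsonl"), 4)` should return an empty list if every video contributed four frames.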
- Produce the SEED-Bench results.
# The number of available GPUs
export NPROC_PER_NODE=8
# Produce the Qwen-VL-Chat results of image understanding
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
eval.py \
--checkpoint Qwen/Qwen-VL-Chat \
--dataset image_input.jsonl \
--batch-size 4 \
--num-workers 2
# Collect the result files
cat result_?.jsonl >results_chat_img.jsonl
rm result_?.jsonl
# Produce the results of video understanding
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
eval.py \
--checkpoint Qwen/Qwen-VL-Chat \
--dataset video_input_4.jsonl \
--batch-size 2 \
--num-workers 1
# Collect the result files
cat result_?.jsonl >results_chat_vid.jsonl
rm result_?.jsonl
# The file `results_chat.jsonl` can be submitted to the leaderboard
cat results_chat_img.jsonl results_chat_vid.jsonl >results_chat.jsonl
You can reproduce the SEED-Bench results of Qwen-VL by replacing Qwen/Qwen-VL-Chat with Qwen/Qwen-VL in the above script.
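
A quick check of the merged results file before submission might look like the sketch below. It is an illustration under stated assumptions: the helpers `load_predictions`, `summarize`, and `accuracy` are hypothetical, the set of valid choices A to D is inferred from the example prompt, and the `answers` mapping stands in for ground truth you would have to extract yourself.

```python
import json
from collections import Counter

# Assumption: every question has four options, as in the example prompt.
VALID_CHOICES = {"A", "B", "C", "D"}


def load_predictions(lines):
    """Parse result lines into a list of (question_id, prediction) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            pairs.append((record["question_id"], record["prediction"]))
    return pairs


def summarize(pairs):
    """Count records and list duplicated ids and out-of-range predictions."""
    counts = Counter(qid for qid, _ in pairs)
    return {
        "total": len(pairs),
        "duplicates": sorted(q for q, n in counts.items() if n > 1),
        "invalid": sorted(q for q, p in pairs if p not in VALID_CHOICES),
    }


def accuracy(pairs, answers):
    """Fraction of questions in `answers` that were predicted correctly."""
    preds = dict(pairs)
    correct = sum(1 for qid, ans in answers.items() if preds.get(qid) == ans)
    return correct / len(answers)


# In-memory sample; with real data, pass open("results_chat.jsonl") instead.
sample = [
    '{"question_id": "v0", "prediction": "B"}',
    '{"question_id": "v1", "prediction": "E"}',
    '{"question_id": "v0", "prediction": "B"}',
]
pairs = load_predictions(sample)
print(summarize(pairs))
print(accuracy(pairs, {"v0": "B", "v1": "A"}))
```

Duplicated ids or predictions outside the option letters usually point to a problem when collecting the per-GPU result files, so they are worth resolving before submitting.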