Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks Download Link HERE
We release Youku-mPLUG, the largest public high-quality Chinese video-language dataset (10 million video-text pairs), collected from Youku, a well-known Chinese video-sharing website, under strict criteria of safety, diversity, and quality.
Examples of video clips and titles in the proposed Youku-mPLUG dataset.
We provide 3 downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The 3 tasks are:
- Video Category Prediction: given a video and its corresponding title, predict the category of the video.
- Video-Text Retrieval: given a set of videos and texts, retrieve the matching text for each video and the matching video for each text.
- Video Captioning: given a video, describe its content.
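The standard metrics for the first two tasks (top-1 accuracy for category prediction, Recall@K for retrieval) can be computed from model scores as in the minimal sketch below. The similarity matrix and labels are toy values for illustration, not from the dataset:

```python
import numpy as np

def top1_accuracy(pred_labels, true_labels):
    """Fraction of videos whose predicted category matches the ground truth."""
    return float(np.mean(np.asarray(pred_labels) == np.asarray(true_labels)))

def recall_at_k(sim, k):
    """Recall@K for video-to-text retrieval: sim[i, j] scores video i against
    text j, and the matching text for video i is assumed to be text i."""
    ranks = np.argsort(-sim, axis=1)  # texts sorted by descending score per video
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 videos, 3 texts, the diagonal is the correct pairing.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.3, 0.4, 0.6]])
print(recall_at_k(sim, 1))  # 1.0: every video ranks its own text first
print(top1_accuracy([2, 0, 1], [2, 0, 0]))
```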
The dataset contains 10 million videos in total, all of high quality and distributed across 20 super-categories and 45 categories.
The distribution of categories in Youku-mPLUG dataset.
You can download all the videos and annotation files through this link.
Note: Due to a bug in megatron_util, after installing megatron_util, it is necessary to replace conda/envs/youku/lib/python3.10/site-packages/megatron_util/initialize.py with the initialize.py in the current directory.
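If your environment does not live at the path above, one way to locate the installed copy of megatron_util and apply the replacement is the sketch below (it assumes the patched initialize.py sits in the current directory, as the note says):

```shell
# Find where megatron_util is installed and overwrite its initialize.py
# with the patched copy from the current directory.
TARGET_DIR=$(python -c "import importlib.util, os; print(os.path.dirname(importlib.util.find_spec('megatron_util').origin))")
cp ./initialize.py "${TARGET_DIR}/initialize.py"
```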
conda env create -f environment.yml
conda activate youku
pip install megatron_util==1.3.0 -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
# For caption evaluation
apt-get install default-jre
First you should download the GPT-3 1.3B & 2.7B checkpoints from ModelScope. The pre-trained models can be downloaded Here (1.3B) and Here (2.7B).
Run the pre-training of mPLUG-Video as follows:
exp_name='pretrain/gpt3_1.3B/pretrain_gpt3_freezeGPT_youku_v0'
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
--nnodes=$WORLD_SIZE \
--node_rank=$RANK \
--use_env run_pretrain_distributed_gpt3.py \
--config ./configs/${exp_name}.yaml \
--output_dir ./output/${exp_name} \
--enable_deepspeed \
--bf16 \
2>&1 | tee ./output/${exp_name}/train.log
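The launch command above expects the usual torch.distributed rendezvous variables to be set. For a single-node, 8-GPU run they can be set like this (the values are illustrative defaults, not mandated by the repo):

```shell
# Single-node rendezvous settings (adjust for multi-node jobs).
export MASTER_ADDR=127.0.0.1   # address of the rank-0 node
export MASTER_PORT=29500       # any free TCP port
export WORLD_SIZE=1            # number of nodes
export RANK=0                  # this node's rank
```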
To perform downstream fine-tuning, we take Video Category Prediction as an example:
exp_name='cls/cls_gpt3_1.3B_youku_v0_sharp_2'
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
--nnodes=$WORLD_SIZE \
--node_rank=$RANK \
--use_env downstream/run_cls_distributed_gpt3.py \
--config ./configs/${exp_name}.yaml \
--output_dir ./output/${exp_name} \
--enable_deepspeed \
--resume path/to/1_3B_mp_rank_00_model_states.pt \
--bf16 \
2>&1 | tee ./output/${exp_name}/train.log
Below we show the results on the validation sets for reference.
We build the mPLUG-Video model based on mPLUG-Owl. To use the model, first clone the mPLUG-Owl repo:
git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl
The instruction-tuned checkpoint is available on HuggingFace. For fine-tuning the model, you can refer to the mPLUG-Owl repo. To perform video inference, you can use the following code:
import torch
from transformers import AutoTokenizer
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-youku-bloomz-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
    device_map={'': 0},
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

# We use a human/AI template to organize the context as a multi-turn conversation.
# <|video|> denotes a video placeholder. The example question asks (in Chinese):
# "What is the woman in the video doing?"
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <|video|>
Human: 视频中的女人在干什么?
AI: ''']
video_list = ['yoga.mp4']

# Generation kwargs (the same as in transformers) can be passed to generate()
generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}

inputs = processor(text=prompts, videos=video_list, num_frames=4, return_tensors='pt')
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)
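If you want to ask several questions about the same video, the conversation template above can be assembled with a small helper. This is a hypothetical convenience function, not part of the repo:

```python
SYSTEM = ("The following is a conversation between a curious human and AI assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.")

def build_prompt(question: str) -> str:
    """Wrap a single question about one video in the human/AI template.
    <|video|> is the placeholder the processor replaces with video features."""
    return f"{SYSTEM}\nHuman: <|video|>\nHuman: {question}\nAI: "

prompts = [build_prompt(q) for q in ["What is the woman doing?", "Where is she?"]]
```

Each prompt is then paired with the same entry in `video_list` when passed to the processor.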
If you find this dataset useful for your research, please consider citing our paper.
@misc{xu2023youku_mplug,
      title={Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks},
      author={Haiyang Xu and Qinghao Ye and Xuan Wu and Ming Yan and Yuan Miao and Jiabo Ye and Guohai Xu and Anwen Hu and Yaya Shi and Chenliang Li and Qi Qian and Maofei Que and Ji Zhang and Xiao Zeng and Fei Huang},
      year={2023},
      eprint={2306.04362},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}