This repository contains the evaluation scripts for the paper "A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video".
First, edit docker-compose.yml so that the mounted directory points to the directory where this repository was cloned. You may set the working directory to any directory, but do not change the target directory in docker-compose.yml.
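For reference, the volume entry to edit typically looks like the excerpt below. This is only a sketch: the service name is illustrative, and the container-side target (/work, as used in the commands that follow) must stay as shipped; change only the host-side path.

```yaml
# docker-compose.yml (excerpt) -- a sketch; the actual service name may differ.
services:
  eval:                                  # illustrative service name
    volumes:
      - /path/to/this/repository:/work   # change only the host-side path;
                                         # keep the target directory (/work) unchanged
```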
docker-compose up -d
docker exec -it $container_name bash
cd /work
huggingface-cli download tohoku-nlp/multi-vidsum-eval --local-dir . --local-dir-use-symlinks False --repo-type dataset
tar -xzvf downloads.tar.gz --no-same-owner
You can evaluate the results with the following commands.
cd eval_reranking
RESULT_FILE=/work/downloads/inference_results/sample_result.json
PRETRAINED_MODEL_NAME=vid2seq
TAG=sample
bash eval.sh $RESULT_FILE $PRETRAINED_MODEL_NAME $TAG
- RESULT_FILE: JSON file of the inference results.
- PRETRAINED_MODEL_NAME: name of the pretrained model used for captioning. See eval_reranking/eval.sh for details.
- TAG: tag for the result. Specify any string.
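If you want to confirm that RESULT_FILE parses as JSON before launching eval.sh, a minimal check such as the one below can help. This is a sketch, not part of the repository's scripts; the helper name and the sample file written here are illustrative.

```python
import json

def is_valid_json(path):
    """Return True if the file at `path` parses as JSON, else False."""
    try:
        with open(path) as f:
            json.load(f)
        return True
    except (OSError, ValueError):
        return False

# Demo with a throwaway file; replace the path with your RESULT_FILE.
with open("sample_result.json", "w") as f:
    json.dump({"results": []}, f)

print(is_valid_json("sample_result.json"))  # True
```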