## Run Evaluation for SEED-Bench-2

The evaluation metric is provided in eval.py. We use InternLM_Xcomposer_VL as an example. To run the following evaluation code, please refer to the InternLM_Xcomposer_VL repository for environment preparation.

```shell
python eval.py --model InternLM_Xcomposer_VL --anno_path SEED-Bench_v2_level1_2_3.json --output-dir results --evaluate_level L2 --evaluate_part all --evaluate_version v2
```

Upon completion of the evaluation, the results will be available as 'results.json' in the 'results' folder.

If you want to evaluate your own models, please provide an interface similar to InternLM_Xcomposer_VL_interface.py, llava_v2_interface.py (for LLaVA 1.5), or qwen_vl_chat_interface.py (for Qwen-VL-Chat); an illustrative sketch follows below.
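
The exact class and method names expected by eval.py are defined by the interface files listed above; the outline below is only a minimal sketch of the idea, with hypothetical names, assuming the evaluator needs a per-choice loss (or likelihood) given an image and a question.

```python
# Hypothetical outline of a custom model interface. The class and method
# names below are illustrative assumptions, not the repository's actual API;
# mirror instruct_blip_interface.py or llava_v2_interface.py for the exact
# contract expected by eval.py.
import torch


class MyModelInterface:
    """Wraps a multimodal model so the evaluator can score answer candidates."""

    def __init__(self, model, tokenizer, device="cuda"):
        self.model = model.to(device).eval()
        self.tokenizer = tokenizer
        self.device = device

    @torch.no_grad()
    def choice_loss(self, image, question, choice):
        """Return the LM loss of generating `choice` given `image` and `question`.

        Implement this with your model's own forward pass; a lower loss means
        the choice is more likely under the model.
        """
        raise NotImplementedError

    @torch.no_grad()
    def predict(self, image, question, choices):
        """Rank the candidate answers by loss and return the index of the best one."""
        losses = torch.tensor([self.choice_loss(image, question, c) for c in choices])
        return int(losses.argmin())
```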

## Run Evaluation for SEED-Bench-1

The evaluation metric is provided in eval.py. We use InstructBLIP as an example. To run the following evaluation code, please refer to the InstructBLIP repository for environment preparation.

```shell
python eval.py --model instruct_blip --anno_path SEED-Bench.json --output-dir results --task all
```

After the evaluation is finished, you will obtain the accuracy for each evaluation dimension, as well as 'results.json' in the 'results' folder, which can be submitted to the SEED-Bench Leaderboard.

If you want to evaluate your own models, please provide an interface similar to instruct_blip_interface.py.

Note that to evaluate models on multiple-choice questions, we adopt the answer ranking strategy following GPT-3. Specifically, for each choice of a question, we compute the likelihood that the model generates the content of that choice given the question. We select the choice with the highest likelihood as the model's prediction. This evaluation strategy does not rely on the instruction-following capability of models to output 'A', 'B', 'C', or 'D'.
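
As a minimal, text-only illustration of this ranking strategy (the benchmark itself scores choices with multimodal models conditioned on the image; GPT-2 and the toy prompt below are stand-ins chosen only to show the likelihood computation):

```python
# Text-only sketch of answer ranking: score each candidate answer by the
# log-likelihood its tokens receive from the model, then pick the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def choice_log_likelihood(question: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    choice_ids = tokenizer(" " + choice, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
    logits = model(input_ids).logits
    # Log-probs of each choice token, predicted from the preceding context.
    log_probs = torch.log_softmax(logits[0, prompt_ids.size(1) - 1:-1], dim=-1)
    token_scores = log_probs.gather(1, choice_ids[0].unsqueeze(1)).squeeze(1)
    return token_scores.sum().item()


question = "Question: What color is the sky on a clear day? Answer:"
choices = ["blue", "green", "red", "purple"]
scores = [choice_log_likelihood(question, c) for c in choices]
print(choices[int(torch.tensor(scores).argmax())])  # choice with highest likelihood
```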