This repo contains the evaluation code for the paper CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning .
We release the validation set of CMMU, you can download it from here. The test set will be hosted on the flageval platform. Users can test by uploading their models.
CMMU is a novel multi-modal benchmark designed to evaluate domain-specific knowledge across seven foundational subjects: math, biology, physics, chemistry, geography, politics, and history. It comprises 3603 questions, incorporating text and images, drawn from a range of Chinese exams. Spanning primary to high school levels, CMMU offers a thorough evaluation of model capabilities across different educational stages.
We currently evaluated 10 models on CMMU. The results are shown in the following table.
Evaluate fill-in-the-blank questions by GPT-4
Model | Val Avg. | Test Avg. |
---|---|---|
InstructBLIP-13b | 0.39 | 0.48 |
CogVLM-7b | 5.55 | 4.90 |
ShareGPT4V-7b | 7.95 | 7.63 |
mPLUG-Owl2-7b | 8.69 | 8.58 |
LLava-1.5-13b | 11.36 | 11.96 |
Qwen-VL-Chat-7b | 11.71 | 12.14 |
Intern-XComposer-7b | 17.87 | 18.42 |
Gemini-Pro | 21.58 | 22.50 |
Qwen-VL-Plus | 27.51 | 27.73 |
GPT-4V | 30.19 | 30.91 |
Evaluate fill-in-the-blank questions by rules
Model | Val Avg. | Test Avg. |
---|---|---|
InstructBLIP-13b | 0.04 | 0.00 |
CogVLM-7b | 4.02 | 3.73 |
ShareGPT4V-7b | 5.85 | 5.07 |
mPLUG-Owl2-7b | 7.08 | 6.85 |
LLava-1.5-13b | 8.17 | 8.06 |
Qwen-VL-Chat-7b | 7.73 | 8.28 |
Intern-XComposer-7b | 16.95 | 16.30 |
Gemini-Pro | 18.77 | 17.87 |
Qwen-VL-Plus | 21.19 | 21.34 |
GPT-4V | 24.73 | 25.23 |
from eval.cmmu_dataset import CmmuDataset
# CmmuDataset will load *.jsonl files in data_root
dataset = CmmuDataset(data_root=your_path_to_cmmu_dataset)
About fill-in-the-blank questions
For fill-in-the-blank questions, CmmuDataset
will generate new questions by sub_question
, for example:
The original question is:
{
"type": "fill-in-the-blank",
"question_info": "question",
"id": "subject_1234",
"sub_questions": ["sub_question_0", "sub_question_1"],
"answer": ["answer_0", "answer_1"]
}
Converted questions are:
[
{
"type": "fill-in-the-blank",
"question_info": "question" + "sub_question_0",
"id": "subject_1234-0",
"answer": "answer_0"
},
{
"type": "fill-in-the-blank",
"question_info": "question" + "sub_question_1",
"id": "subject_1234-1",
"answer": "answer_1"
}
]
About ShiftCheck
The parameter shift_check
is True
by default, you can get more information about shift_check
in our technical report.
CmmuDataset
will generate k new questions by shift_check
, their ids are {original_id}-k
.
The output format should be a list of json dictionaries, the required key is as follows:
{
"question_id": "question id",
"answer": "answer"
}
Here is result generated by GPT-4.
Current code call gpt4 API by AzureOpenAI
, maybe you need to modify eval/chat_llm.py
to create your own client, and before run evaluation, you need to set environment variables like AZURE_OPENAI_API_KEY
and AZURE_OPENAI_ENDPOINT
.
Run
python eval/evaluate.py --result your_pred_file --data-root your_path_to_cmmu_dataset
NOTE: We evaluate fill-in-the-blank questions using GPT-4 by default. If you do not have access to GPT-4, you can attempt to use a rule-based method to fill in the blanks. However, be aware that the results might differ from the official ones.
python eval/evaluate.py --result your_pred_file --data-root your_path_to_cmmu_dataset --gpt none
To evaluate specific type of questions, you can use --qtype
parameter, for example:
python eval/evaluate.py --result example/gpt4v_results_val.json --data-root your_path_to_cmmu_dataset --qtype fbq mrq
Detailed evaluation results are saved in *_result.json
BibTeX:
@article{he2024cmmu,
title={CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning},
author={Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu and Hua Huang},
journal={arXiv preprint arXiv:2401.14011},
year={2024},
}