VLMEvalKit now supports MMT-Bench; the built-in functions will automatically download the dataset the first time you use it. You can also download the MMT-Bench dataset manually from the following links: HuggingFace, ModelScope.
We have provided four dataset files:
- MMT-Bench_VAL: Used for local model evaluation (10% of the samples), where multiple images in multi-image samples are combined into a single image.
- MMT-Bench_VAL_MI: Used for local model evaluation (10% of the samples), but multi-image samples are stored as separate images.
- MMT-Bench_ALL: The FULL set (100% of the samples) evaluated on this server, where multiple images in multi-image samples are combined into a single image.
- MMT-Bench_ALL_MI: Also the FULL set (100% of the samples) evaluated on this server, but multi-image samples are stored as separate images.
Note: "MI" indicates that multi-image tasks are preserved in their original format, without "MI" indicating that multi-images are combined into a single image for evaluation. The evaluation of single-image tasks remains the same in both cases. We recommend that LVLM models capable of handling multi-image inputs use the MI files (MMT-Bench_VAL_MI, MMT-Bench_ALL_MI) for evaluation, while those not supporting multi-image inputs should use the combined version (MMT-Bench_VAL, MMT-Bench_ALL) for testing.
Put the data under ~/LMUData/.
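If you download the files manually, a quick sanity check like the one below can confirm they are in the expected location. This is only a minimal sketch: the .tsv file names are assumptions and may differ from the actual release.

from pathlib import Path

# Hypothetical file names; adjust to match the files you actually downloaded.
expected = [
    "MMT-Bench_VAL.tsv",
    "MMT-Bench_VAL_MI.tsv",
    "MMT-Bench_ALL.tsv",
    "MMT-Bench_ALL_MI.tsv",
]

data_root = Path.home() / "LMUData"
for name in expected:
    status = "found" if (data_root / name).exists() else "missing"
    print(f"{name}: {status}")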
Installation.
pip install -e .
Setup Keys.
To infer with API models (GPT-4v, Gemini-Pro-V, etc.) or to use LLM APIs as the judge or choice extractor, you need to first set up API keys. If you set the key, VLMEvalKit will use a judge LLM to extract answers from the model output; otherwise it falls back to the exact matching mode (finding "Yes", "No", "A", "B", "C", ... in the output strings). Exact matching can only be applied to Yes-or-No tasks and multiple-choice tasks.
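For intuition, exact matching roughly amounts to scanning the raw output for a unique option token. The sketch below is only an illustration of the idea, not VLMEvalKit's actual extraction code:

import re

def exact_match(prediction, options=("A", "B", "C", "D", "Yes", "No")):
    # Illustrative only: return the option if exactly one of them
    # appears as a standalone token in the model output.
    hits = [opt for opt in options if re.search(rf"\b{re.escape(opt)}\b", prediction)]
    return hits[0] if len(hits) == 1 else None

print(exact_match("The answer is B."))     # -> B
print(exact_match("Either A or B fits."))  # -> None (ambiguous)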
You can place the required keys in $VLMEvalKit/.env or directly set them as environment variables. If you choose to create a .env file, its content will look like:
# The .env file, place it under $VLMEvalKit
# Alles-apin-token, for intra-org use only
ALLES=
# API Keys of Proprietary VLMs
DASHSCOPE_API_KEY=
GOOGLE_API_KEY=
OPENAI_API_KEY=
OPENAI_API_BASE=
STEPAI_API_KEY=
Fill the blanks with your API keys (if necessary). These API keys will be automatically loaded when performing inference and evaluation.
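If you want to double-check that your keys are visible to the process, a quick check like the one below works. This is a minimal sketch that loads the .env file with python-dotenv, which is not necessarily how VLMEvalKit loads the keys internally:

import os
from dotenv import load_dotenv  # pip install python-dotenv

# Load variables from a .env file in the current directory (if present),
# then confirm the judge-LLM key is available to this process.
load_dotenv()
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))
print("OPENAI_API_BASE:", os.environ.get("OPENAI_API_BASE", "<unset>"))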
VLM Configuration: All VLMs are configured in vlmeval/config.py. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model_weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in supported_VLM in vlmeval/config.py to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in vlmeval/vlm/misc to configure the LLM path and ckpt path.
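If you are unsure which names are valid, you can inspect supported_VLM directly. The snippet below is a small sketch that assumes supported_VLM maps each model name to a constructor you can call; its exact structure may vary across VLMEvalKit versions.

from vlmeval.config import supported_VLM

# List all registered model names you can pass to --model
print(sorted(supported_VLM.keys()))

# Instantiate one VLM by its registered name (loads weights as configured)
model = supported_VLM['qwen_chat']()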
The following VLMs require the configuration step:
Code Preparation & Installation: InstructBLIP (LAVIS), LLaVA (LLaVA), MiniGPT-4 (MiniGPT-4), mPLUG-Owl2 (mPLUG-Owl2), OpenFlamingo-v2 (OpenFlamingo), PandaGPT-13B (PandaGPT), TransCore-M (TransCore-M).
Manual Weight Preparation & Configuration: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B
We use run.py for evaluation. To use the script, you can run $VLMEvalKit/run.py directly or create a soft link to the script so that you can use it from anywhere.
Arguments
- --data (list[str]): Set the dataset names that are supported in VLMEvalKit (defined in vlmeval/utils/dataset_config.py).
- --model (list[str]): Set the VLM names that are supported in VLMEvalKit (defined in supported_VLM in vlmeval/config.py).
- --mode (str, default to 'all', choices are ['all', 'infer']): When mode is set to "all", will perform both inference and evaluation; when set to "infer", will only perform the inference.
- --nproc (int, default to 4): The number of threads for OpenAI API calling.
Command
You can run the script with python or torchrun:
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).
# IDEFICS-80B-Instruct on MMT-Bench, Inference and Evaluation
python run.py --data MMT-Bench_VAL --model idefics_80b_instruct --verbose
# IDEFICS-80B-Instruct on MMT-Bench, Inference only
python run.py --data MMT-Bench_VAL --model idefics_80b_instruct --verbose --mode infer
# IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2 on MMT-Bench. On a node with 8 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=8 run.py --data MMT-Bench_VAL --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
# Qwen-VL-Chat on MMT-Bench. On a node with 2 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=2 run.py --data MMT-Bench_VAL --model qwen_chat --verbose
The evaluation results will be printed as logs. Besides, result files will also be generated in the directory $YOUR_WORKING_DIRECTORY/{model_name}. Files ending with .csv contain the evaluated metrics.
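If you prefer to post-process the metrics programmatically, the result CSVs can be loaded with pandas. The path pattern below is only a placeholder, since the exact file naming depends on the model and dataset you evaluated:

import glob

import pandas as pd

# Placeholder pattern: adjust the working directory and model name to your run.
for path in glob.glob("./idefics_80b_instruct/*MMT-Bench_VAL*.csv"):
    df = pd.read_csv(path)
    print(path)
    print(df.head())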
The default setting mentioned above uses OpenAI's GPT as the judge LLM. However, you can also deploy a local judge LLM with LMDeploy.
First install:
pip install lmdeploy openai
Then deploy a local judge LLM with a single command. LMDeploy will automatically download the model from Hugging Face. Assuming we use internlm2-chat-1_8b as the judge, port 23333, and the key sk-123456 (the key must start with "sk-" and can be followed by any number you like):
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
You need to get the model name registered by LMDeploy with the following Python code:
from openai import OpenAI
client = OpenAI(
api_key='sk-123456',
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
Now set some environment variables to tell VLMEvalKit how to use the local judge LLM. As mentioned above, you can also set them in the $VLMEvalKit/.env file:
OPENAI_API_KEY=sk-123456
OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
LOCAL_LLM=<model_name you get>
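Before launching a full evaluation, you can optionally verify that the judge endpoint responds to chat requests. This is just a sanity-check sketch using the OpenAI client against the LMDeploy server started above:

from openai import OpenAI

client = OpenAI(api_key="sk-123456", base_url="http://0.0.0.0:23333/v1")
model_name = client.models.list().data[0].id

# Send a trivial chat request to confirm the judge LLM is reachable.
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(response.choices[0].message.content)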
Finally, you can run the commands in step 2 to evaluate your VLM with the local judge LLM.
Note that
- If you hope to deploy the judge LLM on a single GPU and evaluate your VLM on other GPUs because of limited GPU memory, restrict the visible devices with CUDA_VISIBLE_DEVICES=x, like:
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc-per-node=3 run.py --data HallusionBench --model qwen_chat --verbose
- If the local judge LLM is not good enough at following the instructions, the evaluation may fail. Please report such failures (e.g., by opening issues).
- It's possible to deploy the judge LLM in different ways, e.g., using a private LLM (not from HuggingFace) or a quantized LLM. Please refer to the LMDeploy docs. You can also use any other deployment framework, as long as it supports the OpenAI API.