AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucination in MLLMs. It covers both the generative task and the discriminative task, including the existence, attribute, and relation hallucination dimensions, and provides fine-grained annotations together with an automated evaluation pipeline. The paper reports the data statistics, the object distribution, and the results of mainstream MLLMs evaluated with AMBER.
- 🔥 [11.17] Our data and annotations are available!
- [11.14] Our paper is available at LINK.
1. `spacy` is used for near-synonym judgment:

```shell
pip install -U spacy
python -m spacy download en_core_web_lg
```

2. `nltk` is used for object extraction:

```shell
pip install nltk
```
Download the images from this LINK.
| json file | Task or Dimension | Evaluation args |
|---|---|---|
| query_all.json | All the tasks and dimensions | a |
| query_generative.json | Generative task | g |
| query_discriminative.json | Discriminative task | d |
| query_discriminative-existence.json | Existence dimension | de |
| query_discriminative-attribute.json | Attribute dimension | da |
| query_discriminative-relation.json | Relation dimension | dr |
For the generative task (1 <= id <= 1004), the format of responses is:

```json
[
    {
        "id": 1,
        "response": "The description of AMBER_1.jpg from MLLM."
    },
    ...
    {
        "id": 1004,
        "response": "The description of AMBER_1004.jpg from MLLM."
    }
]
```
For the discriminative task (id >= 1005), the format of responses is:

```json
[
    {
        "id": 1005,
        "response": "Yes" or "No"
    },
    ...
    {
        "id": 15220,
        "response": "Yes" or "No"
    }
]
```
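As a minimal sketch, the response file can be assembled with a few lines of Python. The `my_mllm` function below is a hypothetical stand-in for your own model's inference call, and `my_responses.json` is a placeholder output path; the id ranges follow the formats above:

```python
import json

def my_mllm(query_id: int) -> str:
    # Hypothetical stand-in for a real MLLM call; replace with your model's
    # inference on the corresponding AMBER image and query.
    if query_id <= 1004:
        return f"The description of AMBER_{query_id}.jpg from MLLM."
    return "Yes"

# Generative task ids: 1..1004; discriminative task ids: 1005..15220.
responses = [{"id": i, "response": my_mllm(i)} for i in range(1, 15221)]

with open("my_responses.json", "w") as f:
    json.dump(responses, f, indent=2)
```

The resulting file can then be passed to the evaluation script via `--inference_data`.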
```shell
python inference.py --inference_data path/to/your/inference/file --evaluation_type {Evaluation args}
```
If you find this work useful, please consider giving this repository a star and citing our paper as follows:
```bibtex
@article{wang2023llm,
  title={An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation},
  author={Wang, Junyang and Wang, Yuhang and Xu, Guohai and Zhang, Jing and Gu, Yukai and Jia, Haitao and Yan, Ming and Zhang, Ji and Sang, Jitao},
  journal={arXiv preprint arXiv:2311.07397},
  year={2023}
}
```