M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought


[ArXiv] | [🤗HuggingFace] | [Website]

🌟 Any contributions via PRs, issues, emails or other methods are greatly appreciated.

🔥 News

💡 Motivation

Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, and has gained increasing attention. Nevertheless, current MCoT benchmarks still face several challenges: (1) absence of visual-modal reasoning, (2) only single-step visual-modal reasoning, and (3) missing domains, all of which hinder the development of MCoT. Motivated by this, we introduce a novel benchmark (M3CoT) to address these challenges, advancing multi-domain, multi-step, and multi-modal CoT. We further conduct a thorough evaluation of abundant MCoT approaches on Vision Large Language Models (VLLMs), and highlight that current VLLMs still struggle to reason correctly in M3CoT: despite their superior results on previous MCoT benchmarks, a large gap remains between existing VLLMs and human performance. To our knowledge, we take the first meaningful step toward the multi-domain, multi-step, and multi-modal scenario in MCoT. We hope that M3CoT can serve as a valuable resource, providing a pioneering foundation for multi-domain, multi-step, multi-modal chain-of-thought research.

🎯 Installation

1. Dataset Preparation

Load Dataset from HuggingFace

import datasets
dataset = datasets.load_dataset("LightChen2333/M3CoT")
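
After loading, you can inspect the splits and individual examples. A minimal sketch continuing from the snippet above; the split name is an assumption, and the field names are inferred from the custom-evaluation JSONL format later in this README:

# "train" is an assumed split name; check dataset.keys() for the actual splits.
sample = dataset["train"][0]
# Fields such as id, choices, answer, domain, and topic mirror the
# custom-evaluation JSONL format shown later in this README.
print(sample["id"], sample["domain"], sample["topic"])
print(sample["choices"], sample["answer"])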

Load Dataset from Google Drive

Please download the dataset from Here and place the unzipped content in the data folder.

import datasets
dataset = datasets.load_dataset("data/m3cot.py")

We also encourage you to use our M3CoT class to better manage and analyze the data. The class supports two initialization formats:

import datasets
from utils.data import M3CoT
dataset = datasets.load_dataset("data/m3cot.py")
prepared_dataset = M3CoT(dataset=dataset)

Or, equivalently:

from utils.data import M3CoT
prepared_dataset = M3CoT(data_path="data")

2. Install from git

M3CoT requires Python>=3.10 and torch>=2.0.

git clone https://github.com/LightChen233/M3CoT.git && cd M3CoT/
pip install -r requirements.txt
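
To verify that your environment meets these requirements, a quick sanity check (plain Python tooling, nothing repository-specific):

python --version                                      # expect 3.10 or newer
python -c "import torch; print(torch.__version__)"    # expect 2.0 or newer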

3. Evaluation for reproduction

python evaluate.py --setting zero-shot \
                   --model gpt4v \
                   --prompt cot \
                   --metric_by topic

where --setting can be selected from [zero-shot, few-shot, tool-usage], and --metric_by can be selected from [topic, domain, all].

For the zero-shot setting (an example combining these flags follows the list):

  • --model can be selected from [kosmos-2, cogvlm, gemini, gpt4v, instruct-blip-7b, instruct-blip-13b, llava-7b, llava-13b, openflamingo]
  • --prompt can be selected from [direct, cot, ccot, dsp]
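
For example, to evaluate llava-13b with the ccot prompt and report metrics per domain, using only flag values from the lists above:

python evaluate.py --setting zero-shot \
                   --model llava-13b \
                   --prompt ccot \
                   --metric_by domain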

4. Evaluation of your own results

python evaluate.py --setting custom \
                   --metric_path [JSONL_PATH]

Each line of the JSONL file must follow this format:

{
  "id": "[ID]",
  "choices": ["[CHOICE1]", "[CHOICE2]", ...],
  "answer": "A/B/C/...",
  "domain": "[DOMAIN]",
  "topic": "[TOPIC]",
  "messages": [
    "[QUESTION]",
    "[ANSWER]"
  ]
}
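
As a sketch, the snippet below writes predictions in this format and then runs the custom evaluation. The file name predictions.jsonl and all concrete field values are placeholders, not part of the repository:

import json

# Hypothetical model outputs; replace with your own results.
predictions = [
    {
        "id": "example-0",                      # placeholder ID
        "choices": ["cat", "dog", "bird"],      # placeholder options
        "answer": "A",                          # predicted option letter
        "domain": "science",                    # placeholder domain
        "topic": "biology",                     # placeholder topic
        "messages": [
            "Which animal is shown in the image?",  # question
            "The answer is (A) cat.",               # model answer
        ],
    },
]

# Write one JSON object per line, as the format above requires.
with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")

Then evaluate it with:

python evaluate.py --setting custom \
                   --metric_path predictions.jsonl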

🖨️ File Structure

root
├── data           # Data folder where the dataset is loaded
├── experiment     # All experimental data
│   ├── zero-shot         # Results under the zero-shot setting; one subfolder per model, each holding results for the three prompts
│   ├── few-shot          # Results under the few-shot setting
│   └── tool-usage        # Results under the tool-usage setting
├── utils          # Utility library
│   ├── common_tool.py    # Common utility functions
│   ├── data.py           # Dataset loading class
│   ├── gemini_request.py # Gemini request tool
│   ├── image_tool.py     # Image processing functions
│   └── metric.py         # Metric calculation tool
├── scripts
│   ├── load_dataset.py   # Example script to load the dataset
│   └── parse_to_sqa_format.py   # Converts the dataset to ScienceQA format
└── evaluate.py    # Evaluation script

✒️ Reference

If you find this project useful for your research, please consider citing the following paper:

@inproceedings{chen-etal-2024-m3cot,
    title = "M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought",
    author = "Chen, Qiguang  and
      Qin, Libo  and
      Zhang, Jin  and
      Chen, Zhi  and
      Xu, Xiao  and
      Che, Wanxiang",
    booktitle = "Proc. of ACL",
    year = "2024",
}

📲 Contact

Please create GitHub issues here or email Qiguang Chen if you have any questions or suggestions.
