🍅 TOMATO

📄 Paper | 🤗 Data | 🎬 Videos

This repository contains the implementation of the following paper:

🍅 TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Ziyao Shangguan*¹, Chuhan Li*¹, Yuxuan Ding¹, Yanan Zheng¹, Yilun Zhao¹, Tesca Fitzgerald¹, Arman Cohan¹²
*Equal contribution.
¹Yale University ²Allen Institute for AI

TOMATO - A Visual Temporal Reasoning Benchmark

Introduction

Our study of existing benchmarks shows that the visual temporal reasoning capabilities of Multimodal Foundation Models (MFMs) are likely overestimated, as many questions can be solved using a single frame, a few frames, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity.

Following these principles, we introduce TOMATO, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning 6 tasks (i.e. action count, direction, rotation, shape&trend, velocity&frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass 3 video scenarios (i.e. human-centric, real-world, and simulated). In the 805 self-created videos, we apply editing to incorporate counterfactual scenes, composite motions, and zoomed-in views, aiming to investigate the impact of these characteristics on the performance of MFMs.

Task Examples

What direction(s) does the Ping Pong ball rotate in?
A. Clockwise throughout.
B. No rotation.
C. Clockwise then counter-clockwise.
D. Counter-clockwise throughout.
E. Counter-clockwise then clockwise.

Answer: D. Counter-clockwise throughout.

What is the pattern of the object’s speed in the video?
A. Not moving at all.
B. Constant speed.
C. Decelerating.
D. Accelerating.

Answer: C. Decelerating.

What instruction did the person give to the camera in the video?
A. Moving Down.
B. Moving Left.
C. Moving Further.
D. Moving Closer.
E. Moving Right.
F. Moving Up.

Answer: E. Moving Right.

How many triangle(s) does the person draw in the air throughout the entire video?
A. 0
B. 1
C. 2
D. 3
E. 4
F. 5

Answer: E. 4

Analysis Highlight

Our in-depth error case analysis reveals that models lack the basic ability to interpret frames as a continuous sequence. In the example, while GPT-4o correctly generates captions for each consecutive change in the moon's movement, showcasing its ability to reason at individual time steps, it still fails to infer based on the captions that the overall sequence represents a clockwise rotation and falsely concludes that it is a counter-clockwise rotation.

For more detailed error case analysis, please refer to Section 6.3 in our paper.

Dataset and Evaluation

1. Setup

git clone https://github.com/yale-nlp/TOMATO
cd TOMATO

Download the videos and unzip them into the TOMATO directory.

After downloading the videos, your file structure should look like this.

.
├── data/
├── src/
├── videos/
│   ├── human/
│   ├── object/
│   ├── simulated/

1.1 Proprietary Models

To install the required packages for evaluating proprietary models, run:

pip install openai # GPT 
pip install google-generativeai # Gemini 
pip install anthropic # Claude
pip install reka-api==2.0.0 # Reka

Create a .env file in the root directory with the following format:

OPENAI_API_KEY="your_openai_api_key"
GEMINI_API_KEY="your_gemini_api_key"
ANTHROPIC_API_KEY="your_anthropic_api_key"
REKA_API_KEY="your_reka_api_key"
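
To catch a missing or misspelled key early, the .env file can be checked with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: `parse_env` and `missing_keys` are hypothetical helpers, not the way the repository itself loads the file, and you may only need the keys for the providers you actually evaluate.

```python
# Key names taken from the .env format above.
REQUIRED_KEYS = ["OPENAI_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY", "REKA_API_KEY"]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY="value" lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not values.get(k)]
```

For example, `missing_keys(parse_env(open(".env").read()))` lists the keys that still need to be filled in.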

1.2 Open-sourced Models

Create a directory named pretrained in the root of TOMATO to store open-sourced models. For example, to download the Qwen2-VL-7B model, run the following command:

mkdir pretrained && cd pretrained
huggingface-cli download \
  --resume-download \
  --local-dir-use-symlinks False Qwen/Qwen2-VL-7B-Instruct \
  --local-dir Qwen2-VL-7B-Instruct

After downloading open-sourced models, your file structure should look like this.

.
├── data/
├── src/
├── videos/
├── pretrained/
│   ├── Qwen2-VL-7B-Instruct/
│   ├── ...

Note: To use Video-CCAM, LLaVA-NeXT, Video-LLaVA, VideoLLaMA2, and VILA, follow additional instructions below.
Clone their repositories into the ./src/generate_lib/ directory. Run the following commands:

cd ./src/generate_lib

git clone git@github.com:QQ-MM/Video-CCAM.git             # Video-CCAM
git clone git@github.com:LLaVA-VL/LLaVA-NeXT.git          # LLaVA-NeXT
git clone git@github.com:DAMO-NLP-SG/VideoLLaMA2.git      # VideoLLaMA2
git clone git@github.com:PKU-YuanGroup/Video-LLaVA.git    # Video-LLaVA
git clone git@github.com:NVlabs/VILA.git                  # VILA

After cloning, rename the directories by replacing hyphens (-) with underscores (_):

mv Video-CCAM Video_CCAM
mv LLaVA-NeXT LLaVA_NeXT
mv Video-LLaVA Video_LLaVA

2. Evaluation

To run evaluation with a model:

python src/evaluate.py \
  --model $model_name \
  --reasoning_type ALL \
  --demonstration_type ALL \
  --total_frames $total_frames

All supported models are listed here. To evaluate additional models, please refer to the next section.

This is a list of models that take in videos directly; for these models, any specified total_frames will be ignored.

You can specify a subset of reasoning_type and demonstration_type using a comma-separated list. These are the lists of valid choices.
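
As an illustration of how a comma-separated value such as the one passed to --demonstration_type might be validated, here is a minimal sketch. It is an assumption about the behaviour, not the code in src/evaluate.py: `parse_choice_list` is a hypothetical helper, and the valid choices are passed in by the caller rather than hard-coded.

```python
def parse_choice_list(raw: str, valid: list[str]) -> list[str]:
    """Expand "ALL" or split a comma-separated list, rejecting unknown items."""
    if raw.strip().upper() == "ALL":
        return list(valid)
    chosen = [item.strip() for item in raw.split(",") if item.strip()]
    unknown = [c for c in chosen if c not in valid]
    if unknown:
        raise ValueError(f"Unknown choice(s): {unknown}; valid choices: {valid}")
    return chosen
```

With a hypothetical list of valid values `["human", "object", "simulated"]`, the input `"human, simulated"` yields `["human", "simulated"]`, and `"ALL"` yields the full list.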

3. Result Parsing

When our standard regular-expression parser fails, we use GPT-4o-mini to extract answers from the model response. To use the parser:

python src/parse_result.py

Note: This parser is designed to be incremental. It only parses additional raw model responses while leaving the already parsed results unchanged.
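
To make the two-stage idea concrete, the sketch below shows one way a regular-expression pass could pick out an option letter and signal when a fallback is needed. It is not the parser in src/parse_result.py: `extract_choice` is a hypothetical function, and the shapes of `all_choices` (a list of letters) and `index2ans` (letter to option text) are assumptions based on the field names in the saved responses.

```python
import re
from typing import Optional


def extract_choice(response: str, all_choices: list[str],
                   index2ans: dict[str, str]) -> Optional[str]:
    """Return the chosen option letter, or None if the response is ambiguous."""
    letters = "".join(re.escape(c) for c in all_choices)
    # 1) Look for a stand-alone option letter, e.g. "Answer: C." or "(C)".
    pattern = rf"(?<![A-Za-z])([{letters}])(?![A-Za-z])"
    found = set(re.findall(pattern, response))
    if len(found) == 1:
        return found.pop()
    # 2) Otherwise try to match the option text itself.
    lowered = response.lower()
    text_hits = [k for k, v in index2ans.items() if v.lower() in lowered]
    if len(text_hits) == 1:
        return text_hits[0]
    # 3) Ambiguous: a caller could hand this response to an LLM-based parser.
    return None
```

A return value of `None` marks the responses that would need the second-stage parser. Note that a stand-alone article such as "A" would also be read as an option letter, so real responses may need a stricter pattern.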

4. Display Categorized Scores

Scores are grouped by model, reasoning_type+model, and demonstration_type+model. To display scores:

python src/get_categorized_score.py
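
The grouping described above amounts to computing accuracy over different key combinations. The following is a minimal sketch of that idea, not the repository's script: `accuracy_by` is a hypothetical helper, and the record fields (`model`, `reasoning_type`, `correct`) are assumed names for illustration.

```python
from collections import defaultdict


def accuracy_by(records: list[dict], keys: list[str]) -> dict[tuple, float]:
    """Percentage accuracy for each distinct combination of `keys`."""
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group][0] += int(record["correct"])
        totals[group][1] += 1
    return {group: 100.0 * c / n for group, (c, n) in totals.items()}
```

Calling `accuracy_by(records, ["model"])` gives one score per model, while `accuracy_by(records, ["reasoning_type", "model"])` gives one score per reasoning type and model pair.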

Evaluate Additional Models

Our evaluation scripts are designed to make adding new models easy. Simply follow the two steps below:

1. Add Model to Config File

Add model_family and model_name to src/config.json like below:

{
    "models": {
        "{model_family}": [
            "{model_name}",
            "..."
        ]
    }
}
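
Given this structure, finding the family that a model name belongs to is a simple lookup. The sketch below assumes only the "models" mapping shown above; `family_of` is a hypothetical helper, and whether src/evaluate.py resolves models this way is an assumption.

```python
import json
from typing import Optional


def family_of(config: dict, model_name: str) -> Optional[str]:
    """Return the model_family whose list contains `model_name`, else None."""
    for family, names in config["models"].items():
        if model_name in names:
            return family
    return None


# Usage (path from the text above):
#   with open("src/config.json") as f:
#       config = json.load(f)
#   family_of(config, "your_model_name")
```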

2. Add Model Evaluation Code

Create the corresponding {model_family}.py file under src/generate_lib with the starter code below:

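import json

from generate_lib.constant import GENERATION_TEMPERATURE, GENERATION_TOP_P, SYSTEM_PROMPT, MAX_TOKENS, GENERATION_SEED
from generate_lib.construct_prompt import construct_prompt
from generate_lib.utils import read_video

def generate_response(model_name: str, queries: list, total_frames: int, output_dir: str):
    # initialize your model
    model = ...

    for query in queries:
        id_ = query['id']
        question = query['question']

        # NOTE: the starter code does not show how these two values are obtained;
        # fill them in from your query record.
        options = ...     # list of answer option strings for this question
        video_path = ...  # path to this question's video under videos/

        # Assumed format: lettered options such as "A. Clockwise throughout."
        optionized_list = [f"{chr(65 + i)}. {option}" for i, option in enumerate(options)]
        gt = optionized_list[query['answer']]

        # construct prompt
        base64Frames, _ = read_video(video_path=video_path, total_frames=total_frames)
        prompt, all_choices, index2ans = construct_prompt(question=question,
                                                          options=options,
                                                          num_frames=total_frames)

        # generate response
        response = model(...)

        # save model response in default format to use our result parser
        # (output_dir is opened as a file; one JSON object is appended per line)
        with open(output_dir, "a") as f:
            f.write(json.dumps(
                {
                    "id": id_,
                    "question": question,
                    "response": response,
                    "all_choices": all_choices,
                    "index2ans": index2ans,
                    "gt": gt
                }
            ) + "\n")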
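
Before handing a new model's output to the result parser, it can help to confirm that every saved line carries the fields written by the starter code. This is a minimal sketch: `load_responses` is a hypothetical helper, and the field set is copied from the JSON object in the starter code above.

```python
import json

# Field names copied from the JSON object written by the starter code.
REQUIRED_FIELDS = {"id", "question", "response", "all_choices", "index2ans", "gt"}


def load_responses(path: str) -> list[dict]:
    """Read a JSON-lines file and check that each record has the expected fields."""
    records = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            records.append(record)
    return records
```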

Experiments

1. Comparison with Existing Benchmarks

1.1 Multi-Frame Gain ($\kappa$): a higher value indicates the task is less solvable by a single frame.

1.2 Frame Order Sensitivity ($\tau$): a higher value indicates the task is more reliant on the correct order of frames.

1.3 Frame Information Disparity ($\rho$): a lower value indicates information is more evenly distributed across the frames.

2. Leaderboard

We evaluate general-purpose MFMs on TOMATO, with all models tested in a zero-shot setting. The scores below are reported as percentage accuracy (%).

Contact

If you have any questions or suggestions, please don't hesitate to let us know. You can post an issue on this repository, or contact us directly at:

Ziyao Shangguan: ziyao.shangguan@yale.edu
Chuhan Li: chuhan.li.cl2575@yale.edu

Citation

If you find 🍅TOMATO useful for your research and applications, please cite it using this BibTeX:

@misc{shangguan2024tomatoassessingvisualtemporal,
      title={TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models}, 
      author={Ziyao Shangguan and Chuhan Li and Yuxuan Ding and Yanan Zheng and Yilun Zhao and Tesca Fitzgerald and Arman Cohan},
      year={2024},
      eprint={2410.23266},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.23266}, 
}
