We present CCoT, a novel Compositional Chain-of-Thought prompting method that uses scene-graph representations to extract compositional knowledge from an LMM. We find that this approach improves LMM performance not only on several compositional benchmarks but also on general multimodal benchmarks.
A more thorough discussion of our work can be found in our paper.
The first step in our prompting method is to generate a scene graph given both the image and the textual task as context. The answer is then extracted by prompting the LMM with the image, scene graph, question, and answer extraction prompt. Prompt sections unique to our method are shown in bold in the above figure. Incorporating the scene graph in the prompt eliminates the need for fine-tuning and prevents forgetting. Another benefit of our method is that generated scene graphs can describe any visual scene, which makes CCoT applicable to a wide range of VL tasks. Finally, because the generated scene graphs are compact linguistic representations of images, CCoT is a token-efficient prompting method. This is significant given the limited textual context lengths that LMMs often face due to processing both image and text inputs.
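The two-step flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `query_lmm` is a hypothetical callable standing in for whichever model backend you use, the function names are invented for this example, and the prompt wording is a placeholder rather than the exact prompts used in our scripts.

```python
from typing import Callable

# Hypothetical backend signature: (image_path, prompt_text) -> model response text.
QueryFn = Callable[[str, str], str]


def build_scene_graph_prompt(question: str) -> str:
    """Step 1 prompt: ask for a scene graph conditioned on the task (placeholder wording)."""
    return (
        f"{question}\n"
        "For the provided image and the question above, generate a scene graph "
        "in JSON format that lists the relevant objects, their attributes, "
        "and the relationships between them."
    )


def build_answer_prompt(scene_graph: str, question: str) -> str:
    """Step 2 prompt: supply the generated scene graph alongside the question."""
    return (
        f"Scene graph:\n{scene_graph}\n\n"
        "Use the image and the scene graph as context to answer the question.\n"
        f"{question}"
    )


def ccot_answer(image_path: str, question: str, query_lmm: QueryFn) -> str:
    """Run both steps: generate a scene graph, then extract the answer."""
    # Step 1: the model sees the image and the task, and returns a scene graph.
    scene_graph = query_lmm(image_path, build_scene_graph_prompt(question))
    # Step 2: the model sees the image again, now with the scene graph in the prompt.
    return query_lmm(image_path, build_answer_prompt(scene_graph, question))
```

Keeping the backend behind a single callable means the same two prompts can be reused across models by swapping only `query_lmm`.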
Note that because our method is a zero-shot prompting method and makes use of the codebase of its respective LMM, there is ample flexibility when applying it to your particular model and use case. As such, you may find it easier to simply use the general methodology shown in our figure and outlined in our scripts with a different prompt, implementation, and evaluation methodology to suit your needs.
Please retrieve all datasets from their respective official websites or repositories. We do provide the filtered .jsonl containing just the SEEDBench-Image data points in our data folder.
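A JSON Lines file stores one JSON object per line, so the provided file can be inspected with the standard library alone. The sketch below makes no assumption about the record fields; `read_jsonl` is a hypothetical helper name, and the path you pass it should be the actual file in the data folder.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Wrapping the call in `list(...)` loads every record at once; iterating directly keeps memory use low for larger files.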
- First, clone the official LLaVA repository.
git clone https://github.com/haotian-liu/LLaVA.git
- Follow the basic installation steps outlined in the repository.
- Complete the Evaluation setup outlined in the repository.
- Replace the corresponding scripts (Python and Bash scripts where necessary) with those in our repository here.
Note: Some users have reported issues with input processing when cloning the repo directly. This is likely because the LLaVA-1.6 update changed the way inputs to the model are handled. One way to remedy this is to check out the most recent commit from before the LLaVA-1.6 update.
- Install the openai library:
pip install openai
- Set your OpenAI API key:
export OPENAI_API_KEY=
- Run the script for your desired dataset.
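Because the `openai` library reads `OPENAI_API_KEY` from the environment by default, a missing key usually surfaces only when the first request is made. The check below is an optional sketch for failing early instead; `require_openai_key` is a hypothetical helper, not part of our scripts or of the `openai` library.

```python
import os
from typing import Mapping


def require_openai_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise if it is unset or blank."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the script."
        )
    return key
```

Calling `require_openai_key()` at the top of a script stops the run before any images are loaded or prompts are built.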
- First, clone the official LLaVA repository.
- Follow the basic installation steps outlined in the repository.
- Run the script for your desired dataset.
- For SEEDBench and MMBench, we make use of the LLaVA codebase's setup. Simply follow the LLaVA-1.5 setup steps and replace the scripts with those of Sphinx.
- For other datasets, follow the setup instructions from the official repository.
- Run our provided script.
If you found our work useful, please consider starring and citing. Thank you!
@inproceedings{MitraCCoT,
title={Compositional Chain of Thought Prompting for Large Multimodal Models},
author={Mitra, Chancharik and Huang, Brandon and Darrell, Trevor and Herzig, Roei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month={June},
year={2024}
}