ComCLIP: Training-Free Compositional Image and Text Matching

This is the official code implementation for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching". [arXiv] [Project Website]

Datasets

Please follow the instructions below to prepare the datasets. A sketch of the resulting datasets/ layout follows the list.

  1. Winoground
    Download the images and store them as datasets/winoground_images. The code handles downloading the csv file.
  2. Compositional Visual Genome (ComVG)
    Download the images and store them as datasets/comvg_images. The test csv file is at datasets/ComVG.csv.
  3. SVO-Probes
    Download the dataset and store the images as datasets/SVO-Probes. Store the csv as datasets/svo-probes.csv.
  4. Flickr30k
    Download the images and store them as datasets/flickr30k_image (please keep only the images that appear in the test set). The test pickle file is datasets/flickr30k_test.pkl.
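
After these steps, the datasets/ folder should look roughly like this (a sketch assembled from the paths above; exact image file names vary):

datasets/
├── winoground_images/
├── comvg_images/
├── ComVG.csv
├── SVO-Probes/
├── svo-probes.csv
├── flickr30k_image/
└── flickr30k_test.pkl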

Usage

Preparation

Please complete the GRiT and detectron2 setup and the CLIP setup first. Download grit_b_densecap_objectdet.pth and store it in GRiT/models. For SLIP, follow its setup instructions and download the ViT-L weights to SLIP/MODEL_PATH.

conda create --name comclip python=3.10
conda activate comclip
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
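
To confirm the CLIP install works, you can load the ViT-L/14 model used by the commands below (a minimal sanity check, not one of the repo's scripts):

import torch
import clip

# Load the ViT-L/14 model referenced by the ComVG and Flickr30k commands below.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
print("CLIP ready, input resolution:", model.visual.input_resolution)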

Winoground

### clip baseline
python winoground/clip_baseline.py --huggingface_token HUGGINGFACE_TOKEN
### blip baseline
python winoground/blip_baseline.py --huggingface_token HUGGINGFACE_TOKEN
### slip baseline
python winoground/slip_baseline.py --huggingface_token HUGGINGFACE_TOKEN

### comclip 
winoground/comclip.sh datasets/winoground_images DENSE_CAPTION_PATH PARSE_TEXT_PATH GRiT_MODEL HUGGINGFACE_KEY OPENAI_KEY
### comblip
winoground/comblip.sh datasets/winoground_images DENSE_CAPTION_PATH PARSE_TEXT_PATH GRiT_MODEL HUGGINGFACE_KEY OPENAI_KEY
### comslip
winoground/comslip.sh datasets/winoground_images DENSE_CAPTION_PATH PARSE_TEXT_PATH GRiT_MODEL HUGGINGFACE_KEY OPENAI_KEY
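
For reference, Winoground reports three metrics: the text score (pick the right caption for each image), the image score (pick the right image for each caption), and the group score (both at once). Below is a minimal sketch of how a CLIP baseline computes them, assuming the gated facebook/winoground dataset on the HuggingFace Hub; it is not the repo's script.

import torch
import clip
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Winoground is gated; pass your HuggingFace token to download it
# (older datasets versions take use_auth_token= instead of token=).
data = load_dataset("facebook/winoground", token="HUGGINGFACE_TOKEN")["test"]

def clip_score(image, text):
    """Cosine similarity between one image and one caption."""
    pixels = preprocess(image.convert("RGB")).unsqueeze(0).to(device)
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        img = model.encode_image(pixels)
        txt = model.encode_text(tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

text_correct = image_correct = group_correct = 0
for ex in data:
    c0i0 = clip_score(ex["image_0"], ex["caption_0"])
    c1i0 = clip_score(ex["image_0"], ex["caption_1"])
    c0i1 = clip_score(ex["image_1"], ex["caption_0"])
    c1i1 = clip_score(ex["image_1"], ex["caption_1"])
    text_ok = c0i0 > c1i0 and c1i1 > c0i1   # right caption for each image
    image_ok = c0i0 > c0i1 and c1i1 > c1i0  # right image for each caption
    text_correct += text_ok
    image_correct += image_ok
    group_correct += text_ok and image_ok

n = len(data)
print(f"text {text_correct/n:.3f}  image {image_correct/n:.3f}  group {group_correct/n:.3f}")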

ComVG & SVO-Probes

### clip baseline
python ComVG/clip_baseline.py --model ViT-L/14 --data_path datasets/ComVG.csv --image_path datasets/comvg_images
### comclip 
ComVG/comclip.sh datasets/comvg_images DENSE_CAPTION_PATH GRiT_MODEL_PATH datasets/ComVG.csv OPENAI_KEY ViT-L/14
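
For reference, the ComVG/SVO-Probes protocol is a binary image choice: each test row pairs one sentence with a matching and a mismatched image, and the model must score the match higher. A minimal sketch of that evaluation with the CLIP baseline is below; the column names (sentence, pos_image, neg_image) are assumptions, so check datasets/ComVG.csv for the actual headers.

import pandas as pd
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
rows = pd.read_csv("datasets/ComVG.csv")

correct = 0
for _, row in rows.iterrows():
    # Hypothetical column names; inspect the csv for the real ones.
    tokens = clip.tokenize([row["sentence"]]).to(device)
    images = torch.stack([
        preprocess(Image.open(f"datasets/comvg_images/{row['pos_image']}").convert("RGB")),
        preprocess(Image.open(f"datasets/comvg_images/{row['neg_image']}").convert("RGB")),
    ]).to(device)
    with torch.no_grad():
        img = model.encode_image(images)
        txt = model.encode_text(tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (txt @ img.T).squeeze(0)  # similarity to [match, mismatch]
    correct += int(sims[0] > sims[1])

print(f"accuracy: {correct / len(rows):.3f}")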

Flickr30k (image retrieval)

### clip baseline (precomputed in datasets/flickr30k_test.pkl already)
python image_retrieval/clip_baseline.py --model VISION_ENCODER_TYPE --dataset datasets/flickr30k_test.pkl --image_path datasets/flickr30k_image
### comclip 
image_retrieval/comclip.sh datasets/flickr30k_image DENSE_CAPTION_FOLDER GRiT_MODEL_PATH datasets/flickr30k_test.pkl OPENAI_KEY VISION_ENCODER_VERSION
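
For reference, text-to-image retrieval is scored with Recall@K: the fraction of captions whose ground-truth image appears among the K most similar images. A minimal sketch over precomputed, L2-normalized CLIP features (the tensor names here are placeholders, not the repo's API):

import torch

def recall_at_k(text_feats, image_feats, gt_index, k=1):
    """text_feats: (N, D), image_feats: (M, D), both L2-normalized;
    gt_index[i] is the row of image_feats matching caption i."""
    sims = text_feats @ image_feats.T        # (N, M) cosine similarities
    topk = sims.topk(k, dim=-1).indices      # (N, k) indices of best images
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Example with random features, just to show the expected shapes.
t = torch.nn.functional.normalize(torch.randn(5000, 768), dim=-1)
i = torch.nn.functional.normalize(torch.randn(1000, 768), dim=-1)
gt = torch.randint(0, 1000, (5000,))
print(recall_at_k(t, i, gt, k=5))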

Acknowledgement

This code is mainly built on GRiT and CLIP. We thank the authors for their models and code.

Citation

@article{jiang2022comclip,
  title={ComCLIP: Training-Free Compositional Image and Text Matching},
  author={Jiang, Kenan and He, Xuehai and Xu, Ruize and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2211.13854},
  year={2022}
}