This repo provides easy-to-use and efficient code for extracting image and text features with the official OpenAI CLIP models, and it is also optimized for multi-GPU, multi-process feature extraction.

The official OpenAI CLIP repo only supports extracting global visual features, while the local grid features from CLIP visual models may contain richer semantic information that benefits various vision-and-language downstream tasks [1][2]. As an alternative, this repo encapsulates minimally modified CLIP code to extract not only global visual features but also local grid visual features from different CLIP visual models. Moreover, the repo is designed in a user-friendly, object-oriented fashion, so users can easily add their own customized `visual_extractor` classes for different input and output grid resolutions.
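To illustrate the object-oriented design, below is a minimal sketch of what a customized visual extractor could look like for the ViT-B/16 setting (14 x 14 patch grid average-pooled to 7 x 7). The class name, constructor signature, and return convention are hypothetical and only assume the official `clip` package; the actual `visual_extractor` classes in this repo may be organized differently.

```python
import torch
import torch.nn as nn
import clip  # official OpenAI CLIP package


class ViTB16Extractor(nn.Module):
    """Hypothetical sketch of a customized extractor: CLIP ViT-B/16 at 224x224,
    whose 14x14 patch grid is average-pooled down to 7x7 (49 grid features)."""

    def __init__(self, device="cuda"):
        super().__init__()
        self.model, self.preprocess = clip.load("ViT-B/16", device=device)
        # Halve the grid resolution to keep downstream captioning training fast.
        self.pool = nn.AvgPool2d(kernel_size=(2, 2), stride=2)

    @torch.no_grad()
    def forward(self, image):
        # image: preprocessed batch of shape (N, 3, 224, 224), already on the model's device
        captured = {}
        # Grab the transformer output (CLS token + 196 patch tokens) with a forward hook.
        handle = self.model.visual.transformer.register_forward_hook(
            lambda module, inputs, output: captured.update(tokens=output)
        )
        global_feat = self.model.encode_image(image)            # (N, 512)
        handle.remove()

        tokens = captured["tokens"].permute(1, 0, 2)            # (L, N, D) -> (N, 197, 768)
        grid = tokens[:, 1:, :]                                 # drop the CLS token
        n, l, d = grid.shape
        side = int(l ** 0.5)                                    # 14
        grid = grid.transpose(1, 2).reshape(n, d, side, side)   # (N, 768, 14, 14)
        grid = self.pool(grid)                                  # (N, 768, 7, 7)
        grid = grid.flatten(2).transpose(1, 2)                  # (N, 49, 768)
        return global_feat, grid
```

Swapping in a different backbone, input resolution, or pooling then only requires changing the constructor, which is the kind of customization point this design aims to expose.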
To verify the semantic quality of the extracted visual grid features, we also applied the grid features extracted from MSCOCO images with different official CLIP models to the standard image captioning task. By simply replacing BUTD features with the extracted CLIP grid features, and without heavy hyperparameter tuning, we easily obtained comparable or superior results with a transformer baseline. Surprisingly, with the `ViT-B/32` CLIP model we reached a CIDEr score of 116.9 in the teacher-forcing setting and 129.6 in the reinforcement learning setting, which conflicts with the experimental results in the CLIP-ViL paper [1], where the authors observed a large performance degradation for CLIP-ViT-B grid features compared with other models (58.0 CIDEr in the `CLIP-ViT-B_Transformer` setting on COCO captioning).
We describe the supported CLIP models, results on MSCOCO image captioning, and other information below. We believe this repo can facilitate the use of powerful CLIP models.
Currently this repo supports five visual extractor settings: three standard pipelines used in the official OpenAI CLIP repo and two additional customized pipelines that support a larger input resolution. You can refer to this file for more details on customizing your own visual backbone for different input and output resolutions. To improve training efficiency in the image captioning task, in some settings we apply `AvgPool2d` to the output feature map to reduce the grid feature size, without large performance degradation (see the short shape-check sketch after the table). We will support more CLIP models in the future.

| Setting | Visual Backbone | CLIP Model | Input Resolution | Output Resolution | Feature Map Downsample | Grid Feature Shape | Global Feature Shape |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Standard | RN101 | RN101 | 224 x 224 | 7 x 7 | None | 49 x 2048 | 1 x 512 |
| Standard | ViT-B/32 | ViT-B/32 | 224 x 224 | 7 x 7 | None | 49 x 768 | 1 x 512 |
| Standard | ViT-B/16 | ViT-B/16 | 224 x 224 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 768 | 1 x 512 |
| Customized | RN101_448 | RN101 | 448 x 448 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 2048 | 1 x 512 |
| Customized | ViT-B/32_448 | ViT-B/32 | 448 x 448 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 768 | 1 x 512 |
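As a quick sanity check of the shapes in the table, the following toy snippet (random tensor, shapes only) shows how the `AvgPool2d` downsample turns a 14 x 14 feature map into the 7 x 7 grid, i.e. 49 grid features:

```python
import torch
import torch.nn as nn

# Grid side before pooling: input resolution / patch size (ViT) or / 32 (ResNet total stride).
print(224 // 16, 448 // 32)             # 14 14  (ViT-B/16 @ 224; RN101 or ViT-B/32 @ 448)

pool = nn.AvgPool2d(kernel_size=(2, 2), stride=2)
fmap = torch.randn(1, 768, 14, 14)      # e.g. a ViT-B/16 grid feature map
print(pool(fmap).shape)                 # torch.Size([1, 768, 7, 7]) -> 49 x 768 grid features
```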
We ran image captioning experiments on X-modaler with the extracted CLIP grid features. Using the default hyperparameters of X-modaler's transformer baseline, except for `SOLVER.BASE_LR=2e-4` in the `ViT-B/16` and `ViT-B/32_448` teacher-forcing settings, we easily obtained comparable or superior results. The performance of the transformer baseline with BUTD features is taken from the X-modaler paper [3].

Results in the teacher-forcing (cross-entropy) setting:

Name | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
---|---|---|---|---|---|---|---|---|
BUTD_feat | 76.4 | 60.3 | 46.5 | 35.8 | 28.2 | 56.7 | 116.6 | 21.3 |
RN101 | 77.3 | 61.3 | 47.7 | 36.9 | 28.7 | 57.5 | 120.6 | 21.8 |
ViT-B/32 | 76.4 | 60.3 | 46.5 | 35.6 | 28.1 | 56.7 | 116.9 | 21.2 |
ViT-B/16 | 78.0 | 62.1 | 48.2 | 37.2 | 28.8 | 57.6 | 122.3 | 22.1 |
RN101_448 | 78.0 | 62.4 | 48.9 | 38.0 | 29.0 | 57.9 | 123.6 | 22.1 |
ViT-B/32_448 | 75.8 | 59.6 | 45.9 | 35.1 | 27.8 | 56.3 | 114.2 | 21.0 |

Results in the reinforcement learning setting:

Name | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
---|---|---|---|---|---|---|---|---|
BUTD_feat | 80.5 | 65.4 | 51.1 | 39.2 | 29.1 | 58.7 | 130.0 | 23.0 |
RN101 | 81.3 | 66.4 | 52.1 | 40.3 | 29.6 | 59.6 | 134.2 | 23.4 |
ViT-B/32 | 79.9 | 64.6 | 50.4 | 38.5 | 29.0 | 58.6 | 129.6 | 22.8 |
ViT-B/16 | 82.0 | 67.3 | 53.1 | 41.1 | 29.9 | 59.8 | 136.6 | 23.8 |
RN101_448 | 81.6 | 66.9 | 52.6 | 40.6 | 29.9 | 59.8 | 136.2 | 23.9 |
ViT-B/32_448 | 79.9 | 64.6 | 50.4 | 38.7 | 28.8 | 58.4 | 127.8 | 22.6 |
Note: The extracted feature files are compatible with X-modaler, so you can conveniently set up your cross-modal analytics experiments there.
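If you want to peek at an extracted file before wiring it into X-modaler, something like the following works. Note that the `.npz` extension, the file name, and the array keys are assumptions here; check the save logic in `clip_visual_feats.py` for the actual on-disk layout.

```python
import numpy as np

# Inspect one extracted feature file (file name and .npz format are assumptions).
feats = np.load("path/to/IMG_OUTPUT_DIR/000000000139.npz")
for key in feats.files:
    print(key, feats[key].shape)   # expect a 1 x 512 global vector and a 49 x d grid array
```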
- PyTorch ≥ 1.9 and a torchvision version that matches the PyTorch installation. Install them together at pytorch.org to ensure this.
- timm ≥ 0.4.5
- Use the CLIP `ViT-B/32` model to extract global textual features of MSCOCO sentences from `dataset_coco.json` in Karpathy's released annotations.
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_textual_feats.py \
    --anno dataset_coco.json \
    --output_dir ${TXT_OUTPUT_DIR} \
    --model_type_or_path 'ViT-B/32'
```
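For reference, the core of this step can be reproduced with the official CLIP API alone. The sketch below is illustrative rather than the script's actual code, and only assumes the `dataset_coco.json` format from Karpathy's annotations (each image carries a list of sentences with a `raw` field).

```python
import json
import torch
import clip

# Minimal sketch: encode MSCOCO captions into 1 x 512 global textual features.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

anno = json.load(open("dataset_coco.json"))
sentences = [s["raw"] for img in anno["images"] for s in img["sentences"]]

with torch.no_grad():
    tokens = clip.tokenize(sentences[:256], truncate=True).to(device)  # one small batch
    text_feats = model.encode_text(tokens)                             # (256, 512)
print(text_feats.shape)
```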
- Use the CLIP `ViT-B/16` model to extract global and grid visual features of MSCOCO images.
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'ViT-B/16' \
    --model_type_or_path 'ViT-B/16'
```
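To see why `ViT-B/16` yields a 14 x 14 grid before pooling (cf. the table above), you can inspect the loaded model; the attribute names below are those of the official CLIP `VisionTransformer`.

```python
import clip

# Shape sanity check for ViT-B/16 (CPU load is enough for inspection).
model, preprocess = clip.load("ViT-B/16", device="cpu")
print(model.visual.input_resolution)            # 224
print(model.visual.conv1.weight.shape)          # [768, 3, 16, 16] -> 16 x 16 patches
print(model.visual.positional_embedding.shape)  # [197, 768] = 1 CLS + 14 * 14 patch tokens
```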
- Use the CLIP `RN101` model to extract global and grid visual features of MSCOCO images.
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101' \
    --model_type_or_path 'RN101'
```
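Under the hood, RN101 grid features come from the 7 x 7 x 2048 map produced by the last residual stage, right before CLIP's attention pooling. A minimal sketch (again using only the official CLIP model, not this repo's actual extractor) looks like this:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN101", device=device)

# Hook the last residual stage to capture its 7x7x2048 output feature map.
feats = {}
handle = model.visual.layer4.register_forward_hook(
    lambda module, inputs, output: feats.update(grid=output)
)

image = torch.randn(1, 3, 224, 224, device=device)  # stand-in for a preprocessed image
with torch.no_grad():
    global_feat = model.encode_image(image)          # (1, 512) global feature
handle.remove()

grid = feats["grid"]                                 # (1, 2048, 7, 7)
grid = grid.flatten(2).transpose(1, 2)               # (1, 49, 2048) grid features
print(global_feat.shape, grid.shape)
```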
- Use the CLIP `RN101` model to extract global and grid visual features of MSCOCO images at 448 x 448 resolution.
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101_448' \
    --model_type_or_path 'RN101'
```
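The customized 448 x 448 settings need two extra pieces: a preprocessing transform at 448 (the `preprocess` returned by `clip.load` is fixed to the model's native 224 input) and a way to read out the larger 14 x 14 map. One possible sketch is below; it bypasses attention pooling and therefore only covers grid features, and the repo's actual `RN101_448` extractor (including how it still produces the 1 x 512 global feature) may be implemented differently.

```python
import torch
import torch.nn as nn
import clip
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN101", device=device)
model.visual.attnpool = nn.Identity()          # visual forward now returns the raw layer4 map

# Hand-built 448 transform using CLIP's normalization constants.
preprocess_448 = transforms.Compose([
    transforms.Resize(448, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(448),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

pool = nn.AvgPool2d(kernel_size=(2, 2), stride=2)
image = torch.randn(1, 3, 448, 448, device=device)   # stand-in; use preprocess_448 on a PIL image
with torch.no_grad():
    fmap = model.visual(image.type(model.dtype))      # (1, 2048, 14, 14)
    grid = pool(fmap).flatten(2).transpose(1, 2)      # (1, 49, 2048) grid features
print(grid.shape)
```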
You can run the same script with the same input list (i.e. `--image_list` or `--anno`) on another GPU, which can even be on a different machine, provided that the disk where the features are written is shared between the machines. The script will start a new feature extraction process that only handles the items that have not been processed yet, without overlapping with the extraction processes already running.
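The skipping logic can be pictured roughly as follows; the function and file names are illustrative, not the script's own, and the real implementation may guard against races more carefully (e.g. by marking an item before starting to process it).

```python
import os

def items_to_process(image_ids, output_dir):
    """Yield only the items whose output file does not exist yet on the shared disk."""
    for image_id in image_ids:
        out_path = os.path.join(output_dir, f"{image_id}.npz")
        if os.path.exists(out_path):
            continue                     # already produced by another extraction process
        yield image_id, out_path
```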
This project is released under the MIT license.
This repo uses resources from OpenAI CLIP, timm, CLIP-ViL, and X-modaler, and is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.
[1] How Much Can CLIP Benefit Vision-and-Language Tasks? Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer. In arXiv, 2021.
[2] In Defense of Grid Features for Visual Question Answering. Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen. In CVPR 2020.
[3] X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics. Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, Tao Mei. In ACM MM 2021 Open Source Software Competition.