Evaluation of CLIP image feature extractors

This repository contains experiments and results comparing image feature extractors obtained by classical supervised training on ImageNet with those obtained by the CLIP [paper][repo][blog] training procedure, evaluated on the domain-specific Fruits-360 dataset.

Zero-shot predictions

The procedure described in the CLIP paper makes it possible to classify images from a new dataset with an arbitrary set of labels, without any training. Example of zero-shot predictions on the Sports-72 dataset (caption format: Predicted (True)).
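For illustration, here is a minimal zero-shot classification sketch using the official openai/clip package; the class names, prompt template, and image path are placeholders rather than the ones used in this repository:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder labels and prompt template; the notebooks select prompts per dataset.
class_names = ["apple", "banana", "cherry"]
prompts = clip.tokenize([f"a photo of a {name}" for name in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between the image and each label prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Predicted:", class_names[probs.argmax().item()])
```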

Experiments performed

We compared feature extractors with different architectures, training procedures, and image upsampling techniques. Unless an upsampling technique is stated explicitly, bicubic interpolation is used. The experiments fall into two main groups:

  1. Linear probing and fine-tuning of CLIP with ResNet and ViT backbones, and of ImageNet-pretrained ResNet and EfficientNet
  2. Zero-shot and K-shot classification of CLIP with ViT and ResNet backbones

We also compared two image upsampling options:

  • Bicubic interpolation
  • SRGAN upsampling [weights]

This comparison was carried out in the following training setups: linear probing and contrastive fine-tuning of CLIP with ResNet and ViT backbones.
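As a rough illustration of the bicubic option, images can be resized to the CLIP input resolution with PIL; the file names and the target size of 224 are assumptions here, and the SRGAN option instead passes images through a pretrained super-resolution generator (see data_prepare/ and image_upsampling.ipynb):

```python
from PIL import Image

def upsample_bicubic(path: str, size: int = 224) -> Image.Image:
    """Upsample an image to size x size using bicubic interpolation."""
    img = Image.open(path).convert("RGB")
    return img.resize((size, size), resample=Image.BICUBIC)

# Hypothetical file names, for illustration only.
upsampled = upsample_bicubic("apple_0.jpg")
upsampled.save("apple_0_upsampled.jpg")
```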

The main plots can be found in the Results section. Full descriptions of the experiments can be found in supplementary/report.pdf.

Repository structure

  • notebooks/ — experiments in the form of Jupyter notebooks
    ├── few_shot_learning.ipynb — k-shot learning procedure
    ├── image_upsampling.ipynb — two ways to upsample images and save the results
    ├── prompts_validation.ipynb — finding the best prompt for a given dataset
    ├── train_ImageNet_models.ipynb — fine-tuning of ImageNet-pretrained models in different settings
    └── train_CLIP.ipynb — fine-tuning of CLIP models in different settings
  • data_prepare/ — auxiliary source code for dataset upsampling
  • src/ — auxiliary source code for training
  • pics/ — pictures for the Results section
  • supplementary/ — report and presentation in .pdf format

Results

Zero-shot predictions

We tested the zero-shot prediction performance of CLIP on several domain-specific datasets: Birds-270, Simpsons characters, Sports-72, and Fruits-360. Here are some examples of the predictions:

Simpsons characters [link] ~ 0.51 accuracy


Birds-270 [link] ~ 0.52 accuracy


Fruits-360 [link] ~ 0.24 accuracy


Sports-72 [link] ~ 0.79 accuracy


K-shot training

A pretrained CLIP model with a ResNet-101 backbone plus a new fully-connected layer that is trained on only k examples of each class.
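A minimal sketch of this setup, assuming the openai/clip package and a standard PyTorch training loop; the number of classes and the learning rate below are placeholders:

```python
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN101", device=device)

# Freeze the CLIP backbone; only the new fully-connected head is trained.
for p in model.parameters():
    p.requires_grad = False

# Infer the visual feature dimension from a dummy forward pass.
with torch.no_grad():
    feature_dim = model.encode_image(torch.zeros(1, 3, 224, 224, device=device)).shape[-1]

num_classes = 131  # placeholder, e.g. Fruits-360
head = nn.Linear(feature_dim, num_classes).to(device)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step on a k-shot batch."""
    with torch.no_grad():
        features = model.encode_image(images.to(device)).float()
    loss = criterion(head(features), labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```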

Fine-tuning with linear probing

Fine-tuning of the visual parts of CLIP models with a linear classifier on top, with frozen or trainable backbones.
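The difference between the two regimes comes down to whether the backbone parameters receive gradients; a sketch, reusing the model and head from the snippet above:

```python
def set_backbone_trainable(model, trainable: bool) -> None:
    """Freeze (linear probing) or unfreeze (fine-tuning) the CLIP visual backbone."""
    for p in model.visual.parameters():
        p.requires_grad = trainable

set_backbone_trainable(model, trainable=False)   # linear probing
# set_backbone_trainable(model, trainable=True)  # fine-tuning of the backbone
```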

Fine-tuning CLIP with different upsamplings

Fine-tuning of the CLIP visual models with different methods and upsamplings:

  1. Maximum likelihood (ML): training the CLIP visual model with a linear layer on top
  2. Cosine similarity maximization (CS): fine-tuning the CLIP visual model to maximize the cosine similarity between embeddings of images of the same class (a sketch of this objective follows the next paragraph)

Each method was tested with ResNet-101/ViT backbones and bicubic/SRGAN upsampling.
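A minimal sketch of one way the CS objective can be written; the exact loss used in the experiments is described in supplementary/report.pdf, and the function name and pair weighting here are assumptions:

```python
import torch
import torch.nn.functional as F

def same_class_cosine_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity over all pairs of embeddings sharing a label.

    features: (batch, dim) output of the CLIP visual encoder.
    labels:   (batch,) integer class labels.
    """
    features = F.normalize(features, dim=-1)
    sim = features @ features.T                        # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # mask of same-class pairs
    same.fill_diagonal_(False)                         # exclude self-pairs
    if not same.any():
        return sim.new_zeros(())
    return -sim[same].mean()                           # minimizing this maximizes similarity
```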
