Evaluation of CLIP image feature extractors

This repository contains experiments and results comparing image feature extractors obtained by classical supervised training on ImageNet with those obtained by the CLIP [paper][repo][blog] training procedure, evaluated on the domain-specific Fruits-360 dataset.

Zero-shot predictions

The procedure described in the CLIP paper makes it possible to classify images from a new dataset with an arbitrary set of labels, without any training. Example of zero-shot predictions on the Sports-72 dataset (caption format: Predicted (True)).
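For illustration, here is a minimal zero-shot classification sketch using the official openai/clip package; the class names, prompt template, and image path are placeholders rather than the ones used in this repository:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder labels and prompt template; the notebooks select prompts per dataset.
class_names = ["apple", "banana", "cherry"]
prompts = clip.tokenize([f"a photo of a {name}" for name in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between the image and each label prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Predicted:", class_names[probs.argmax().item()])
```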

Experiments performed

We compared feature extractors with different architectures, training procedures, and image upsampling techniques. Unless an upsampling technique is stated explicitly, bicubic interpolation is used. The experiments fall into two main groups:

  1. Linear probing and fine-tuning of CLIP with ResNet and ViT backbones, and of ImageNet-pretrained ResNet and EfficientNet
  2. Zero-shot and K-shot classification of CLIP with ViT and ResNet backbones

We also compared two image upsampling options:

  • Bicubic interpolation
  • SRGAN upsampling [weights]

This comparison was carried out in the following training setups: linear probing and contrastive fine-tuning of CLIP with ResNet and ViT backbones.
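As a rough illustration of the bicubic option, images can be resized to the CLIP input resolution with PIL; the file names and the target size of 224 are assumptions here, and the SRGAN option instead passes images through a pretrained super-resolution generator (see data_prepare/ and image_upsampling.ipynb):

```python
from PIL import Image

def upsample_bicubic(path: str, size: int = 224) -> Image.Image:
    """Upsample an image to size x size using bicubic interpolation."""
    img = Image.open(path).convert("RGB")
    return img.resize((size, size), resample=Image.BICUBIC)

# Hypothetical file names, for illustration only.
upsampled = upsample_bicubic("apple_0.jpg")
upsampled.save("apple_0_upsampled.jpg")
```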

The main plots can be found in the Results section. Full descriptions of the experiments can be found in supplementary/report.pdf.

Repository structure

  • notebooks/ — experiments in the form of Jupyter notebooks
    ├── few_shot_learning.ipynb — k-shot learning procedure
    ├── image_upsampling.ipynb — two ways to upsample images and save the results
    ├── prompts_validation.ipynb — finding the best prompt for a given dataset
    ├── train_ImageNet_models.ipynb — fine-tuning of ImageNet-pretrained models in different settings
    └── train_CLIP.ipynb — fine-tuning of CLIP models in different settings
  • data_prepare/ — auxiliary source code for dataset upsampling
  • src/ — auxiliary source code for training
  • pics/ — pictures for the Results section
  • supplementary/ — report and presentation in .pdf format

Results

Zero-shot predictions

We tested the zero-shot prediction performance of CLIP on several domain-specific datasets: Birds-270, Simpsons characters, Sports-72, and Fruits-360. Here are some examples of the predictions:

Simpsons characters [link] ~ 0.51 accuracy


Birds-270 [link] ~ 0.52 accuracy


Fruits-360 [link] ~ 0.24 accuracy


Sports-72 [link] ~ 0.79 accuracy


K-shot training

A pretrained CLIP model with a ResNet-101 backbone plus a new fully-connected layer that is trained on only k examples of each class.
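A minimal sketch of this setup, assuming the openai/clip package and a standard PyTorch training loop; the number of classes and the learning rate below are placeholders:

```python
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN101", device=device)

# Freeze the CLIP backbone; only the new fully-connected head is trained.
for p in model.parameters():
    p.requires_grad = False

# Infer the visual feature dimension from a dummy forward pass.
with torch.no_grad():
    feature_dim = model.encode_image(torch.zeros(1, 3, 224, 224, device=device)).shape[-1]

num_classes = 131  # placeholder, e.g. Fruits-360
head = nn.Linear(feature_dim, num_classes).to(device)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step on a k-shot batch."""
    with torch.no_grad():
        features = model.encode_image(images.to(device)).float()
    loss = criterion(head(features), labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```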

Fine-tuning with linear probing

Fine-tuning of the visual parts of CLIP models with a linear classifier on top, with frozen or trainable backbones.
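The difference between the two regimes comes down to whether the backbone parameters receive gradients; a sketch, reusing the model and head from the snippet above:

```python
def set_backbone_trainable(model, trainable: bool) -> None:
    """Freeze (linear probing) or unfreeze (fine-tuning) the CLIP visual backbone."""
    for p in model.visual.parameters():
        p.requires_grad = trainable

set_backbone_trainable(model, trainable=False)   # linear probing
# set_backbone_trainable(model, trainable=True)  # fine-tuning of the backbone
```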

Fine-tuning CLIP with different upsamplings

Fine-tuning of the CLIP visual models with different methods and upsamplings:

  1. Maximum likelihood (ML): training the CLIP visual model with a linear layer on top
  2. Cosine similarity maximization (CS): fine-tuning the CLIP visual model to maximize the cosine similarity between embeddings of images of the same class (a sketch of this objective follows the next paragraph)

Each method was tested with ResNet-101/ViT backbones and bicubic/SRGAN upsampling.
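A minimal sketch of one way the CS objective can be written; the exact loss used in the experiments is described in supplementary/report.pdf, and the function name and pair weighting here are assumptions:

```python
import torch
import torch.nn.functional as F

def same_class_cosine_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity over all pairs of embeddings sharing a label.

    features: (batch, dim) output of the CLIP visual encoder.
    labels:   (batch,) integer class labels.
    """
    features = F.normalize(features, dim=-1)
    sim = features @ features.T                        # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # mask of same-class pairs
    same.fill_diagonal_(False)                         # exclude self-pairs
    if not same.any():
        return sim.new_zeros(())
    return -sim[same].mean()                           # minimizing this maximizes similarity
```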
