Everything CLIP related seems to break starting from transformers 4.28.0 #24857
Comments
Just to make something reproducible, here we can see that the output of CLIPProcessor changes. I run the script:

```python
from PIL import Image
import requests
import transformers
from torchvision.transforms.functional import to_tensor
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
reference = to_tensor(image)

encoded_data = processor(
    text=[""],
    images=[reference],
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)

print(transformers.__version__)
print(encoded_data.pixel_values.mean())
```

With 4.27.4 I get

With 4.28.0 I get
I figured out the issue: the CLIPProcessor expects tensors in the range [0, 255], but only starting from transformers 4.28.0. This seems a pretty breaking change to me! If I multiply my tensor by 255, I get the right results.
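A minimal sketch of this workaround, assuming the `processor` and the `reference` tensor from the script above:

```python
# Scale the [0, 1] tensor produced by to_tensor() back to [0, 255] before
# passing it to the processor, so the processor's own 1/255 rescaling applies.
encoded_data = processor(
    text=[""],
    images=[reference * 255],
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)
print(encoded_data.pixel_values.mean())
```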
Hi, thanks for reporting. This seems related to #23096 and may be caused by #22458. cc @amyeroberts
Hi @andreaferretti, thanks for raising this issue! What's being observed is actually a resolution of inconsistent behaviour in the previous CLIP feature extractors. I'll explain:

In the previous behaviour, images kept their upscaled values after resizing. Currently, if an image was upscaled during resizing, the pixel values are downscaled back, e.g. to between 0 and 1. Rather than trying to infer processor behaviour based on the inputs, we keep the processing behaviour consistent and let the user explicitly control the scaling. If you wish to input images whose pixel values have already been downscaled, you just need to tell the image processor not to do any additional scaling, using `outputs = image_processor(images, do_rescale=False)`. Alternatively, you could pass in the images without calling `to_tensor` on them first.

In the issues linked by @NielsRogge, this is also explained: #23096 (comment)

However, this is the second time a similar issue has been raised, indicating that the behaviour is unexpected. I'll think about how best to address this with documentation or possibly a warning within the code.
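Applied to the original reproduction, the suggested fix might look like this sketch, assuming the `reference` tensor (already in [0, 1]) from the first script:

```python
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# do_rescale=False tells the image processor not to apply its own 1/255 scaling,
# since `reference` was already rescaled to [0, 1] by to_tensor().
outputs = image_processor(images=[reference], return_tensors="pt", do_rescale=False)
print(outputs.pixel_values.mean())
```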
Yeah, it would be useful to add a warning mentioning `do_rescale`.
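For illustration only, such a warning could take the shape of the sketch below; the helper name and heuristic are hypothetical, not actual transformers code. It flags float inputs that already look like they are in [0, 1] while rescaling is still enabled.

```python
import warnings
import numpy as np


def _maybe_warn_double_rescale(image: np.ndarray, do_rescale: bool) -> None:
    """Hypothetical helper: warn if an already-rescaled image is about to be rescaled again."""
    if do_rescale and np.issubdtype(image.dtype, np.floating) and float(image.max()) <= 1.0:
        warnings.warn(
            "The input image appears to already have values in [0, 1]; "
            "pass do_rescale=False if it was rescaled beforehand."
        )
```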
I am still getting widely different results with the JAX (Scenic) implementation of CLIP:

```python
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
import torch
from PIL import Image
import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip


def _clip_preprocess(images, size):
    target_shape = images.shape[:-3] + (size, size, images.shape[-1])
    images = jax.image.resize(images, shape=target_shape, method='bicubic')
    images = clip.normalize_image(images)
    return images


def get_image_in_format(image, size, format="pt"):
    images = np.array(image) / 255.
    images = np.expand_dims(images, 0)
    pp_images = _clip_preprocess(images, size)
    if format == "pt":
        inputs = {}
        inputs["pixel_values"] = torch.from_numpy(np.array(pp_images))
        inputs["pixel_values"] = inputs["pixel_values"].permute(0, 3, 1, 2)
        return inputs
    inputs = pp_images
    return inputs


# Comes from https://huggingface.co/datasets/diffusers/docs-images/blob/main/amused/glowing_512_2.png
image = Image.open("glowing_512_2.png")

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14-336").eval()

inputs = get_image_in_format(image, processor.crop_size["height"], format="pt")
with torch.no_grad():
    output = model(**inputs)

temp = output.image_embeds[0, :4].numpy().flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

print("=====Printing JAX model=====")

_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip.load_model_vars(_CLIP_MODEL_NAME)
input_image_size = clip.IMAGE_RESOLUTION[_CLIP_MODEL_NAME]

images = get_image_in_format(image, size=input_image_size, format="jax")
# Forward pass through the Scenic CLIP model.
image_embs, _ = _model.apply(_model_vars, images, None)

temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
```

This gives two sets of values that seem to be quite different for the exact same input. @sanchit-gandhi, would you have a clue about it?
Hi, not sure if you're comparing apples to apples. When comparing the original CLIP repository to the Transformers one, they match: https://colab.research.google.com/drive/15ZhC32ovBKAU5JqC-kcIOntW_oU-JrkB?usp=sharing. Scenic is not the original implementation of CLIP, so there might be some differences. I would first check whether the Scenic implementation outputs the same logits as the OpenAI CLIP repository.
You are right:

```python
import clip
import torch
import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip_scenic

inputs = np.random.randn(1, 336, 336, 3)

model, preprocess = clip.load("ViT-L/14@336px", device="cpu")
with torch.no_grad():
    image = torch.from_numpy(inputs.transpose(0, 3, 1, 2))
    image_features = model.encode_image(image).numpy()

print(image_features.shape)
temp = image_features[0, :4].flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

print("=====Printing JAX model=====")

_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip_scenic.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip_scenic.load_model_vars(_CLIP_MODEL_NAME)

images = jax.numpy.array(inputs)
image_embs, _ = _model.apply(_model_vars, images, None)

print(image_embs.shape)
temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
```

Gives:

```
(1, 768)
-0.1827, 0.7319, 0.8779, 0.4829
=====Printing JAX model=====
(1, 768)
-0.0107, 0.0429, 0.0514, 0.0283
```

Sorry for the false alarm here. Have raised an issue: google-research/scenic#991.
System Info

`transformers` version: 4.28.0

Who can help?

@amyeroberts

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
It seems to me that there is some regression starting from transformers 4.28.0 that affects the CLIP vision model and everything related to it.
In particular, I am having issues with ClipSeg and CLIPVisionModel.
ClipSeg
For ClipSeg, I am able to use it and get the expected masks, essentially by literally following the example here:
Then `logits` contains the logits, from which I can obtain a mask (something like the sketch below). I tested this and it works reliably up to transformers 4.27.4, but with transformers 4.28.0 I get masks that are completely black regardless of the input image.
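The original mask-extraction snippet was not included above; the following is only a rough sketch of that kind of pipeline, assuming the CIDAS/clipseg-rd64-refined checkpoint, illustrative prompts, and a sigmoid threshold of 0.5 (none of which are taken from the report):

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").eval()

image = Image.open("some_image.jpg")  # hypothetical input image
prompts = ["a cat", "a remote"]       # illustrative prompts
inputs = processor(text=prompts, images=[image] * len(prompts), padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # one low-resolution logit map per prompt

# Binarise the first prompt's logits into a mask via sigmoid + threshold.
mask = (torch.sigmoid(logits[0]) > 0.5).numpy().astype("uint8") * 255
Image.fromarray(mask).save("mask.png")
```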
CLIPVisionModel

This is harder to describe, since it relies on an internal model. I have trained a model that makes use of the image embeddings generated by CLIPVisionModel for custom subject generation. Everything works well up to transformers 4.27.4. If I switch to 4.28.0, the generated image changes completely; the only change is installing 4.28.0.

In fact, if I save the embeddings generated by CLIPVisionModel with the two different versions for any random image, I see that they are different. To be sure, this is how I generate image embeddings (see the sketch below):
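The embedding code itself was not included in the report; the following is only a minimal sketch of generating image embeddings with CLIPVisionModel and the checkpoint mentioned below, assuming a PIL image as input:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()

image = Image.open("some_image.jpg")  # hypothetical input image
inputs = processor(images=[image], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# pooler_output is the pooled image embedding; last_hidden_state holds per-patch features.
embeddings = outputs.pooler_output
print(embeddings.shape)  # (1, 1024) for clip-vit-large-patch14
```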
For reference, I am using clip-vit-large-patch14
Expected behavior
I would expect CLIPVisionModel to give the same result on the same image, both in 4.27.4 and in 4.28.0.