how to generate most matching text set #9
Comments
Hi, thanks for your clarification.
Yes. In our setting, both the images and captions are perturbed, so we use the intermediate adversarial texts as guidance when generating the adversarial images. This way, the final adversarial images have better attack ability than those generated directly with clean texts as guidance. If only the images were perturbed, it would be fine to skip the step of generating the intermediate adversarial texts.
Thanks for your explanation. As I understand it, you are saying that with the intermediate adversarial texts $t'_{aug}$, the ASR is empirically better than with the clean texts $t_{aug}$. But I somewhat disagree with "both the images and captions are perturbed, so we utilize the intermediate adversarial texts as the guidance to generate the adversarial images", since it is easy to get an adversarial image ... I think your zig-zag way of generating the adversarial image and text subtly captures some multimodal interaction, as ...
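For reference, here is a minimal sketch of the text-guided image attack being discussed: PGD on the image using the embeddings of the intermediate adversarial captions $t'_{aug}$ as guidance. This is not the authors' code; the `encode_image`/`encode_text` interfaces, the cosine-similarity loss, and the hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def attack_image_with_text_guidance(model, image, adv_text_embeds,
                                    eps=8/255, alpha=2/255, steps=10):
    """PGD-style attack that pushes the image embedding away from the
    (intermediate adversarial) caption embeddings used as guidance.

    adv_text_embeds: (M, d) L2-normalized embeddings of t'_aug.
    """
    adv_image = image.clone().detach()
    for _ in range(steps):
        adv_image.requires_grad_(True)
        img_embed = F.normalize(model.encode_image(adv_image), dim=-1)
        # Minimize the summed cosine similarity to every guidance caption,
        # i.e. maximize the image-text mismatch.
        loss = (img_embed @ adv_text_embeds.t()).sum()
        grad = torch.autograd.grad(loss, adv_image)[0]
        with torch.no_grad():
            adv_image = adv_image - alpha * grad.sign()
            adv_image = image + (adv_image - image).clamp(-eps, eps)
            adv_image = adv_image.clamp(0, 1).detach()
    return adv_image
```

Swapping `adv_text_embeds` for the clean-caption embeddings would give the baseline being compared against in the ASR discussion above.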
Thanks for sharing your great work.
I am confused about this step:
"we select the most matching caption pairs from the dataset of each image v to form an augmented caption set t = {t1, t2, ..., tM}"
Because the dataset only gives a single image-text pair, how could you find multiple matched texts for a single image?
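One plausible reading of the quoted step, sketched below: score every caption in the dataset against the image with a CLIP-style encoder and keep the top-M as the augmented set. The `model`/`tokenizer` interfaces and the idea that captions of other images may be reused are assumptions, not the authors' confirmed procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def most_matching_captions(model, tokenizer, image, caption_pool, M=5):
    """Return the M captions from `caption_pool` whose embeddings are
    most similar to the embedding of `image`."""
    img_embed = F.normalize(model.encode_image(image), dim=-1)      # (1, d)
    txt_embeds = F.normalize(
        model.encode_text(tokenizer(caption_pool)), dim=-1)         # (N, d)
    scores = (img_embed @ txt_embeds.t()).squeeze(0)                # (N,)
    top_idx = scores.topk(M).indices.tolist()
    return [caption_pool[i] for i in top_idx]
```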