how to generate most matching text set #9

Open
luoyuchenmlcv opened this issue Mar 21, 2024 · 6 comments

@luoyuchenmlcv

Thanks for sharing your great work.

I am confused about this step
"we select the most matching caption pairs from the dataset of each image v to form an augmented caption set t ={t1, t2, ..., tM }"

Since the dataset only gives a single image-text pair, how could you find multiple matched texts for a single image?

@Zoky-2020
Owner

Hi.
In the test sets of MSCOCO and Flickr-30k datasets, each image is associated with approximately 4-6 captions. For datasets where an image is paired with a single caption, generating additional matching captions using large language models presents a viable alternative.

@luoyuchenmlcv
Author

Thanks for your clarification.
One more question: can I understand the goal of the SGA attack in an image-retrieval and text-retrieval way? You generate an adversarial image and use it to retrieve text, but the retrieved text does not correctly describe the content of the adversarial image. You generate an adversarial text and use it to retrieve an image, but the retrieved image does not correctly correspond to the adversarial text.
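
(For instance, image-to-text attack success could be measured roughly like this; a hypothetical sketch, not the repository's evaluation code:)

```python
import torch
import torch.nn.functional as F

def i2t_attack_success(v_adv_feat, caption_feats, gt_indices) -> bool:
    """True if the adversarial image no longer retrieves a ground-truth caption at rank 1.

    v_adv_feat: (d,) adversarial image feature; caption_feats: (num_captions, d);
    gt_indices: set of caption indices that correctly describe the clean image.
    """
    sims = F.normalize(caption_feats, dim=-1) @ F.normalize(v_adv_feat, dim=-1)
    return sims.argmax().item() not in gt_indices
```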

@Zoky-2020
Owner

Yes.

@luoyuchenmlcv
Author

luoyuchenmlcv commented Mar 21, 2024

I am a bit confused by the SGA process.

[screenshot: the SGA pipeline figure from the paper]

In my understanding,

Let's say $(v, t)$ is a pre-defined positive pair in the dataset. We want to generate $v'$ and $t'$ such that $(v', t)$ is a negative pair and $(v, t')$ is also a negative pair.

First you do self-augmentation: $t_{aug} = [t_1, ..., t_M]$, $v_{aug} = [v_1, ..., v_N]$.

Hence, described at the set level, we want $(v', t_{aug})$ to be negative pairs and $(v_{aug}, t')$ to also be negative pairs.

To generate such $v^{\prime}$, why not directly contrast with $t_{aug}$? Then $v^{\prime}$ can be calculated as:
$v^{\prime}=\underset{v^{\prime} \in B\left[v, \epsilon_v\right]}{\arg \max } -\sum_{i=1}^{M} \sum_{j=1}^{N} \frac{f_T\left(t_i\right)}{\left|f_T\left(t_i\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

and $t'$ can be generated in a similar way.

However, you first generate $t'_{aug} = [t'_1, ..., t'_M]$

then make $(t^{\prime}_{aug}, v)$ negative pairs, and then calculate $v'$ as follows:

$v^{\prime}=\underset{v^{\prime} \in B\left[v, \epsilon_v\right]}{\arg \max } -\sum_{i=1}^{M} \sum_{j=1}^{N} \frac{f_T\left(t^{\prime}_i\right)}{\left|f_T\left(t^{\prime}_i\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

I don't really get why you add this additional step of generating $t^{\prime}_{aug} = [t'_1, ..., t'_M]$.
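
For concreteness, here is a minimal PGD-style sketch of the direct variant I have in mind (PyTorch; `f_I` is the image encoder, `t_aug_feats` are precomputed text features, and the step size, iteration count, and all names are hypothetical; the augmented views are treated as a fixed batch for simplicity):

```python
import torch
import torch.nn.functional as F

def direct_image_attack(v_aug, t_aug_feats, f_I, eps=8/255, alpha=2/255, steps=10):
    """v_aug: (N, C, H, W) scale-augmented copies of v; t_aug_feats: (M, d)."""
    t_feats = F.normalize(t_aug_feats, dim=-1)             # f_T(t_i) / |f_T(t_i)|
    delta = torch.zeros_like(v_aug, requires_grad=True)    # perturbation in B[v, eps_v]
    for _ in range(steps):
        v_feats = F.normalize(f_I(v_aug + delta), dim=-1)  # f_I(v_j) / |f_I(v_j)|
        # negative summed cosine similarity over all M x N image-text pairs
        loss = -(v_feats @ t_feats.T).sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()             # ascend the loss: push pairs apart
            delta.clamp_(-eps, eps)                        # project back into the eps-ball
            delta.grad.zero_()
    return (v_aug + delta).detach()
```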

@Zoky-2020
Owner

In our setting, both the images and captions are perturbed, so we use the intermediate adversarial texts as guidance when generating the adversarial images. In this way, the final adversarial images have better attack ability than those generated directly with clean texts as guidance. If only the images were perturbed, it would be fine to skip the step of generating the intermediate adversarial texts $t_{aug}^{\prime}$.
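
A rough sketch of that ordering, reusing the hypothetical `direct_image_attack` from the previous comment; `text_attack` stands in for a word-substitution attack on a caption and is likewise a placeholder, not this repository's actual API:

```python
import torch
import torch.nn.functional as F

def sga_round(v_aug, t_aug, f_I, f_T, text_attack):
    # Step 1: perturb each caption against the *clean* image set -> t'_aug.
    with torch.no_grad():
        v_feats = F.normalize(f_I(v_aug), dim=-1)          # (N, d)
    t_adv = [text_attack(t, v_feats) for t in t_aug]       # intermediate t'_aug

    # Step 2: perturb the image against the *adversarial* texts t'_aug
    # (this is the extra step; the direct variant would pass t_aug here).
    with torch.no_grad():
        t_adv_feats = torch.cat([f_T(t) for t in t_adv])   # (M, d), f_T returns (1, d)
    v_adv = direct_image_attack(v_aug, t_adv_feats, f_I)

    # Step 3: generate the final adversarial captions, now guided by v_adv.
    with torch.no_grad():
        v_adv_feats = F.normalize(f_I(v_adv), dim=-1)
    t_final = [text_attack(t, v_adv_feats) for t in t_aug]
    return v_adv, t_final
```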

@luoyuchenmlcv
Author

Thanks for your explanation. As I understand it, you are saying that with the intermediate adversarial texts $t'_{aug}$, the ASR is empirically better than with the clean texts $t_{aug}$.

But I somewhat disagree with "both the images and captions are perturbed, so we use the intermediate adversarial texts as guidance when generating the adversarial images", since it is easy to get the adversarial image $v'$ by contrasting with $t_{aug}$ and the adversarial text $t'$ by contrasting with $v_{aug}$:

$v^{\prime}=\underset{v^{\prime} \in B\left[v, \epsilon_v\right]}{\arg \max } -\sum_{i=1}^{M} \sum_{j=1}^{N} \frac{f_T\left(t_i\right)}{\left|f_T\left(t_i\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

$t^{\prime}=\underset{t^{\prime} \in B\left[t, \epsilon_t\right]}{\arg \max } -\sum_{j=1}^{N} \frac{f_T\left(t^{\prime}\right)}{\left|f_T\left(t^{\prime}\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

I think your zig-zag way of generating the adversarial image and text subtly captures some multimodal interaction, with $t'_{aug}$ and $v'$ serving as intermediate steps toward the final $v'$ and $t'$, but it does not seem quite intuitive. If you could explain why, I would be very grateful.
