how to generate most matching text set #9

Open
luoyuchenmlcv opened this issue Mar 21, 2024 · 6 comments

@luoyuchenmlcv

Thanks for sharing your great work.

I am confused about this step
"we select the most matching caption pairs from the dataset of each image v to form an augmented caption set t ={t1, t2, ..., tM }"

Since the dataset only gives a single image-text pair, how could you find multiple matched texts for a single image?

@Zoky-2020
Owner

Hi.
In the test sets of MSCOCO and Flickr-30k datasets, each image is associated with approximately 4-6 captions. For datasets where an image is paired with a single caption, generating additional matching captions using large language models presents a viable alternative.

@luoyuchenmlcv
Author

Thanks for your clarification.
One more question: can I understand the goal of the SGA attack in an image-retrieval and text-retrieval way? You generate an adversarial image and use it to retrieve text, but the retrieved text does not correctly describe the content of the adversarial image. You generate an adversarial text and use it to retrieve an image, but the retrieved image does not correctly correspond to the adversarial text.
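
(For instance, image-to-text attack success could be measured roughly like this; a hypothetical sketch, not the repository's evaluation code:)

```python
import torch
import torch.nn.functional as F

def i2t_attack_success(v_adv_feat, caption_feats, gt_indices) -> bool:
    """True if the adversarial image no longer retrieves a ground-truth caption at rank 1.

    v_adv_feat: (d,) adversarial image feature; caption_feats: (num_captions, d);
    gt_indices: set of caption indices that correctly describe the clean image.
    """
    sims = F.normalize(caption_feats, dim=-1) @ F.normalize(v_adv_feat, dim=-1)
    return sims.argmax().item() not in gt_indices
```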

@Zoky-2020
Owner

Yes.

@luoyuchenmlcv
Author

luoyuchenmlcv commented Mar 21, 2024

I am a bit confused by the SGA process.

[screenshot: the SGA pipeline figure from the paper]

In my understanding,

Let's say $(v, t)$ is a pre-defined positive pair in the dataset. We want to generate $v'$ and $t'$ such that $(v', t)$ is a negative pair and $(v, t')$ is also a negative pair.

First you do self-augmentation: $t_{aug} = [t_1, ..., t_M]$, $v_{aug} = [v_1, ..., v_N]$.

Hence, described at the set level, we want $(v', t_{aug})$ to be negative pairs and $(v_{aug}, t')$ to also be negative pairs.

To generate such $v^{\prime}$, why not directly contrast with $t_{aug}$? Then $v^{\prime}$ can be calculated as:
$v^{\prime}=\underset{v^{\prime} \in B\left[v, \epsilon_v\right]}{\arg \max } -\sum_{i=1}^{M} \sum_{j=1}^{N} \frac{f_T\left(t_i\right)}{\left|f_T\left(t_i\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

and $t'$ can be generated in a similar way.

However, you first generate $t'_{aug} = [t'_1, ..., t'_M]$

then make $(t^{\prime}_{aug}, v)$ negative pairs, and then calculate $v'$ as follows:

$v^{\prime}=\underset{v^{\prime} \in B\left[v, \epsilon_v\right]}{\arg \max } -\sum_{i=1}^{M} \sum_{j=1}^{N} \frac{f_T\left(t^{\prime}_i\right)}{\left|f_T\left(t^{\prime}_i\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

I don't really get why you add this additional step of generating $t^{\prime}_{aug} = [t'_1, ..., t'_M]$.
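
For concreteness, here is a minimal PGD-style sketch of the direct variant I have in mind (PyTorch; `f_I` is the image encoder, `t_aug_feats` are precomputed text features, and the step size, iteration count, and all names are hypothetical; the augmented views are treated as a fixed batch for simplicity):

```python
import torch
import torch.nn.functional as F

def direct_image_attack(v_aug, t_aug_feats, f_I, eps=8/255, alpha=2/255, steps=10):
    """v_aug: (N, C, H, W) scale-augmented copies of v; t_aug_feats: (M, d)."""
    t_feats = F.normalize(t_aug_feats, dim=-1)             # f_T(t_i) / |f_T(t_i)|
    delta = torch.zeros_like(v_aug, requires_grad=True)    # perturbation in B[v, eps_v]
    for _ in range(steps):
        v_feats = F.normalize(f_I(v_aug + delta), dim=-1)  # f_I(v_j) / |f_I(v_j)|
        # negative summed cosine similarity over all M x N image-text pairs
        loss = -(v_feats @ t_feats.T).sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()             # ascend the loss: push pairs apart
            delta.clamp_(-eps, eps)                        # project back into the eps-ball
            delta.grad.zero_()
    return (v_aug + delta).detach()
```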

@Zoky-2020
Owner

In our setting, both the images and captions are perturbed, so we use the intermediate adversarial texts as guidance when generating the adversarial images. In this way, the final adversarial images have better attack ability than those generated directly with clean texts as guidance. If only the images were perturbed, it would be fine to skip the step of generating the intermediate adversarial texts $t_{aug}^{\prime}$.
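
A rough sketch of that ordering, reusing the hypothetical `direct_image_attack` from the previous comment; `text_attack` stands in for a word-substitution attack on a caption and is likewise a placeholder, not this repository's actual API:

```python
import torch
import torch.nn.functional as F

def sga_round(v_aug, t_aug, f_I, f_T, text_attack):
    # Step 1: perturb each caption against the *clean* image set -> t'_aug.
    with torch.no_grad():
        v_feats = F.normalize(f_I(v_aug), dim=-1)          # (N, d)
    t_adv = [text_attack(t, v_feats) for t in t_aug]       # intermediate t'_aug

    # Step 2: perturb the image against the *adversarial* texts t'_aug
    # (this is the extra step; the direct variant would pass t_aug here).
    with torch.no_grad():
        t_adv_feats = torch.cat([f_T(t) for t in t_adv])   # (M, d), f_T returns (1, d)
    v_adv = direct_image_attack(v_aug, t_adv_feats, f_I)

    # Step 3: generate the final adversarial captions, now guided by v_adv.
    with torch.no_grad():
        v_adv_feats = F.normalize(f_I(v_adv), dim=-1)
    t_final = [text_attack(t, v_adv_feats) for t in t_aug]
    return v_adv, t_final
```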

@luoyuchenmlcv
Author

Thanks for your explanation. As I understand it, you are saying that with the intermediate adversarial texts $t'_{aug}$, the ASR is empirically better than with the clean texts $t_{aug}$.

But I somewhat disagree with "both the images and captions are perturbed, so we use the intermediate adversarial texts as guidance when generating the adversarial images", since it is easy to get the adversarial image $v'$ by contrasting with $t_{aug}$ and the adversarial text $t'$ by contrasting with $v_{aug}$:

$v^{\prime}=\underset{v^{\prime} \in B\left[v, \epsilon_v\right]}{\arg \max } -\sum_{i=1}^{M} \sum_{j=1}^{N} \frac{f_T\left(t_i\right)}{\left|f_T\left(t_i\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

$t^{\prime}=\underset{t^{\prime} \in B\left[t, \epsilon_t\right]}{\arg \max } -\sum_{j=1}^{N} \frac{f_T\left(t^{\prime}\right)}{\left|f_T\left(t^{\prime}\right)\right|} \cdot \frac{f_I\left(v_j\right)}{\left|f_I\left(v_j\right)\right|}$

I think your zig-zag way of generating the adversarial image and text subtly captures some multimodal interaction, with $t'_{aug}$ and $v'$ serving as intermediate steps toward the final $v'$ and $t'$, but it does not seem quite intuitive. If you could explain why, I would be very grateful.
