When I applied re-attention to DeiT-S (https://github.com/facebookresearch/deit), I observed no accuracy gain. Could you give some advice?

Thanks for trying it out! In our observation, re-attention's benefit is proportional to the number of "similar blocks" as defined in the paper, and that number is typically small when the model is shallow, as shown in Fig. 1 of the paper. You could instead try using cosine similarity as a regularizer, as described in the updated paper. Also, since the model is shallow, it is not necessary to apply re-attention to all blocks; you could refer to Fig. 9 in the appendix.
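If it helps to check this on DeiT-S before tuning further, below is a minimal PyTorch sketch that estimates how many similar blocks a model has by computing pairwise cosine similarity between the attention maps of different blocks. This is a simplified version of the cross-layer similarity measure in the paper, not code from the official repo; the function name and the assumption that you can hook out per-block attention maps are mine.

```python
import torch
import torch.nn.functional as F

def cross_layer_attention_similarity(attn_maps):
    """Mean pairwise cosine similarity between per-block attention maps.

    attn_maps: list of L tensors, one per transformer block, each of
               shape (B, H, N, N) as captured by an attention hook.
    Returns an (L, L) matrix; large off-diagonal entries indicate
    "similar blocks" in the sense of the paper.
    """
    flat = [a.flatten(start_dim=1) for a in attn_maps]  # (B, H*N*N) each
    num_blocks = len(flat)
    sim = torch.zeros(num_blocks, num_blocks)
    for p in range(num_blocks):
        for q in range(num_blocks):
            # Cosine similarity per sample, averaged over the batch.
            sim[p, q] = F.cosine_similarity(flat[p], flat[q], dim=1).mean()
    return sim
```

If most off-diagonal entries stay low across DeiT-S's 12 blocks, that would be consistent with the observation above: few similar blocks, hence little to gain from re-attention at this depth.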