Hi folks,
I just read the ViTDet paper, which gives some insight into training a single-resolution backbone for object detection.
There is a small part I don't understand, where it mentions the propagation strategy:
"A small
number of cross-window blocks (e.g., 4), which could be global attention [54] or
convolutions, are used to propagate information. These adaptations are made
only during fine-tuning and do not alter pre-training."
"(i) Global propagation. We perform global self-attention in the last block of
each subset. As the number of global blocks is small, the memory and computation cost is feasible. This is similar to the hybrid window attention in [34] that
was used jointly with FPN.
(ii) Convolutional propagation. As an alternative, we add an extra convolutional block after each subset. A convolutional block is a residual block [27] that
consists of one or more convolutions and an identity shortcut. The last layer in
this block is initialized as zero, such that the initial status of the block is an
identity [22]. Initializing a block as identity allows us to insert it into any place
in a pre-trained backbone without breaking the initial status of the backbone"
From my understanding, if the backbone has 24 layers, propagation happens every 6 layers. However, I'm not sure how it happens.
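If I write down the schedule as I picture it (purely my own sketch, not the actual ViTDet code):

```python
# My reading of the schedule: 24 blocks split into 4 subsets of 6 blocks each,
# with "propagation" attached to the last block of every subset.
NUM_BLOCKS, BLOCKS_PER_SUBSET = 24, 6

propagation_indices = [i for i in range(NUM_BLOCKS) if (i + 1) % BLOCKS_PER_SUBSET == 0]
print(propagation_indices)  # [5, 11, 17, 23] -> blocks 6, 12, 18, 24 propagate information
```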
For convolutional propagation, I suppose the output of each subset is reshaped back to an image-shaped feature map, a convolution is applied, and a residual (ResNet-style) connection is added.
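In PyTorch, my guess at the mechanism would look roughly like this (just a sketch of my understanding, with made-up names, not the actual implementation):

```python
import torch
import torch.nn as nn

class ConvPropagation(nn.Module):
    """My guess: reshape tokens to a 2D feature map, apply a residual conv block
    whose last conv is zero-initialized (so it starts as an identity), reshape back."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv2.weight)  # zero init -> block output is identity at start
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x, hw):
        B, N, C = x.shape
        H, W = hw
        feat = x.transpose(1, 2).reshape(B, C, H, W)          # tokens -> image-shaped map
        feat = feat + self.conv2(self.act(self.conv1(feat)))  # convs + identity shortcut
        return feat.reshape(B, C, N).transpose(1, 2)          # back to tokens

x = torch.randn(2, 14 * 14, 768)
y = ConvPropagation(768)(x, (14, 14))
print(torch.allclose(x, y))  # True: inserting the block doesn't change the pre-trained output
```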
But for global propagation, I'm not sure I understand how it works, because to me global attention is already used by default in this backbone.
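To be explicit about what I mean by "already used by default": my mental picture of every pre-trained ViT block is plain self-attention over all tokens at once, e.g. (toy example, not the ViTDet code):

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

tokens = torch.randn(2, 14 * 14, dim)  # all tokens of the feature map
out, _ = attn(tokens, tokens, tokens)  # every token attends to all 196 tokens ("global")
print(out.shape)                       # torch.Size([2, 196, 768])
```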
Am I misunderstanding something?