Hi folks,
I just read the ViTDet paper, which gives some insight into training a single-resolution backbone for object detection.
There is a small part I don't understand, where it mentions the propagation strategy:
"A small
number of cross-window blocks (e.g., 4), which could be global attention [54] or
convolutions, are used to propagate information. These adaptations are made
only during fine-tuning and do not alter pre-training."
"(i) Global propagation. We perform global self-attention in the last block of
each subset. As the number of global blocks is small, the memory and computation cost is feasible. This is similar to the hybrid window attention in [34] that
was used jointly with FPN.
(ii) Convolutional propagation. As an alternative, we add an extra convolutional block after each subset. A convolutional block is a residual block [27] that
consists of one or more convolutions and an identity shortcut. The last layer in
this block is initialized as zero, such that the initial status of the block is an
identity [22]. Initializing a block as identity allows us to insert it into any place
in a pre-trained backbone without breaking the initial status of the backbone"
From my understanding, if the backbone has 24 layers, propagation happens every 6 layers. However, I'm not sure how it happens.
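If I write down the schedule as I picture it (purely my own sketch, not the actual ViTDet code):

```python
# My reading of the schedule: 24 blocks split into 4 subsets of 6 blocks each,
# with "propagation" attached to the last block of every subset.
NUM_BLOCKS, BLOCKS_PER_SUBSET = 24, 6

propagation_indices = [i for i in range(NUM_BLOCKS) if (i + 1) % BLOCKS_PER_SUBSET == 0]
print(propagation_indices)  # [5, 11, 17, 23] -> blocks 6, 12, 18, 24 propagate information
```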
For convolutional propagation, I suppose the output of each subset is reshaped back to an image-shaped feature map, a convolution is applied, and a residual (ResNet-style) connection is added.
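In PyTorch, my guess at the mechanism would look roughly like this (just a sketch of my understanding, with made-up names, not the actual implementation):

```python
import torch
import torch.nn as nn

class ConvPropagation(nn.Module):
    """My guess: reshape tokens to a 2D feature map, apply a residual conv block
    whose last conv is zero-initialized (so it starts as an identity), reshape back."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv2.weight)  # zero init -> block output is identity at start
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x, hw):
        B, N, C = x.shape
        H, W = hw
        feat = x.transpose(1, 2).reshape(B, C, H, W)          # tokens -> image-shaped map
        feat = feat + self.conv2(self.act(self.conv1(feat)))  # convs + identity shortcut
        return feat.reshape(B, C, N).transpose(1, 2)          # back to tokens

x = torch.randn(2, 14 * 14, 768)
y = ConvPropagation(768)(x, (14, 14))
print(torch.allclose(x, y))  # True: inserting the block doesn't change the pre-trained output
```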
But for global propagation, I'm not sure I understand how it works, because to me global attention is already used by default in this backbone.
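To be explicit about what I mean by "already used by default": my mental picture of every pre-trained ViT block is plain self-attention over all tokens at once, e.g. (toy example, not the ViTDet code):

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

tokens = torch.randn(2, 14 * 14, dim)  # all tokens of the feature map
out, _ = attn(tokens, tokens, tokens)  # every token attends to all 196 tokens ("global")
print(out.shape)                       # torch.Size([2, 196, 768])
```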
Am I misunderstanding something?