- Title: Spatial Transformer Networks
- Authors: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
- Link: https://arxiv.org/abs/1506.02025
- Tags: Neural Network, Attention
- Year: 2015
-
What:
- A differentiable "Spatial Transformer" module that can be plugged into any CNN and learns, from the task loss alone, to spatially transform feature maps (crop, translate, rotate, scale, skew) conditioned on the input itself.
-
How:
- The Spatial Transformer allows spatial manipulation of data (any feature map, in particular the input image). This differentiable module can be inserted into any CNN, giving the network the ability to actively spatially transform feature maps, conditional on the feature map itself.
- The action of the spatial transformer is conditioned on individual data samples, with the appropriate behavior learned during training for the task in question.
- No additional supervision or modification of the optimization process is required.
- Spatial manipulation includes cropping, translation, rotation, scaling, and skew.
- STN structure (a minimal code sketch follows this list):
  - Localization net: predicts the parameters theta of the transform. For the 2D case it's a 2 x 3 matrix; for the 3D case, a 3 x 4 matrix.
  - Grid generator: uses the predictions of the localization net to create a sampling grid, i.e. the set of points where the input map should be sampled to produce the transformed output.
  - Sampler: produces the output map by sampling the input feature map at the predicted grid points.
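A minimal PyTorch sketch of the three parts, assuming a toy localization net (layer sizes and names are illustrative, not the authors' architecture) and using `F.affine_grid` / `F.grid_sample` for the grid generator and sampler:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Illustrative STN block: localization net -> grid generator -> sampler."""
    def __init__(self, in_channels):
        super().__init__()
        # Localization net: any small CNN/MLP that regresses the transform parameters.
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 32), nn.ReLU(),
        )
        self.fc_theta = nn.Linear(32, 6)  # final regression layer: theta is 2 x 3
        # Initialize to regress the identity transform: zero weights, identity bias.
        nn.init.zeros_(self.fc_theta.weight)
        self.fc_theta.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.fc_theta(self.loc(x)).view(-1, 2, 3)
        # Grid generator: sampling grid from theta; sampler: bilinear sampling of x.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

Because every step is differentiable, gradients from the task loss reach theta and the localization net without any extra supervision.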
-
Notes:
- Localization net can predict several transformations (thetas) that are applied to the input image (feature map) in sequence.
- The final regression layer should be initialized to regress the identity transform (zero weights, identity-transform bias).
- Grid generator and transforms:
  - The transformation can have any parameterized form, provided that it is differentiable with respect to its parameters.
  - The most popular choice is a plain 2D affine transform (written out below this list).
  - A special case is an attention mechanism: isotropic scale plus translation.
  - The source/target transformation and sampling are equivalent to the standard texture mapping and coordinates used in graphics.
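For reference (reconstructed from the paper's affine example), the pointwise 2D affine transform maps the regular output grid coordinates $(x_i^t, y_i^t)$ to the source coordinates $(x_i^s, y_i^s)$ at which the input feature map is sampled:

$$
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
  \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
$$

The attention special case restricts $A_\theta$ to an isotropic scale $s$ and a translation $(t_x, t_y)$, which amounts to a differentiable crop-and-zoom:

$$
A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}
$$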
- Sampler:
  - This is the key to why the STN works: a (sub-)differentiable sampling mechanism lets loss gradients flow back not only into the input feature map, but also into the sampling grid coordinates, and therefore back to the transformation parameters theta and the localization net (the bilinear kernel is written out below).
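Concretely, with the bilinear kernel the paper uses, each output value is a weighted sum of input pixels whose weights depend piecewise-linearly on the sampled coordinates, so (sub-)gradients with respect to $x_i^s$, $y_i^s$ exist:

$$
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^{c} \, \max(0,\, 1 - |x_i^s - m|) \, \max(0,\, 1 - |y_i^s - n|)
$$

where $U$ is the input feature map, $V$ the output map, $c$ indexes channels, and $i$ indexes the output grid points.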
-
Results: