
Commit

grouped conv
astonzhang committed Aug 22, 2022
1 parent 6dbc254 commit 55db13d
Showing 1 changed file with 5 additions and 5 deletions: chapter_convolutional-modern/resnet.md
@@ -389,16 +389,16 @@ Different from the smorgasbord of transformations in Inception,
ResNeXt adopts the *same* transformation in all branches,
thus minimizing the need for manual tuning of each branch.

-![The ResNeXt block. The use of group convolution with $g$ groups is $g$ times faster than a dense convolution. It is a bottleneck residual block when the number of intermediate channels $b$ is less than $c$.](../img/resnext-block.svg)
+![The ResNeXt block. The use of grouped convolution with $g$ groups is $g$ times faster than a dense convolution. It is a bottleneck residual block when the number of intermediate channels $b$ is less than $c$.](../img/resnext-block.svg)
:label:`fig_resnext_block`

-Breaking up a convolution from $c$ to $b$ channels into $g$ groups of size $c/g$, each generating an output of size $b/g$, is called, quite fittingly, a *group convolution*. The computational cost is reduced from $\mathcal{O}(c \cdot b)$ to $\mathcal{O}(g \cdot (c/g) \cdot (b/g)) = \mathcal{O}(c \cdot b / g)$, i.e., it is $g$ times faster. Even better, the number of parameters needed to generate the output is also reduced from a $c \times b$ matrix to $g$ smaller matrices of size $(c/g) \times (b/g)$, again a reduction by a factor of $g$. In what follows we assume that both $b$ and $c$ are divisible by $g$.
+Breaking up a convolution from $c$ to $b$ channels into $g$ groups of size $c/g$, each generating an output of size $b/g$, is called, quite fittingly, a *grouped convolution*. The computational cost is reduced from $\mathcal{O}(c \cdot b)$ to $\mathcal{O}(g \cdot (c/g) \cdot (b/g)) = \mathcal{O}(c \cdot b / g)$, i.e., it is $g$ times faster. Even better, the number of parameters needed to generate the output is also reduced from a $c \times b$ matrix to $g$ smaller matrices of size $(c/g) \times (b/g)$, again a reduction by a factor of $g$. In what follows we assume that both $b$ and $c$ are divisible by $g$.
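
To see the parameter saving in code, here is a minimal sketch (assuming PyTorch; the sizes `c`, `b`, and `g` follow the notation above, but the concrete values are arbitrary) comparing a dense $3 \times 3$ convolution with a grouped one:

```python
# A minimal sketch (assuming PyTorch): compare a dense 3x3 convolution with a
# grouped one. The sizes c, b, g mirror the notation in the text but are arbitrary.
import torch
from torch import nn

c, b, g = 64, 32, 4            # input channels, output channels, groups
dense   = nn.Conv2d(c, b, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(c, b, kernel_size=3, padding=1, groups=g, bias=False)

# Dense weight: (b, c, 3, 3); grouped weight: (b, c/g, 3, 3), i.e., g times fewer parameters.
print(dense.weight.numel(), grouped.weight.numel())   # 18432 4608

X = torch.randn(1, c, 8, 8)
print(dense(X).shape, grouped(X).shape)               # both torch.Size([1, 32, 8, 8])
```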

The only challenge in this design is that no information is exchanged between the $g$ groups. The ResNeXt block of
-:numref:`fig_resnext_block` amends this in two ways: the group convolution with a $3 \times 3$ kernel is sandwiched in between two $1 \times 1$ convolutions. The second one serves double duty in changing the dimensionality from $b$ to $c$ again. The benefit is that we only pay the $\mathcal{O}(c \times b)$ cost for $1 \times 1$ kernels and can make do with an $\mathcal{O}(c \times b / g)$ cost for $3 \times 3$ kernels. Similar to the residual block implementation in
+:numref:`fig_resnext_block` amends this in two ways: the grouped convolution with a $3 \times 3$ kernel is sandwiched in between two $1 \times 1$ convolutions. The second one serves double duty in changing the dimensionality from $b$ to $c$ again. The benefit is that we only pay the $\mathcal{O}(c \times b)$ cost for $1 \times 1$ kernels and can make do with an $\mathcal{O}(c \times b / g)$ cost for $3 \times 3$ kernels. Similar to the residual block implementation in
:numref:`subsec_residual-blks`, the residual connection is replaced (thus generalized) by a $1 \times 1$ convolution.
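
To make the cost argument concrete, here is a rough back-of-the-envelope count (the channel sizes below are made up for illustration; counts are per output pixel and ignore biases):

```python
# Illustrative only: compare a plain 3x3 convolution keeping c channels with the
# 1x1 -> grouped 3x3 -> 1x1 sandwich described above (hypothetical sizes).
c, b, g = 256, 128, 32
dense_3x3 = 9 * c * c                 # plain 3x3 convolution, c -> c channels
sandwich  = c * b                     # 1x1 convolution, c -> b
sandwich += 9 * b * (b // g)          # grouped 3x3 convolution, b -> b with g groups
sandwich += b * c                     # 1x1 convolution, b -> c
print(dense_3x3, sandwich)            # 589824 vs. 70144, roughly an 8x saving
```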

-The right figure in :numref:`fig_resnext_block` provides a much more concise summary of the resulting network. It will also play a major role in the design of generic modern CNNs in :numref:`sec_cnn-design`. Note that the idea of group convolutions dates back to the implementation of AlexNet :cite:`Krizhevsky.Sutskever.Hinton.2012`. When distributing the network across two GPUs with limited memory, the implementation treated each GPU as its own channel with no ill effects.
+The right figure in :numref:`fig_resnext_block` provides a much more concise summary of the resulting network. It will also play a major role in the design of generic modern CNNs in :numref:`sec_cnn-design`. Note that the idea of grouped convolutions dates back to the implementation of AlexNet :cite:`Krizhevsky.Sutskever.Hinton.2012`. When distributing the network across two GPUs with limited memory, the implementation treated each GPU as its own channel with no ill effects.

The following implementation of the `ResNeXtBlock` class takes as argument `groups` $g$, with
`bot_channels` $b$ intermediate (bottleneck) channels. Lastly, when we need to reduce the height and width of the representation, we add a stride of $2$ by setting `use_1x1conv=True, strides=2`.
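
The book's actual implementation is not shown in this diff hunk; as a quick orientation, a minimal PyTorch-style sketch of such a block might look as follows. The argument names `groups`, `bot_channels`, `use_1x1conv`, and `strides` come from the text, while `in_channels`, `out_channels`, and the exact batch normalization placement are assumptions rather than the book's code.

```python
import torch
from torch import nn
from torch.nn import functional as F

class ResNeXtBlock(nn.Module):
    """Sketch: 1x1 conv -> grouped 3x3 conv -> 1x1 conv, plus a residual branch."""
    def __init__(self, in_channels, out_channels, bot_channels, groups,
                 use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, bot_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(bot_channels, bot_channels, kernel_size=3,
                               stride=strides, padding=1, groups=groups)
        self.conv3 = nn.Conv2d(bot_channels, out_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(bot_channels)
        self.bn2 = nn.BatchNorm2d(bot_channels)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # When the resolution or channel count changes, project the residual
        # with a 1x1 convolution (the generalized residual connection above).
        self.proj = (nn.Conv2d(in_channels, out_channels, kernel_size=1,
                               stride=strides) if use_1x1conv else None)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.proj is not None:
            X = self.proj(X)
        return F.relu(Y + X)

blk = ResNeXtBlock(in_channels=32, out_channels=32, bot_channels=16, groups=4)
print(blk(torch.randn(4, 32, 96, 96)).shape)  # torch.Size([4, 32, 96, 96])
```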
@@ -534,7 +534,7 @@ adopts residual connections (together with other design choices) and is pervasive
in areas as diverse as
language, vision, speech, and reinforcement learning.

-ResNeXt is an example of how the design of convolutional neural networks has evolved over time: by being more frugal with computation and trading it off with the size of the activations (number of channels), it allows for faster and more accurate networks at lower cost. An alternative way of viewing group convolutions is to think of a block-diagonal matrix for the convolutional weights. Note that there are quite a few such 'tricks' that lead to more efficient networks. For instance, ShiftNet :cite:`wu2018shift` mimics the effects of a $3 \times 3$ convolution, simply by adding shifted activations to the channels, offering increased function complexity, this time without any computational cost.
+ResNeXt is an example of how the design of convolutional neural networks has evolved over time: by being more frugal with computation and trading it off with the size of the activations (number of channels), it allows for faster and more accurate networks at lower cost. An alternative way of viewing grouped convolutions is to think of a block-diagonal matrix for the convolutional weights. Note that there are quite a few such 'tricks' that lead to more efficient networks. For instance, ShiftNet :cite:`wu2018shift` mimics the effects of a $3 \times 3$ convolution, simply by adding shifted activations to the channels, offering increased function complexity, this time without any computational cost.
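
The block-diagonal view can be checked directly. The sketch below (assuming PyTorch; the sizes are arbitrary) copies the weights of a grouped $1 \times 1$ convolution onto the diagonal blocks of a dense one and confirms that the two produce the same output:

```python
import torch
from torch import nn

c, b, g = 8, 8, 4
grouped = nn.Conv2d(c, b, kernel_size=1, groups=g, bias=False)
dense = nn.Conv2d(c, b, kernel_size=1, bias=False)

with torch.no_grad():
    dense.weight.zero_()
    for i in range(g):
        # Group i maps input channels [i*c/g, (i+1)*c/g) to output channels
        # [i*b/g, (i+1)*b/g): one block on the diagonal of the weight matrix.
        dense.weight[i*b//g:(i+1)*b//g, i*c//g:(i+1)*c//g] = \
            grouped.weight[i*b//g:(i+1)*b//g]

X = torch.randn(2, c, 5, 5)
print(torch.allclose(grouped(X), dense(X), atol=1e-6))  # expected: True
```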

A common feature of the designs we've discussed so far is that the network design is fairly manual, primarily relying on the ingenuity of the designer to find the 'right' network parameters. While clearly feasible, it is also very costly in terms of human time and there's no guarantee that the outcome is optimal in any sense. In Chapter :ref:`sec_cnn-design` we will discuss a number of strategies for obtaining high quality networks in a more automated fashion. In particular, we will review the notion of Network Design Spaces that led to the RegNetX/Y models
:cite:`Radosavovic.Kosaraju.Girshick.ea.2020` and the ConvNeXt architecture :cite:`liu2022convnet`.