From 55db13d06db4be402603c2afe7a5f192a6fc70b7 Mon Sep 17 00:00:00 2001
From: Aston Zhang
Date: Mon, 22 Aug 2022 02:03:39 +0000
Subject: [PATCH] grouped conv

---
 chapter_convolutional-modern/resnet.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/chapter_convolutional-modern/resnet.md b/chapter_convolutional-modern/resnet.md
index ca34e89f03..c84c0595a7 100644
--- a/chapter_convolutional-modern/resnet.md
+++ b/chapter_convolutional-modern/resnet.md
@@ -389,16 +389,16 @@ Different from the smorgasbord of transformations in Inception,
 ResNeXt adopts the *same* transformation in all branches,
 thus minimizing the need for manual tuning of each branch.
 
-![The ResNeXt block. The use of group convolution with $g$ groups is $g$ times faster than a dense convolution. It is a bottleneck residual block when the number of intermediate channels $b$ is less than $c$.](../img/resnext-block.svg)
+![The ResNeXt block. The use of grouped convolution with $g$ groups is $g$ times faster than a dense convolution. It is a bottleneck residual block when the number of intermediate channels $b$ is less than $c$.](../img/resnext-block.svg)
 :label:`fig_resnext_block`
 
-Breaking up a convolution from $c$ to $b$ channels into one of $g$ groups of size $c/g$ generating $g$ outputs of size $b/g$ is called, quite fittingly, a *group convolution*. The computational cost is reduced from $\mathcal{O}(c \cdot b)$ to $\mathcal{O}(g \cdot (c/g) \cdot (b/g)) = \mathcal{O}(c \cdot b / g)$, i.e. it is $g$ times faster. Even better, the number of parameters needed to generate the output is also reduced from a $c \times b$ matrix to $g$ smaller matrices of size $(c/g) \times (b/g)$, again a $g$ times reduction. In what follows we assume that both $b$ and $c$ are divisible by $g$.
+Breaking up a convolution from $c$ to $b$ channels into one of $g$ groups of size $c/g$ generating $g$ outputs of size $b/g$ is called, quite fittingly, a *grouped convolution*. The computational cost is reduced from $\mathcal{O}(c \cdot b)$ to $\mathcal{O}(g \cdot (c/g) \cdot (b/g)) = \mathcal{O}(c \cdot b / g)$, i.e. it is $g$ times faster. Even better, the number of parameters needed to generate the output is also reduced from a $c \times b$ matrix to $g$ smaller matrices of size $(c/g) \times (b/g)$, again a $g$ times reduction. In what follows we assume that both $b$ and $c$ are divisible by $g$.
 
 The only challenge in this design is that no information is exchanged between the $g$ groups. The ResNeXt block of
-:numref:`fig_resnext_block` amends this in two ways: the group convolution with a $3 \times 3$ kernel is sandwiched in between two $1 \times 1$ convolutions. The second one serves double duty in changing the dimensionality from $b$ to $c$ again. The benefit is that we only pay the $\mathcal{O}(c \times b)$ cost for $1 \times 1$ kernels and can make do with an $\mathcal{O}(c \times b / g)$ cost for $3 \times 3$ kernels. Similar to the residual block implementation in
+:numref:`fig_resnext_block` amends this in two ways: the grouped convolution with a $3 \times 3$ kernel is sandwiched in between two $1 \times 1$ convolutions. The second one serves double duty in changing the dimensionality from $b$ to $c$ again. The benefit is that we only pay the $\mathcal{O}(c \times b)$ cost for $1 \times 1$ kernels and can make do with an $\mathcal{O}(c \times b / g)$ cost for $3 \times 3$ kernels. Similar to the residual block implementation in
 :numref:`subsec_residual-blks`, the residual connection is replaced (thus generalized) by a $1 \times 1$ convolution.
 
-The right figure in :numref:`fig_resnext_block` provides a much more concise summary of the resulting network. It will also play a major role in the design of generic modern CNNs in :numref:`sec_cnn-design`. Note that the idea of group convolutions dates back to the implementation of AlexNet :cite:`Krizhevsky.Sutskever.Hinton.2012`. When distributing the network across two GPUs with limited memory, the implementation treated each GPU as its own channel with no ill effects.
+The right figure in :numref:`fig_resnext_block` provides a much more concise summary of the resulting network. It will also play a major role in the design of generic modern CNNs in :numref:`sec_cnn-design`. Note that the idea of grouped convolutions dates back to the implementation of AlexNet :cite:`Krizhevsky.Sutskever.Hinton.2012`. When distributing the network across two GPUs with limited memory, the implementation treated each GPU as its own channel with no ill effects.
 
 The following implementation of the `ResNeXtBlock` class takes as argument `groups` $g$, with `bot_channels` $b$ intermediate (bottleneck) channels.
 Lastly, when we need to reduce the height and width of the representation, we add a stride of $2$ by setting `use_1x1conv=True, strides=2`.
@@ -534,7 +534,7 @@ adopts residual connections (together with other design choices) and is pervasiv
 in areas as diverse as language, vision, speech, and reinforcement learning.
 
-ResNeXt is an example for how the design of convolutional neural networks has evolved over time: by being more frugal with computation and trading it off with the size of the activations (number of channels), it allows for faster and more accurate networks at lower cost. An alternative way of viewing group convolutions is to think of a block-diagonal matrix for the convolutional weights. Note that there are quite a few such 'tricks' that lead to more efficient networks. For instance, ShiftNet :cite:`wu2018shift` mimicks the effects of a $3 \times 3$ convolution, simply by adding shifted activations to the channels, offering increased function complexity, this time without any computational cost.
+ResNeXt is an example for how the design of convolutional neural networks has evolved over time: by being more frugal with computation and trading it off with the size of the activations (number of channels), it allows for faster and more accurate networks at lower cost. An alternative way of viewing grouped convolutions is to think of a block-diagonal matrix for the convolutional weights. Note that there are quite a few such 'tricks' that lead to more efficient networks. For instance, ShiftNet :cite:`wu2018shift` mimicks the effects of a $3 \times 3$ convolution, simply by adding shifted activations to the channels, offering increased function complexity, this time without any computational cost.
 
 A common feature of the designs we've discussed so far is that the network design is fairly manual, primarily relying on the ingenuity of the designer to find the 'right' network parameters. While clearly feasible, it is also very costly in terms of human time and there's no guarantee that the outcome is optimal in any sense. In Chapter :ref:`sec_cnn-design` we will discuss a number of strategies for obtaining high quality networks in a more automated fashion.
 In particular, we will review the notion of Network Design Spaces that led to the RegNetX/Y models :cite:`Radosavovic.Kosaraju.Girshick.ea.2020` and the ConvNeXt architecture :cite:`liu2022convnet`.
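
The text being patched argues that a grouped convolution with $g$ groups cuts both the parameter count and the computational cost of the $3 \times 3$ kernel by a factor of $g$ while producing an output of the same shape. Below is a minimal sketch of that comparison, assuming PyTorch (no framework code appears in this patch); the channel counts `c`, `b` and the group count `g` are illustrative values, not ones taken from the book.

```python
import torch
from torch import nn

c, b, g = 64, 128, 16  # input channels, output channels, number of groups

dense   = nn.Conv2d(c, b, kernel_size=3, padding=1)
grouped = nn.Conv2d(c, b, kernel_size=3, padding=1, groups=g)

X = torch.randn(1, c, 32, 32)
print(dense(X).shape == grouped(X).shape)             # True: same output shape
# Kernel size drops from c*b*9 to g*(c/g)*(b/g)*9 entries, i.e. by a factor of g.
print(dense.weight.numel() / grouped.weight.numel())  # 16.0, exactly g
```

The ratio of kernel entries is exactly $g$; counting the bias terms as well makes the overall reduction slightly smaller than $g$.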
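
The patched paragraphs also refer to a `ResNeXtBlock` class that takes `groups` $g$ and `bot_channels` $b$ as arguments and uses `use_1x1conv=True, strides=2` for downsampling, but the class body lies outside the diff context. The following is a hedged sketch of what such a block might look like given the description: a $1 \times 1$ convolution down to `bot_channels`, a grouped $3 \times 3$ convolution, a $1 \times 1$ convolution back up, and an optional $1 \times 1$ convolution on the residual branch. It assumes PyTorch and an explicit `in_channels` argument; the actual implementation in `resnet.md` may differ.

```python
import torch
from torch import nn
from torch.nn import functional as F

class ResNeXtBlock(nn.Module):
    """Sketch of a ResNeXt block: 1x1 conv -> grouped 3x3 conv -> 1x1 conv,
    with a residual connection that becomes a 1x1 conv when shapes change."""
    def __init__(self, in_channels, out_channels, bot_channels, groups,
                 use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, bot_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(bot_channels, bot_channels, kernel_size=3,
                               stride=strides, padding=1, groups=groups)
        self.conv3 = nn.Conv2d(bot_channels, out_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(bot_channels)
        self.bn2 = nn.BatchNorm2d(bot_channels)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # Generalized residual connection: a 1x1 conv when width or resolution changes.
        self.conv4 = (nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                stride=strides) if use_1x1conv else None)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))  # reduce to b channels
        Y = F.relu(self.bn2(self.conv2(Y)))  # grouped 3x3, roughly g times cheaper
        Y = self.bn3(self.conv3(Y))          # expand back to c channels
        if self.conv4 is not None:
            X = self.conv4(X)
        return F.relu(Y + X)

blk = ResNeXtBlock(in_channels=32, out_channels=32, bot_channels=16, groups=4)
print(blk(torch.randn(4, 32, 96, 96)).shape)  # torch.Size([4, 32, 96, 96])
```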
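
Finally, the summary paragraph in the second hunk suggests viewing a grouped convolution as a block-diagonal matrix of convolutional weights. For a $1 \times 1$ kernel this is easy to verify numerically: copying the per-group weights onto the diagonal blocks of a dense channel-mixing matrix reproduces the grouped convolution exactly. Again a sketch under the assumption of PyTorch, with arbitrary small channel counts.

```python
import torch
from torch import nn

c, b, g = 8, 8, 4  # input channels, output channels, groups (both divisible by g)
conv = nn.Conv2d(c, b, kernel_size=1, groups=g, bias=False)

# Assemble the equivalent block-diagonal channel-mixing matrix W of shape (b, c):
# output group i only sees input group i, so all off-diagonal blocks stay zero.
W = torch.zeros(b, c)
for i in range(g):
    rows = slice(i * (b // g), (i + 1) * (b // g))
    cols = slice(i * (c // g), (i + 1) * (c // g))
    W[rows, cols] = conv.weight.detach()[rows, :, 0, 0]

X = torch.randn(2, c, 5, 5)
dense_out = torch.einsum('oc,nchw->nohw', W, X)       # mix channels by W at every pixel
print(torch.allclose(conv(X), dense_out, atol=1e-5))  # True
```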