From 2a9f02789e683fa0a77fe5f72250093fe65cb416 Mon Sep 17 00:00:00 2001 From: Ishaan <103370340+0xD4rky@users.noreply.github.com> Date: Mon, 30 Sep 2024 01:33:10 +0530 Subject: [PATCH 1/4] Update resnet.mdx --- chapters/en/unit2/cnns/resnet.mdx | 44 ++++++++++++++++++++++++++++--- 1 file changed, 40 insertions(+), 4 deletions(-) diff --git a/chapters/en/unit2/cnns/resnet.mdx b/chapters/en/unit2/cnns/resnet.mdx index 4b53d792f..7d18ca7cb 100644 --- a/chapters/en/unit2/cnns/resnet.mdx +++ b/chapters/en/unit2/cnns/resnet.mdx @@ -22,14 +22,50 @@ ResNet’s residual connections unlocked the potential of the extreme depth, pro - A Residual Block. Source: ResNet Paper ![residual](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/ResnetBlock.png) -ResNet’s building blocks designed as identity functions, preserve input information while enabling learning. This approach ensures efficient weight optimization and prevents degradation as the network becomes deeper. +ResNets introduce a concept called residual learning, which allows the network to learn the residuals (i.e., the difference between the learned representation and the input), instead of trying to directly map inputs to outputs. This is achieved through skip connections (or shortcut connections). -The building block of ResNet can be shown in the picture, source ResNet paper. + +Let's break this down: + +#### 1. Basic Building Block: Residual Block + +In a typical neural network layer, we aim to learn a mapping F(x), where x is the input and F is the transformation the network applies. Without residual learning, the transformation at a layer is: y = F(x). In ResNets, instead of learning F(x) directly, the network is designed to learn the residual R(x), where: R(x) = F(x) − x. Thus, the transformation in a residual block is written as: y = F(x) + x. + +Here, x is the input to the residual block, and F(x) is the output of the block's stacked layers (usually convolutions, batch normalization, and ReLU). The identity mappingx is directly added to the output of F(x) (through the skip connection). So the block is learning the residual R(x) = F(x), and the final output is F(x) + x. This residual function is easier to optimize compared to the original mapping. If the optimal transformation is close to the identity (i.e., no transformation is needed), the network can easily set F(x) ≈ 0 to pass through the input as it is. + + + +* The building block of ResNet is shown in the picture below (source: ResNet paper). ![resnet_building_block](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/ResnetBlock.png) -Shortcut connections perform identity mapping and their output is added to the output of the stacked layers. Identity shortcut connections add neither extra parameters nor + +#### 2. Learning Residuals and Gradient Flow + +The primary reason ResNets work well for very deep networks is improved gradient flow during backpropagation. Let’s analyze the backward pass mathematically. Consider the residual block y = F(x) + x. When calculating gradients via the chain rule, we compute the derivative of the loss L with respect to the input x. For a standard block (without residuals), the gradient is: + +`∂L/∂x = ∂L/∂F(x) * ∂F(x)/∂x` + +However, for the residual block, where y = F(x) + x, the gradient becomes: + +`∂L/∂x = ∂L/∂F(x) * ∂F(x)/∂x + ∂L/∂F(x) * 1` + +Notice that we now have an additional term 1 in the gradient calculation.
This means that the gradient at each layer has a direct path back to earlier layers, improving gradient flow and reducing the chance of vanishing gradients. The gradients flow more easily through the network, allowing deeper networks to train without degradation. + + + + +#### 3. Why Deeper Networks are Now Possible: + +ResNets made it feasible to train networks with hundreds or even thousands of layers. Here’s why deeper networks benefit from this: + +* Identity Shortcut Connections: Shortcut connections perform identity mapping and their output is added to the output of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity, these connections bypass layers, creating direct paths for information flow, and they enable neural networks to learn the residual function (F). +* Better Gradient Flow: As explained earlier, the residuals help gradients propagate better during backpropagation, addressing the vanishing gradient problem in very deep networks. +* Easier to Optimize: By learning residuals, the network is essentially breaking down the learning process into easier, incremental steps. It’s easier for the network to learn the residual R(x) = F(x) − x than it is to learn F(x) directly, especially in very deep networks. + + + We can summarize ResNet Network -> Plain Network + Shortcuts! @@ -87,4 +123,4 @@ print(model.config.id2label[predicted_label]) - [ResNet: Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) - [Resnet Architecture Source: ResNet Paper](https://arxiv.org/abs/1512.03385) -- [Hugging Face Documentation on ResNet](https://huggingface.co/docs/transformers/en/model_doc/resnet) \ No newline at end of file +- [Hugging Face Documentation on ResNet](https://huggingface.co/docs/transformers/en/model_doc/resnet) From e2f03e85370ae79fbcd8fee0a7687ebe97ed3e52 Mon Sep 17 00:00:00 2001 From: Ishaan <103370340+0xD4rky@users.noreply.github.com> Date: Mon, 30 Sep 2024 01:40:29 +0530 Subject: [PATCH 2/4] Update resnet.mdx --- chapters/en/unit2/cnns/resnet.mdx | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/chapters/en/unit2/cnns/resnet.mdx b/chapters/en/unit2/cnns/resnet.mdx index 7d18ca7cb..f95993c0a 100644 --- a/chapters/en/unit2/cnns/resnet.mdx +++ b/chapters/en/unit2/cnns/resnet.mdx @@ -12,6 +12,8 @@ An obstacle to answering this question, the gradient vanishing problem, was addr However, a new issue emerged: the degradation problem. As the neural networks became deeper, accuracy saturated and degraded rapidly. An experiment comparing shallow and deep plain networks revealed that deeper models exhibited higher training and test errors, suggesting a fundamental challenge in training deeper architectures effectively. This degradation was not because of overfitting but because the training error increased when the network became deeper. The added layers did not approximate the identity function. +Degradation-error + ResNet’s residual connections unlocked the potential of the extreme depth, propelling the accuracy upwards compared to the previous architectures. @@ -65,9 +67,9 @@ computational complexity, these connections bypass layers, creating direct paths * Easier to Optimize: By learning residuals, the network is essentially breaking down the learning process into easier, incremental steps. It’s easier for the network to learn the residual R(x) = F(x) − x than it is to learn F(x) directly, especially in very deep networks. 
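To make the residual block concrete before summarizing, here is a minimal PyTorch sketch of the y = F(x) + x computation described above. It is only an illustration, not the exact block from the ResNet paper or from any particular library implementation; the `BasicResidualBlock` name, the channel count, and the two-convolution form chosen for F(x) are assumptions made just for this example.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """A minimal residual block computing y = F(x) + x with an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two 3x3 convolutions with batch normalization and a ReLU in between.
        # Input and output keep the same shape, so the identity shortcut can be added directly.
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection adds the input x (identity mapping) to the residual F(x).
        # If the best transformation is close to the identity, the block only has to
        # push F(x) toward zero, which is easier than learning the full mapping.
        return self.relu(self.f(x) + x)

block = BasicResidualBlock(channels=64)
x = torch.randn(1, 64, 56, 56)   # (batch, channels, height, width)
y = block(x)
print(y.shape)                   # torch.Size([1, 64, 56, 56]) -- same shape as the input
```

Because the shortcut is a plain addition of the input, it adds no parameters and essentially no computation, which is exactly the identity shortcut property listed above.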
+### Summarizing: - -We can summarize ResNet Network -> Plain Network + Shortcuts! +We can conclude that ResNet Network -> Plain Network + Shortcuts! For operations \(F(x) + x\), \(F(x)\) and \(x\) should have identical dimensions. ResNet employs two techniques to achieve this: From 46f7c0a63d221091e2ce0f9a6efd48abf8cf10ca Mon Sep 17 00:00:00 2001 From: Ishaan <103370340+0xD4rky@users.noreply.github.com> Date: Tue, 1 Oct 2024 08:59:18 +0530 Subject: [PATCH 3/4] Update chapters/en/unit2/cnns/resnet.mdx Co-authored-by: A Taylor <112668339+ATaylorAerospace@users.noreply.github.com> --- chapters/en/unit2/cnns/resnet.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit2/cnns/resnet.mdx b/chapters/en/unit2/cnns/resnet.mdx index f95993c0a..ffa88ed75 100644 --- a/chapters/en/unit2/cnns/resnet.mdx +++ b/chapters/en/unit2/cnns/resnet.mdx @@ -33,7 +33,7 @@ Let's break this down: In a typical neural network layer, we aim to learn a mapping F(x), where x is the input and F is the transformation the network applies. Without residual learning, the transformation at a layer is: y = F(x). In ResNets, instead of learning F(x) directly, the network is designed to learn the residual R(x), where: R(x) = F(x) − x. Thus, the transformation in a residual block is written as: y = F(x) + x. -Here, x is the input to the residual block, and F(x) is the output of the block's stacked layers (usually convolutions, batch normalization, and ReLU). The identity mappingx is directly added to the output of F(x) (through the skip connection). So the block is learning the residual R(x) = F(x), and the final output is F(x) + x. This residual function is easier to optimize compared to the original mapping. If the optimal transformation is close to the identity (i.e., no transformation is needed), the network can easily set F(x) ≈ 0 to pass through the input as it is. +Here, x is the input to the residual block, and F(x) is the output of the block's stacked layers (usually convolutions, batch normalization, and ReLU). The identity mapping x is directly added to the output of F(x) (through the skip connection). So the block is learning the residual R(x) = F(x), and the final output is F(x) + x. This residual function is easier to optimize compared to the original mapping. If the optimal transformation is close to the identity (i.e., no transformation is needed), the network can easily set F(x) ≈ 0 to pass through the input as it is. From d2493074196688622ea481e957a5ea8b3a7f55a3 Mon Sep 17 00:00:00 2001 From: Ishaan <103370340+0xD4rky@users.noreply.github.com> Date: Thu, 24 Oct 2024 11:16:45 +0530 Subject: [PATCH 4/4] Update chapters/en/unit2/cnns/resnet.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- chapters/en/unit2/cnns/resnet.mdx | 3 --- 1 file changed, 3 deletions(-) diff --git a/chapters/en/unit2/cnns/resnet.mdx b/chapters/en/unit2/cnns/resnet.mdx index ffa88ed75..90e6302a3 100644 --- a/chapters/en/unit2/cnns/resnet.mdx +++ b/chapters/en/unit2/cnns/resnet.mdx @@ -54,9 +54,6 @@ However, for the residual block, where y = F(x) + x, the gradient becomes: Notice that we now have an additional term 1 in the gradient calculation. This means that the gradient at each layer has a direct path back to earlier layers, improving gradient flow and reducing the chance of vanishing gradients. The gradients flow more easily through the network, allowing deeper networks to train without degradation. - - - #### 3. 
Why Deeper Networks are Now Possible: ResNets made it feasible to train networks with hundreds or even thousands of layers. Here’s why deeper networks benefit from this:
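As a rough, self-contained illustration of the gradient-flow benefit behind these points, the sketch below (PyTorch again; the `plain_forward` and `residual_forward` helpers are hypothetical, written only for this comparison) stacks the same small F(x) blocks fifty times, once without and once with the identity shortcut, and prints the average gradient magnitude that reaches the input. The exact numbers depend on the random initialization; the point is only that the residual version keeps a far stronger gradient signal at the input, which is what makes very deep stacks trainable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, dim = 50, 32
# The same stack of small F(x) blocks (Linear + Tanh) is reused in both settings.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
)

def plain_forward(x):
    # Plain network: y = F(x) at every layer, so the gradient must pass
    # through every transformation on its way back to the input.
    for f in blocks:
        x = f(x)
    return x

def residual_forward(x):
    # Residual network: y = F(x) + x at every layer, so the "+ x" term
    # gives the gradient a direct path back to earlier layers.
    for f in blocks:
        x = f(x) + x
    return x

for name, forward in [("plain", plain_forward), ("residual", residual_forward)]:
    x = torch.randn(8, dim, requires_grad=True)
    forward(x).sum().backward()
    print(f"{name:8s} | mean |dL/dx| at the input: {x.grad.abs().mean().item():.2e}")
```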