Skip Connection and Explanation of ResNet
Deep Residual Learning for Image Recognition
1. Explanation of Skip Connection
1.1. The update rule and vanishing gradient problem.
The update rule of gradient descent without momentum:
$$w_{i+1} = w_i + \Delta w_i$$
where
$$\Delta w_i = -\lambda \frac{\partial L}{\partial w_i}$$
and $L$ is the loss function and $\lambda$ the learning rate.
Basically, we update the parameters by changing them by a small amount $\Delta w_i$, which is calculated from the gradient.
The loss function is a quantitative measure of the distance between the network's output and the desired output. Our goal is to minimize the loss function until it stops decreasing or some predefined criterion is met.
To minimize the loss function, we use backpropagation: the gradient of the loss with respect to each parameter is computed layer by layer, and the parameters are updated accordingly.
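To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on a toy squared-error loss; the loss, data, and learning rate are illustrative choices, not part of the original text.

```python
# Sketch of the update rule w_{i+1} = w_i + Δw_i with Δw_i = -λ ∂L/∂w_i,
# applied to a toy loss L(w) = 0.5 * (w·x - y)^2.
import numpy as np

def toy_loss_grad(w, x, y):
    """Gradient of L(w) = 0.5 * (w·x - y)^2 with respect to w."""
    return (w @ x - y) * x

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # a single input vector (illustrative)
y = 1.5                         # its target value
w = np.zeros(3)                 # parameters to learn
lr = 0.1                        # λ, the learning rate

for step in range(100):
    grad = toy_loss_grad(w, x, y)
    delta_w = -lr * grad        # Δw_i = -λ ∂L/∂w_i
    w = w + delta_w             # w_{i+1} = w_i + Δw_i

print("final loss:", 0.5 * (w @ x - y) ** 2)  # close to 0 after training
```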
1.2. Chain rule.
The chain rule describes how the gradient of a loss function $Z$ changes with respect to intermediate neural-network variables (e.g. $X$ and $Y$), which are themselves functions of earlier layers. Let $f$, $g$, $h$ be layers that perform non-linear operations on the input vector $x$, so that
$$X = h(x), \quad Y = g(X), \quad Z = f(Y).$$
We want to express the gradient of $Z$ with respect to the input:
$$\frac{\partial Z}{\partial x} = \frac{\partial Z}{\partial Y}\cdot\frac{\partial Y}{\partial X}\cdot\frac{\partial X}{\partial x}$$
Note: the partial derivatives in this product are often less than 1, so as we go backward through the network the gradient becomes smaller and smaller.
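As a small illustration (not from the original text), the sketch below applies PyTorch autograd to a chain of three sigmoid layers, whose derivatives are at most 0.25, and compares the automatic gradient with the explicit product of partial derivatives:

```python
# Chain rule demo: Z = f(g(h(x))) with f, g, h = sigmoid, so dZ/dx is a
# product of factors that are all smaller than 1.
import torch

x = torch.tensor(0.5, requires_grad=True)

h_out = torch.sigmoid(x)        # X = h(x)
g_out = torch.sigmoid(h_out)    # Y = g(X)
z = torch.sigmoid(g_out)        # Z = f(Y)

z.backward()
print("dZ/dx from autograd :", x.grad.item())

# The same value computed as the explicit product dZ/dY * dY/dX * dX/dx
# (the derivative of sigmoid(s) is s * (1 - s)).
dZ_dY = z * (1 - z)
dY_dX = g_out * (1 - g_out)
dX_dx = h_out * (1 - h_out)
print("product of partials :", (dZ_dY * dY_dX * dX_dx).item())
```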
1.3. Skip Connection.
Skip connections are a standard module in many convolutional architectures. They provide an alternative path for the gradient during backpropagation, and these additional paths are beneficial for model convergence.
A skip connection skips one or more layers in the neural network and feeds the output of one layer as input to later layers (instead of only the immediately next one).
When using the chain rule, we keep multiplying terms with the error gradient as we go backward. In a long chain of multiplications, if many of the factors are less than 1, the resulting gradient becomes very small. Therefore the gradient shrinks as we approach the earlier layers of a deep architecture, and in some cases it becomes effectively zero, meaning the early layers are not updated at all.
There are two fundamental ways of using skip connections across non-sequential layers, both sketched in the example after this list:
- Addition, as in residual architectures (ResNet).
- Concatenation, as in densely connected architectures (DenseNet).
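As an illustration (not part of the original text), the sketch below shows both styles with minimal PyTorch modules; the class names AdditiveSkip and ConcatSkip and the channel counts are made up for this example.

```python
import torch
import torch.nn as nn

class AdditiveSkip(nn.Module):
    """y = F(x) + x : shapes must match, so the output has the same shape as x."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
    def forward(self, x):
        return torch.relu(self.conv(x) + x)

class ConcatSkip(nn.Module):
    """y = [F(x), x] : feature maps are stacked along the channel dimension."""
    def __init__(self, channels, growth):
        super().__init__()
        self.conv = nn.Conv2d(channels, growth, kernel_size=3, padding=1)
    def forward(self, x):
        return torch.cat([torch.relu(self.conv(x)), x], dim=1)

x = torch.randn(1, 64, 56, 56)
print(AdditiveSkip(64)(x).shape)    # torch.Size([1, 64, 56, 56])
print(ConcatSkip(64, 32)(x).shape)  # torch.Size([1, 96, 56, 56])
```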
2. Explanation of ResNet
2.1. Introduction
In order to solve complex problems, we stack more layers in a deep neural network, which generally improves accuracy and performance.
The intuition behind adding more layers is that they progressively learn more complex features. For example, in the case of image recognition:
- The first layer may learn to detect edges.
- The second layer may learn to identify textures.
- The third layer can learn to detect objects.
However, when the network depth increases, accuracy gets saturated and then degrades rapidly.
Here is the plot of training and test error for a 20-layer network and a 56-layer network.
We can see that the error of the 56-layer network is higher than that of the 20-layer network on both the training and the test data → adding more layers degraded its performance.
This might look like overfitting, but the 56-layer network is worse on both training and test data. (If the network were overfitting, training error would be low while test error would be high.)
2.2. Residual Block
In order to solve the degradation problem, a deep residual learning framework is used. In this network, we use skip connections.
Skip connections allow the model to learn the identity function, which ensures that a deeper network performs at least as well as a shallower one, and not worse.
→ One way of achieving this is for the additional layers of a deep network to learn the identity function, so that their outputs equal their inputs; the extra layers then cannot degrade performance.
→ The advantage of adding this type of skip connection is that if any layer hurts the performance of the network, it can effectively be skipped, with its residual branch driven toward zero by regularization.
The idea behind this network is that instead of having the layers learn the underlying mapping directly, we let them fit a residual mapping.
Assume the output of the shallower learned layers is x, and the desired underlying mapping is H(x). We let the stacked layers approximate the residual mapping F(x) = H(x) - x.
In order to learn an identity function, $H(x)$ must be equal to $x$.
→ We then only need to drive the residual mapping $F(x)$ to zero, which is easier (by pushing the weights and biases of the stacked weight layers toward zero).
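As an illustration, here is a minimal PyTorch sketch of such a residual block with an identity shortcut, loosely following the paper's two-layer basic block; the class name, layer names, and the use of batch normalization are choices made for this example.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions that learn the residual mapping H(x) - x
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        # If F(x) is driven to zero, the block reduces to the identity: H(x) = x.
        return torch.relu(residual + x)

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```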
How does the identity connection affect the performance of the network?
During backpropagation, there are two pathways for the gradient to flow back to the input while traversing a residual block:
- Path 1: through the residual mapping way.
- Path 2: through the identity mapping way.
When we express the gradient of $H(x) = F(x) + x$ with respect to the input $x$:
$$\frac{\partial H(x)}{\partial x} = \frac{\partial F(x)}{\partial x} + \frac{\partial x}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$$
As we can see from the equation above, even if the gradient of Path 1 vanishes to 0, the gradient of Path 2 remains equal to 1, because this path does not pass through any weight layer.
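A tiny autograd check of this fact (purely illustrative, using a scalar stand-in for the residual branch whose weight has been pushed to zero):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(0.0)            # residual-branch weight pushed to zero

F = w * x                        # residual branch: dF/dx = w = 0
H = F + x                        # identity shortcut: dH/dx = dF/dx + 1
H.backward()

print(x.grad.item())             # prints 1.0: the gradient survives via the skip path
```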
There is a slight problem with this approach when the dimensions of the input x differ from those of the residual mapping F(x), which can happen with convolutional and pooling layers. In that case, we can take one of two approaches:
- The shortcut still performs identity mapping, with extra zero entries padded to increase the dimensions (this adds no extra parameters).
- A projection shortcut is used to match the dimensions, implemented by applying a 1x1 convolutional layer to the input. In this case, the output is:
$$H(x) = F(x) + W_s x$$
where $W_s$ is the projection (the 1x1 convolution).
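A PyTorch sketch of a residual block with a projection shortcut; the specific channel counts and the class name ProjectionResidualBlock are assumptions made for this example.

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        # F(x): the first 3x3 convolution changes both stride and channel count
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # W_s: 1x1 convolution with the same stride, so F(x) + W_s·x is well defined
        self.proj = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)

    def forward(self, x):
        out = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(out + self.proj(x))

x = torch.randn(1, 64, 56, 56)
print(ProjectionResidualBlock(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```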
2.3. ResNet architecture
There are many ResNet architectures. The table below shows the ResNet variants together with their building blocks and the number of blocks stacked in each stage.
🔹 We will go into detail about the 34-layer ResNet model:
Input image size: 224x224x3 (1)
First, the image is passed through a 7x7 convolution with stride 2 and 64 channels.
→ Output size: 112x112x64 (2)
Next, a 3x3 max-pooling with stride 2 is applied.
→ Output size: 56x56x64 (3)
Then three residual blocks are applied. Each residual block uses 3x3 filters with stride 1 and 64 channels.
→ Output size: 56x56x64 (4)
Next, four residual blocks are applied:
- In the first residual block, the first 3x3 convolution has stride 2 and 128 channels.
→ Output size: 28x28x128 (5)
- Because the output sizes of (4) and (5) are different, a projection shortcut is used to make the dimensions equal.
- The projection shortcut is a convolutional layer with a 1x1 kernel, stride 2 and 128 channels.
The remaining stages follow the same pattern, doubling the channels and halving the spatial size.
→ Final output size: 7x7x512
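As a quick sanity check of the first two steps, here is a sketch using standard PyTorch layers (the padding values are choices made to reproduce the paper's output sizes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

x = conv1(x); print(x.shape)   # torch.Size([1, 64, 112, 112])
x = pool(x);  print(x.shape)   # torch.Size([1, 64, 56, 56])
```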
🔹 50-layer and deeper ResNet models:
Because of concerns about training time, the authors modify the building block into a bottleneck design.
For each residual block, they use a stack of 3 layers instead of 2. The three layers are 1x1, 3x3, and 1x1 convolutions, where the 1x1 layers are responsible for reducing and then restoring the dimensions, leaving the 3x3 layer as a bottleneck with smaller input/output dimensions.
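As an illustration, here is a PyTorch sketch of such a bottleneck block; the channel counts mirror the first ResNet-50 stage, and the names (reduce, conv3x3, restore) are made up for this example.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.conv3x3 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.restore = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # projection shortcut, needed here because the channel counts differ
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        out = torch.relu(self.bn1(self.reduce(x)))     # 1x1: reduce dimensions
        out = torch.relu(self.bn2(self.conv3x3(out)))  # 3x3: the bottleneck
        out = self.bn3(self.restore(out))              # 1x1: restore dimensions
        return torch.relu(out + self.proj(x))

x = torch.randn(1, 64, 56, 56)
print(BottleneckBlock(64, 64, 256)(x).shape)  # torch.Size([1, 256, 56, 56])
```

Reducing to a small channel count before the 3x3 convolution keeps that expensive operation cheap, while the block's input and output stay at the larger width.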