Vanishing Gradients in ML
The problem of vanishing and exploding gradients is a common issue in the training of deep neural networks, especially with gradient-based optimization algorithms such as stochastic gradient descent (SGD) and its variants. The problem arises when the gradients of the loss function with respect to the model parameters become very small (vanishing) or very large (exploding) as they are backpropagated through the layers of the network during training.

Vanishing gradients occur when the gradients of the loss function with respect to the parameters of the model become extremely small as they are backpropagated through the network. As a result, the weights of the early layers are updated very little, and these layers may never learn meaningful representations.

The causes of vanishing gradients are:
- Saturating activation functions such as sigmoid and tanh, whose derivatives are close to zero over much of their input range.
- Depth itself: the chain rule multiplies one local derivative per layer, so many factors smaller than 1 shrink the gradient exponentially with the number of layers (a short sketch of this effect follows the list).
- Poor weight initialization, in which weights with magnitudes well below 1 repeatedly scale the backpropagated signal down.
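A minimal sketch of this mechanism, assuming a toy chain of 10 scalar layers with sigmoid activations and a weight of 0.5 (all values chosen purely for illustration): the backpropagated gradient is the product of one w * sigmoid'(z) factor per layer, each of which is at most 0.125, so the gradient reaching the first layer is many orders of magnitude smaller than 1.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

num_layers = 10
w = 0.5                        # illustrative scalar weight, the same for every layer
a = 1.0                        # input activation
pre_activations = []
for _ in range(num_layers):    # forward pass through the chain
    z = w * a
    pre_activations.append(z)
    a = sigmoid(z)

grad = 1.0                     # gradient of the loss with respect to the output
for z in reversed(pre_activations):
    grad *= w * sigmoid_prime(z)   # chain rule: one factor per layer

print(f"gradient reaching the first layer: {grad:.3e}")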
The consequences of vanishing gradients are:
- The weights of the early layers receive negligible updates, so those layers learn very slowly or not at all.
- Training can stall, with the loss plateauing even though the model has enough capacity.
- The network relies mostly on its later layers and fails to learn useful low-level representations of the input.
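One way to observe this consequence in practice is to inspect per-layer gradient norms after a backward pass. The sketch below is illustrative only: it assumes PyTorch and a made-up 10-layer sigmoid network with random data, not any particular model from this book.

import torch
import torch.nn as nn

# Illustrative deep sigmoid network; layer sizes and data are arbitrary.
layers = []
for _ in range(10):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 1))

x = torch.randn(64, 32)   # dummy inputs
y = torch.randn(64, 1)    # dummy targets
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient norms shrink toward the early layers, so their weights barely change.
for name, p in model.named_parameters():
    if "weight" in name:
        print(f"{name}: grad norm = {p.grad.norm().item():.3e}")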
The strategies for mitigating exploding gradients are:
- Gradient clipping, which rescales the gradient whenever its norm exceeds a chosen threshold (sketched below).
- Careful weight initialization (e.g., Xavier/Glorot or He initialization), so that weight magnitudes do not systematically amplify the signal.
- Normalization methods such as batch normalization, which keep activations and gradients in a controlled range.
- A smaller learning rate, which limits the size of each parameter update.
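Gradient clipping, the first item above, can be written in a few lines. A minimal numpy sketch, with the threshold and gradient values chosen only for illustration; in PyTorch the same idea is provided by torch.nn.utils.clip_grad_norm_.

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale a list of gradient arrays so their combined L2 norm is at most max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# An artificially large gradient is scaled down before the parameter update.
grads = [np.array([300.0, -400.0]), np.array([[120.0, 0.0], [0.0, -50.0]])]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # ~5.0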
Given a neural network with output

  y = W^[L] W^[L-1] ... W^[2] W^[1] x,

where x is the input to the network and each W^[l] is a 2x2 matrix, the output is the result of multiplying several weight matrices together.

Vanishing gradients occur when the gradients become very small during backpropagation, leading to negligible updates for the weights. In this network, if the entries of the weight matrices W^[l] are less than 1, and this pattern continues through the earlier layers, the gradients can diminish exponentially as they are backpropagated through the network. For example, if a = 0.2 and L = 10, the repeated multiplication of these small values during backpropagation can result in vanishing gradients, particularly for the weights in the earlier layers.

In this case, we have

  W^[l] = [[0.2, 0], [0, 0.2]].

Since the diagonal entries a = 0.2 are less than 1, each multiplication by the weight matrix W^[l] scales down the values of the input. After L layers, this can lead to the vanishing gradient problem.

Assuming all the weight matrices W^[l] are identical with the values above, the network simplifies, and the output can be expressed as

  y = W^L x = [[0.2^10, 0], [0, 0.2^10]] x ≈ 1.02 × 10^-7 x.

This means that after passing through 10 layers, the input is scaled down significantly, contributing to the vanishing gradient problem during backpropagation. The network may struggle to learn meaningful representations, especially in the earlier layers: the gradients with respect to the weights in those layers are likely to be very small, making the training process challenging. Mitigation strategies, such as weight initialization techniques, activation functions like ReLU, and normalization methods, become important to address the vanishing gradient problem and facilitate effective training in deep neural networks.
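The scaling in this example can be checked numerically. A minimal sketch, assuming the diagonal weight matrix W = [[0.2, 0], [0, 0.2]] used above, an arbitrary input vector, and L = 10 identical layers:

import numpy as np

a = 0.2
L = 10
W = np.array([[a, 0.0],
              [0.0, a]])      # the same weight matrix for every layer
x = np.array([1.0, 1.0])      # illustrative input

y = x.copy()
for _ in range(L):
    y = W @ y                 # each layer scales the signal by a

print(y)          # [1.024e-07 1.024e-07]
print(a ** L)     # 1.024e-07: the factor applied to the input after 10 layers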