Deep Learning

64_Normalization

elif 2024. 2. 2. 19:57

When training a neural network, if the data contains extremely large or small values, normalizing it to a consistent scale makes the learning process much more efficient. Normalization is therefore considered an essential part of training neural networks.

 

For example, the presence of extremely large values makes gradient descent training much more difficult. Consider a single-layer regression network with two weights: when the two corresponding input variables have very different ranges, a change in one weight produces a much larger change in the output, and hence in the error function, than a change in the other weight.

When dealing with continuous input variables, it is useful to first calculate the mean and variance of each variable and use them to rescale the inputs to a similar range.

$$\mu_i = \frac{1}{N}\sum_{n=1}^{N} x_{ni}, \qquad \sigma_i^2 = \frac{1}{N}\sum_{n=1}^{N} \left( x_{ni} - \mu_i \right)^2$$

Here, $x_{ni}$ denotes the value of input variable $i$ for the $n$-th of $N$ training data points. Having calculated the mean and variance once using the formula above, the input values are then rescaled using the following equation.

$$\tilde{x}_{ni} = \frac{x_{ni} - \mu_i}{\sigma_i}$$

Rescaling with the above formula ensures that each input variable has a mean of 0 and a variance of 1. It is very important to apply the same scaling values to the validation and test data as well.
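As a minimal illustration of this preprocessing step (not part of the original post; the array names are hypothetical), the following NumPy sketch standardizes a training matrix column-wise and then applies the same training statistics to the test data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two input variables with very different ranges.
X_train = np.column_stack([rng.normal(0.0, 1.0, 1000),
                           rng.normal(5000.0, 300.0, 1000)])
X_test = np.column_stack([rng.normal(0.0, 1.0, 200),
                          rng.normal(5000.0, 300.0, 200)])

# Mean and standard deviation per input variable, from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Standardize: each variable now has mean 0 and variance 1.
X_train_norm = (X_train - mu) / sigma

# Apply the *same* training statistics to the test (and validation) data.
X_test_norm = (X_test - mu) / sigma
```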

 

So far we have discussed the importance of normalizing the input data, and the same reasoning can be extended to the variables in each hidden layer of a deep neural network. If there are large fluctuations in the range of activation values in a particular hidden layer, normalizing these values to have a mean of 0 and a variance of 1 simplifies the learning problem in subsequent layers. However, unlike the normalization of input values, normalization for hidden layers needs to be repeated during training as the weight values are updated, and this is referred to as Batch Normalization.

Additionally, Batch Normalization becomes even more important for suppressing vanishing and exploding gradients when training very deep neural networks. From the chain rule of differentiation, the gradient of the error function $E$ with respect to a parameter $w^{(1)}$ in the first layer of the network is as follows.

$$\frac{\partial E}{\partial w^{(1)}} = \sum_{j,k, \ldots ,m} \frac{\partial E}{\partial z_j^K}\,\frac{\partial z_j^K}{\partial z_k^{K-1}} \cdots \frac{\partial z_m^2}{\partial w^{(1)}}$$

Where $z_j^k$ represents the activation of node $j$ in layer $k$, and each of the partial derivatives is an element of the Jacobian matrix for that layer. A product of many such terms tends toward 0 when most of the terms are smaller than 1 and toward infinity when most of the terms are larger than 1. Consequently, as the depth of the neural network increases, the gradient of the error function can become very small or very large. This issue can be addressed with Batch Normalization.
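To make the effect concrete, here is a toy sketch (my own illustration, not from the original post) that multiplies many identical Jacobian-like factors and shows how quickly the product shrinks toward 0 or blows up as the depth grows:

```python
import numpy as np

def gradient_scale(factor_per_layer: float, depth: int) -> float:
    """Product of `depth` identical Jacobian-like factors."""
    return float(np.prod(np.full(depth, factor_per_layer)))

for depth in (10, 50, 100):
    shrink = gradient_scale(0.9, depth)  # most factors below 1 -> vanishing gradient
    grow = gradient_scale(1.1, depth)    # most factors above 1 -> exploding gradient
    print(f"depth={depth:3d}  0.9^depth={shrink:.3e}  1.1^depth={grow:.3e}")
```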

 

In a multilayer network, the hidden units of a given layer compute a nonlinear function of their pre-activations, ${z_i} = h({a_i})$, so either ${a_i}$ or ${z_i}$ can be normalized. Normalization of the pre-activation values is applied to each mini-batch separately, since the weight values are updated after each mini-batch. Specifically, for a mini-batch of size $K$, it can be defined as follows.

$$\mu_i = \frac{1}{K}\sum_{n=1}^{K} a_{ni}$$

$$\sigma_i^2 = \frac{1}{K}\sum_{n=1}^{K} \left( a_{ni} - \mu_i \right)^2$$

$$\hat{a}_{ni} = \frac{a_{ni} - \mu_i}{\sqrt{\sigma_i^2 + \delta}}$$

Where the sum over $n = 1, \cdots ,K$ is taken over the elements of the mini-batch, and $\delta$ is a very small constant included to avoid numerical instability when $\sigma_i^2$ is very small. Normalizing the pre-activations in a given layer of the network reduces the degrees of freedom of the parameters in that layer. To compensate for this, the normalized pre-activations of the mini-batch are re-scaled to have an adjustable mean ${\beta _i}$ and standard deviation ${\gamma _i}$.

$$\tilde{a}_{ni} = \gamma_i \hat{a}_{ni} + \beta_i$$

Where ${\beta _i}$ and ${\gamma _i}$ are adaptive parameters that are learned along with the weights and biases of the neural network through gradient descent. It might seem that this formula simply undoes the effect of batch normalization. However, in the original network the mean and variance within a mini-batch are determined by a complex function of all the weights and biases of that layer, whereas in the formula above they are determined directly by the independent parameters ${\beta _i}$ and ${\gamma _i}$, which can be learned much more easily by gradient descent. Such batch normalization layers can be added after each hidden layer.
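Putting the two steps together, below is a minimal NumPy sketch of the batch normalization forward pass for a mini-batch of pre-activations of shape $(K, M)$, i.e. $K$ data points and $M$ hidden units; the function and variable names are my own, and `gamma`, `beta`, and `delta` follow the notation above:

```python
import numpy as np

def batch_norm_forward(A, gamma, beta, delta=1e-5):
    """Normalize each pre-activation over the mini-batch, then re-scale.

    A     : (K, M) mini-batch of pre-activations a_{ni}
    gamma : (M,)   learnable standard deviations gamma_i
    beta  : (M,)   learnable means beta_i
    """
    mu = A.mean(axis=0)                      # mu_i, one per hidden unit
    var = A.var(axis=0)                      # sigma_i^2, one per hidden unit
    A_hat = (A - mu) / np.sqrt(var + delta)  # normalized pre-activations
    return gamma * A_hat + beta, mu, var     # re-scaled output plus batch statistics

# Example: a mini-batch of K=32 data points, M=4 hidden units of very different scales.
rng = np.random.default_rng(1)
A = rng.normal(size=(32, 4)) * np.array([1.0, 10.0, 0.1, 100.0])
A_tilde, mu, var = batch_norm_forward(A, gamma=np.ones(4), beta=np.zeros(4))
print(A_tilde.mean(axis=0), A_tilde.std(axis=0))  # approximately beta and gamma
```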

 

After training, when making predictions on new data, the training mini-batches are no longer available, and the mean and variance cannot be determined from a single data point. While it would be possible to compute ${\mu _i}$ and $\sigma _i^2$ for each layer over the whole training set after the final update of the weights and biases, this is too costly because it requires processing the entire dataset. Instead, moving averages are computed during the training phase.

$$\bar{\mu}_i \leftarrow \alpha \bar{\mu}_i + (1 - \alpha)\mu_i$$

$$\bar{\sigma}_i \leftarrow \alpha \bar{\sigma}_i + (1 - \alpha)\sigma_i$$

Where $0 \leqslant \alpha  \leqslant 1$. These moving averages do not play a role during training, but are used in the inference phase to process new data points.
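A minimal sketch of this exponential moving average, reusing the kind of per-unit statistics computed in the hypothetical `batch_norm_forward` above (all variable names are my own):

```python
import numpy as np

alpha = 0.9              # smoothing constant, 0 <= alpha <= 1
mu_bar = np.zeros(4)     # running mean, one entry per hidden unit
sigma_bar = np.zeros(4)  # running standard deviation, one entry per hidden unit

# During training, update the running statistics after each mini-batch.
rng = np.random.default_rng(2)
for step in range(1000):
    A = rng.normal(size=(32, 4))               # stand-in for a mini-batch of pre-activations
    mu, sigma = A.mean(axis=0), A.std(axis=0)  # mini-batch statistics
    mu_bar = alpha * mu_bar + (1 - alpha) * mu
    sigma_bar = alpha * sigma_bar + (1 - alpha) * sigma

# At inference time, a single new data point is normalized with the moving averages.
a_new = rng.normal(size=4)
a_hat = (a_new - mu_bar) / np.sqrt(sigma_bar**2 + 1e-5)
```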

 

In the case of batch normalization, if the mini-batch size is too small, the estimates of the mean and variance become too noisy. Additionally, with very large training sets, mini-batches may be split across different GPUs, making global normalization inefficient. Therefore, instead of normalizing each hidden unit separately across the mini-batch, we can normalize across the hidden-unit values separately for each data point. This is known as layer normalization.

$$\mu_n = \frac{1}{M}\sum_{i=1}^{M} a_{ni}$$

$$\sigma_n^2 = \frac{1}{M}\sum_{i=1}^{M} \left( a_{ni} - \mu_n \right)^2$$

$$\hat{a}_{ni} = \frac{a_{ni} - \mu_n}{\sqrt{\sigma_n^2 + \delta}}$$

Where the sum over $i = 1, \cdots ,M$ is performed over all hidden units in the layer. Like batch normalization, separate learnable mean and standard deviation parameters are introduced for each hidden unit. As the same normalization function can be used during both training and inference, there is no need to store moving averages.
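For comparison with the batch normalization sketch above, here is a minimal layer normalization pass over the same kind of $(K, M)$ pre-activation matrix (again, the function name is my own); the only essential change is that the statistics are computed along the hidden-unit axis rather than the mini-batch axis:

```python
import numpy as np

def layer_norm_forward(A, gamma, beta, delta=1e-5):
    """Normalize across the M hidden units separately for each data point.

    A     : (K, M) pre-activations a_{ni}
    gamma : (M,)   learnable standard deviations
    beta  : (M,)   learnable means
    """
    mu = A.mean(axis=1, keepdims=True)   # mu_n, one per data point
    var = A.var(axis=1, keepdims=True)   # sigma_n^2, one per data point
    A_hat = (A - mu) / np.sqrt(var + delta)
    return gamma * A_hat + beta

# The same function works for a full mini-batch or a single data point (K = 1),
# so no moving averages need to be stored for inference.
rng = np.random.default_rng(3)
a_single = rng.normal(size=(1, 4))
print(layer_norm_forward(a_single, gamma=np.ones(4), beta=np.zeros(4)))
```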
