Batch Normalization

Problem
While a deep neural network trains, every weight update changes the distribution of the inputs that the hidden layers receive. This continuous change is called 'internal covariate shift' (ICS).
It forces layers to constantly adapt to new input distributions, which:

  • slows down training,
  • hinders convergence, and
  • makes hyper-parameter tuning difficult

A deep neural network for digit classification

images/deep_learning/fundamentals/batch_normalization/digit_classification.png

Batch Normalization is a technique that controls this variation: it normalizes the inputs to each layer over the current mini-batch, so that the features keep a stable scale and do not drift too much.

\[ \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \]

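where \(\mu_{\mathcal{B}}\) and \(\sigma_{\mathcal{B}}^2\) are the mean and variance of the current mini-batch \(\mathcal{B}\), computed per feature, and \(\epsilon \approx 10^{-5}\) is a tiny constant that prevents division by zero.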
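
The normalization step can be sketched in a few lines of NumPy. This is only an illustration of the formula above, not how any framework implements it; the function name `normalize_batch` and the toy data are assumptions made for this example.

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Normalize each feature (column) of a mini-batch using its own statistics."""
    mu = x.mean(axis=0)   # per-feature mini-batch mean
    var = x.var(axis=0)   # per-feature (biased) mini-batch variance
    return (x - mu) / np.sqrt(var + eps)

# Toy mini-batch: 32 samples, 4 features, deliberately far from mean 0 / variance 1
rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

x_hat = normalize_batch(batch)
print(x_hat.mean(axis=0))  # every entry is close to 0
print(x_hat.std(axis=0))   # every entry is close to 1
```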

If we always normalize to mean \(\mu\)=0 and variance \(\sigma^2\)=1, we might restrict the layer too much (e.g., forcing everything into the linear region of a Sigmoid function) and the network might lose representational power.

💡 So BatchNorm introduces learnable parameters:

\[y_i = \gamma \hat{x}_i + \beta\]
  • \(\gamma\) = scaling parameter
  • \(\beta\) = shifting parameter

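Note: The network can now decide for itself whether it wants the mean(\(\mu\)) to be 0 and the variance(\(\sigma^2\)) to be 1.
If the optimal state for the network is something else, it can learn values for \(\gamma\) and \(\beta\) that undo the normalization: with \(\gamma = \sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}\) and \(\beta = \mu_{\mathcal{B}}\), the original activations are recovered.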
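
To see why \(\gamma\) and \(\beta\) restore the lost flexibility, here is a small NumPy sketch. It hand-picks the values instead of learning them, which is an assumption made purely for illustration; the helper `scale_shift` is a hypothetical name.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=4.0, size=(64, 3))  # 64 samples, 3 features

# Normalization step
mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

def scale_shift(x_hat, gamma, beta):
    """y = gamma * x_hat + beta, applied per feature."""
    return gamma * x_hat + beta

# gamma = 1, beta = 0 leaves the normalized activations unchanged
y_default = scale_shift(x_hat, np.ones(3), np.zeros(3))

# gamma = sqrt(var + eps), beta = mu undoes the normalization
y_identity = scale_shift(x_hat, np.sqrt(var + eps), mu)

print(np.allclose(y_default, x_hat))  # True
print(np.allclose(y_identity, x))     # True
```

In a real network these two vectors are trained by gradient descent along with the weights, so the layer can settle anywhere between "fully normalized" and "not normalized at all".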

Inference Time

  • During training, the mean(\(\mu\)) and variance(\(\sigma^2\)) come from the current mini-batch.
  • At inference (test) time, we use frozen running averages of the mean(\(\mu\)) and variance(\(\sigma^2\)) that were accumulated during training, so the output for one sample does not depend on the other samples in the batch.
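
The two modes can be sketched as a tiny NumPy class. This is a simplified illustration, not a framework implementation: the class name `SimpleBatchNorm` is hypothetical, and the update rule assumes the Keras-style convention where `momentum` is the weight given to the old running value.

```python
import numpy as np

class SimpleBatchNorm:
    """Minimal batch norm for 2-D inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.99, eps=1e-5):
        self.gamma = np.ones(num_features)         # learnable scale (fixed here)
        self.beta = np.zeros(num_features)         # learnable shift (fixed here)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Use mini-batch statistics and update the running averages
            mu, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mu
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Use the frozen running averages
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=2, momentum=0.9)

# "Training": feed mini-batches drawn from a distribution with mean 3, variance 4
for _ in range(500):
    bn(rng.normal(loc=3.0, scale=2.0, size=(64, 2)), training=True)

print(bn.running_mean)  # close to 3
print(bn.running_var)   # close to 4

# Inference: a sample gets the same output alone or inside a larger batch
x_test = rng.normal(loc=3.0, scale=2.0, size=(8, 2))
alone = bn(x_test[:1], training=False)
in_batch = bn(x_test, training=False)[:1]
print(np.allclose(alone, in_batch))  # True
```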

Research Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe & Szegedy, 2015, https://arxiv.org/pdf/1502.03167

Batch Normalization Benefits
  • Mitigates changing input distributions (internal covariate shift).
  • Helps reduce vanishing/exploding gradients.
    • Allows for higher learning rates.
  • Smooths the optimization landscape.
  • Acts as a regularizer to reduce overfitting.
    • Because BN calculates the mean(\(\mu\)) and variance(\(\sigma^2\)) for each mini-batch, these statistics vary slightly across batches. This randomness introduces a small amount of noise into the activations, which acts as a regularizer, similar to dropout.
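
The noise behind the regularization effect is easy to observe. In the sketch below (toy data and batch size are assumptions for illustration), the same sample is normalized together with five different random mini-batches and comes out with a slightly different value each time.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))  # toy dataset with a single feature
sample = data[0]                   # the one sample we keep track of

outputs = []
for _ in range(5):
    # Put the tracked sample into a mini-batch with 31 other random samples
    idx = rng.choice(np.arange(1, 1000), size=31, replace=False)
    batch = np.vstack([sample[None, :], data[idx]])
    mu, var = batch.mean(axis=0), batch.var(axis=0)
    outputs.append(((sample - mu) / np.sqrt(var + 1e-5)).item())

print(outputs)  # five slightly different normalized values for the same input
```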

Code: Batch Normalization
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers
import numpy as np

# 1. Setup Synthetic Data (Binary Classification)
# 1000 samples, 20 features per sample
X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.randint(2, size=(1000, 1)).astype(np.float32)

# 2. Define the Sequential Model
model = models.Sequential([
    # Input Layer
    layers.Input(shape=(20,)),

    # Hidden Layer 1: Dense + L2 Regularization
    layers.Dense(64, kernel_regularizer=regularizers.l2(0.01), name="dense_1"),
    layers.BatchNormalization(name="batch_norm_1"), # Normalizes activations
    layers.Activation('relu'),
    layers.Dropout(0.3, name="dropout_1"),          # Prevents overfitting

    # Hidden Layer 2
    layers.Dense(32, kernel_regularizer=regularizers.l2(0.01), name="dense_2"),
    layers.BatchNormalization(name="batch_norm_2"),
    layers.Activation('relu'),
    layers.Dropout(0.2, name="dropout_2"),

    # Output Layer (Sigmoid for binary probability)
    layers.Dense(1, activation='sigmoid', name="output_layer")
])

# 3. Compile the Model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# 4. Print Architecture & Parameters
print("--- Model Architecture ---")
model.summary()

Output

--- Model Architecture ---
Model: "sequential_23"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_1 (Dense)                 │ (None, 64)             │         1,344 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ batch_norm_1                    │ (None, 64)             │           256 │
│ (BatchNormalization)            │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ activation_2 (Activation)       │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 32)             │         2,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ batch_norm_2                    │ (None, 32)             │           128 │
│ (BatchNormalization)            │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ activation_3 (Activation)       │ (None, 32)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 32)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ output_layer (Dense)            │ (None, 1)              │            33 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 3,841 (15.00 KB)
 Trainable params: 3,649 (14.25 KB)
 Non-trainable params: 192 (768.00 B)
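
The parameter counts in the summary can be checked by hand. Each BatchNormalization layer stores four values per feature: \(\gamma\) and \(\beta\) (trainable) plus the moving mean and moving variance (non-trainable, because they are updated by averaging, not by gradient descent). The helper names below are made up for this check.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    """Return (trainable, non_trainable) parameter counts of one BN layer."""
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving mean and moving variance
    return trainable, non_trainable

dense = [dense_params(20, 64), dense_params(64, 32), dense_params(32, 1)]
bn = [batch_norm_params(64), batch_norm_params(32)]

non_trainable = sum(nt for _, nt in bn)
total = sum(dense) + sum(t + nt for t, nt in bn)

print(dense)                     # [1344, 2080, 33]
print([t + nt for t, nt in bn])  # [256, 128]
print(total, total - non_trainable, non_trainable)  # 3841 3649 192
```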