One of the most important components that allows ResNets (and neural networks in general) to be deeper is Batch Normalization. According to [1], it ensures good signal propagation, which allows us to train deeper networks without the convolution activations exploding. However, Batch Normalization has certain caveats: it breaks the independence between training examples within a mini-batch, adds memory overhead, and makes it difficult to replicate trained models on different hardware.
To counter these disadvantages, I explored NFNets (Normalizer-Free ResNets) for this project, which free typical ResNets from batch normalization for good. Not only do they make training faster, but according to [2] even the smaller NFNet models match the performance of EfficientNet (one of the SOTA models) on ImageNet.
Freeing the Normalization: According to [4], in typical ResNets, Batch Normalization downscales the input to each residual block by a factor proportional to the standard deviation of the input signal, and each residual block increases the variance of the signal by an almost constant factor. Keeping these two findings in mind, the authors in [3] propose the following modified residual block that mimics this behaviour:
x_{l+1} = x_l + \alpha \, f_l(x_l / \beta_l)

where x_l denotes the input to the l-th residual block, f_l is the function computed by the l-th residual branch (parameterized to preserve variance, i.e. \mathrm{Var}(f_l(z)) = \mathrm{Var}(z)), \alpha is a scalar controlling how fast the variance grows after each block, and \beta_l = \sqrt{\mathrm{Var}(x_l)} is the expected standard deviation of the input to the l-th block. Since we normalize the data, \mathrm{Var}(x_0) = 1, and the variance can be tracked analytically as \mathrm{Var}(x_{l+1}) = \mathrm{Var}(x_l) + \alpha^2, so \beta_l = \sqrt{1 + l\alpha^2}.
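To make the variance bookkeeping concrete, below is a minimal PyTorch sketch of how such a normalizer-free residual block might be wired. The class name, channel sizes, the choice of alpha, and the use of plain convolutions in the branch are my own illustrative assumptions, not the exact architecture from [3]; the point is only the x + alpha * f(x / beta) structure and the analytic variance update.

```python
import torch
import torch.nn as nn

class NFResidualBlock(nn.Module):
    """Sketch of a normalizer-free residual block: x_{l+1} = x_l + alpha * f(x_l / beta_l).

    `beta` is the analytically expected std of the block's input, supplied by the
    network that stacks the blocks (Var(x_0) = 1, and Var grows by alpha^2 per block)."""

    def __init__(self, channels, alpha=0.2, beta=1.0):
        super().__init__()
        self.alpha = alpha
        self.beta = beta
        # residual branch f_l; in the real model these would be scaled-WS convolutions
        self.branch = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        # downscale the input by its expected std, apply the branch,
        # then add the result back to the skip path scaled by alpha
        return x + self.alpha * self.branch(x / self.beta)


# Stacking blocks: the expected variance is computed analytically, not from batch stats.
alpha = 0.2
expected_var = 1.0                      # data is normalized, so Var(x_0) = 1
blocks = []
for l in range(4):
    blocks.append(NFResidualBlock(64, alpha=alpha, beta=expected_var ** 0.5))
    expected_var += alpha ** 2          # each block adds alpha^2 to the variance
net = nn.Sequential(*blocks)
y = net(torch.randn(2, 64, 32, 32))
```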
Rectifying activation-induced mean shifts: According to [3], changing the residual block form, although helpful, introduced a few practical challenges arising from the mean shift observed in the hidden activations. To curb this mean shift and keep the variance on the residual branches preserved (rather than exploding), the authors in [3] propose Scaled Weight Standardization, inspired by [5]. They suggest re-parameterizing the weights of the convolution layers in the forward pass throughout training as below:
\hat{W}_{ij} = \gamma \cdot \frac{W_{ij} - \mu_i}{\sigma_i \sqrt{N}}

where \mu_i = (1/N) \sum_j W_{ij} and \sigma_i^2 = (1/N) \sum_j (W_{ij} - \mu_i)^2 are the mean and variance of the i-th row (output channel) of the weight matrix, N is the fan-in of the layer, and \gamma is a fixed gain constant chosen for the nonlinearity so that the residual branch preserves variance.
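Below is a rough PyTorch sketch of this re-parameterization applied to a convolution in the forward pass. The class name, the ReLU-specific value of gamma, and the small epsilon added for numerical stability are illustrative assumptions on my part; the reference implementation accompanying [3] may differ in details.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledWSConv2d(nn.Conv2d):
    """Conv2d whose weights are re-parameterized with Scaled Weight Standardization
    on every forward pass: W_hat = gamma * (W - mean) / (std * sqrt(N))."""

    def __init__(self, *args, gamma=math.sqrt(2.0 / (1.0 - 1.0 / math.pi)), **kwargs):
        super().__init__(*args, **kwargs)
        # gamma is a fixed gain chosen for the activation; the default above assumes ReLU
        self.gamma = gamma

    def forward(self, x):
        w = self.weight
        fan_in = w[0].numel()                        # N = in_channels * kH * kW
        mean = w.mean(dim=(1, 2, 3), keepdim=True)   # per-output-channel mean
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        # sigma_i * sqrt(N) == sqrt(var * N); 1e-4 is a small eps for stability (assumption)
        w_hat = self.gamma * (w - mean) / torch.sqrt(var * fan_in + 1e-4)
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)


conv = ScaledWSConv2d(64, 64, kernel_size=3, padding=1)
y = conv(torch.randn(2, 64, 32, 32))
```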
[1] https://arxiv.org/pdf/2101.08692.pdf
[2] https://arxiv.org/pdf/2102.06171.pdf
[3] https://arxiv.org/pdf/2101.08692.pdf
[4] https://proceedings.neurips.cc/paper/2020/file/e6b738eca0e6792ba8a9cbcba6c1881d-Paper.pdf
[5] https://arxiv.org/pdf/1903.10520.pdf