
A Generic Improvement to Deep Residual Networks Based on Gradient Flow.

Authors :
Santhanam V
Davis LS
Source :
IEEE transactions on neural networks and learning systems [IEEE Trans Neural Netw Learn Syst] 2020 Jul; Vol. 31 (7), pp. 2490-2499. Date of Electronic Publication: 2019 Aug 16.
Publication Year :
2020

Abstract

Preactivation ResNets consistently outperform the original postactivation ResNets on the CIFAR-10/100 classification benchmarks. Surprisingly, however, these results do not carry over to the standard ImageNet benchmark. First, we theoretically analyze this incongruity in terms of how the two variants differ in handling the propagation of gradients. Although identity shortcuts are critical in both variants for improving optimization and performance, we show that postactivation variants enable early layers to receive a diverse, dynamic composition of gradients from effectively deeper paths than preactivation variants do, allowing the network to make maximal use of its representational capacity. Second, we show that downsampling projections, although few in number, have a significantly detrimental effect on performance. We show that simply replacing downsampling projections with identity-like dense-reshape shortcuts improves the classification results of standard residual architectures such as ResNets, ResNeXts, and SE-Nets by up to 1.2% on ImageNet, without any increase in computational complexity (FLOPs).
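The abstract contrasts the two standard residual block orderings whose gradient propagation it analyzes. The sketch below (not the authors' code; class names `PostActBlock` and `PreActBlock` are illustrative) shows the structural difference in PyTorch: the postactivation block applies a ReLU after the addition, while the preactivation block keeps the identity shortcut entirely untouched.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PostActBlock(nn.Module):
    """Original (postactivation) residual block: conv-BN-ReLU, add, then ReLU."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # ReLU applied after the addition: the shortcut path passes through a
        # nonlinearity, which changes how gradients from deeper paths compose.
        return F.relu(out + x)


class PreActBlock(nn.Module):
    """Preactivation residual block: BN-ReLU-conv, with a pure identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        # No activation after the addition: gradients reach early layers
        # through an unmodified identity mapping.
        return out + x


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(PostActBlock(64)(x).shape, PreActBlock(64)(x).shape)
```

The dense-reshape shortcut proposed in the paper replaces the 1x1 downsampling projection in blocks that change resolution; its exact construction is not specified in the abstract, so it is not reproduced here.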

Details

Language :
English
ISSN :
2162-2388
Volume :
31
Issue :
7
Database :
MEDLINE
Journal :
IEEE transactions on neural networks and learning systems
Publication Type :
Academic Journal
Accession number :
31425125
Full Text :
https://doi.org/10.1109/TNNLS.2019.2929198