
Rich global feature guided network for monocular depth estimation.

Authors :
Wu, Bingyuan
Wang, Yongxiong
Source :
Image & Vision Computing, Sep 2022, Vol. 125.
Publication Year :
2022

Abstract

Monocular depth estimation is a classical but challenging task in the field of computer vision. In recent years, Convolutional Neural Network (CNN) based models have been developed to estimate high-quality depth maps from a single image. Most recently, some Transformer based models have led to great improvements. Researchers continue to seek better ways to handle global information processing, which is crucial for inferring depth relations but computationally expensive. In this paper, we take advantage of both the Transformer and the CNN and propose a novel network architecture, called Rich Global Feature Guided Network (RGFN), in which rich global features are extracted in both the encoder and the decoder. RGFN follows the typical encoder-decoder framework for dense prediction. A hierarchical Transformer serves as the encoder to capture multi-scale contextual information and model long-range dependencies. In the decoder, the Large Kernel Convolution Attention (LKCA) is adopted to extract global features at different scales and guide the network to progressively recover fine depth maps from low-resolution feature maps. In addition, we apply the depth-specific data augmentation method Vertical CutDepth to boost performance. Experimental results on both indoor and outdoor datasets demonstrate the superiority of RGFN over other state-of-the-art models. Compared with the most recent method AdaBins, RGFN improves the RMSE score by 4.66% on the KITTI dataset and 4.67% on the NYU Depth v2 dataset. [ABSTRACT FROM AUTHOR]
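The abstract describes the LKCA decoder block only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of what a large-kernel convolutional attention module could look like, assuming the common decomposition of a large kernel into a depthwise convolution, a dilated depthwise convolution, and a pointwise convolution whose output gates the input features. All module names, kernel sizes, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LargeKernelConvAttention(nn.Module):
    """Hypothetical sketch of a large-kernel convolutional attention block.

    Assumption: a large receptive field is approximated by stacking a
    depthwise conv, a dilated depthwise conv, and a 1x1 conv, and the
    result is used as a multiplicative attention map over the input.
    The RGFN paper's exact LKCA design may differ.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 5x5 depthwise conv captures local context.
        self.dw_conv = nn.Conv2d(channels, channels, kernel_size=5,
                                 padding=2, groups=channels)
        # 7x7 depthwise conv with dilation 3 covers a ~19x19 region.
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                    padding=9, dilation=3, groups=channels)
        # 1x1 conv mixes channels to form the attention map.
        self.pw_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        return x * attn  # element-wise gating of the input features


if __name__ == "__main__":
    feats = torch.randn(1, 64, 60, 80)  # e.g. a decoder feature map
    out = LargeKernelConvAttention(64)(feats)
    print(out.shape)  # torch.Size([1, 64, 60, 80])
```

In a decoder of this kind, such a block would typically be applied to each upsampling stage so that globally aggregated context guides the recovery of fine depth detail from coarse feature maps; the placement shown here is an assumption based on the abstract's description.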

Details

Language :
English
ISSN :
0262-8856
Volume :
125
Database :
Academic Search Index
Journal :
Image & Vision Computing
Publication Type :
Academic Journal
Accession number :
158605809
Full Text :
https://doi.org/10.1016/j.imavis.2022.104520