Depth estimation from a single input image is useful in applications such as robotics and autonomous driving. UNet-style networks with an encoder/decoder structure are widely used for supervised monocular depth estimation. Many studies have attempted to reduce the amount of computation in the encoder, but comparatively little work has addressed the computation in the decoder. A decoder typically repeats an operation that increases the image resolution while gradually reducing the number of channels. If this processing could instead be performed in a single step at a high upsampling factor, the amount of computation in the decoder could be reduced substantially.
To achieve this in a monocular depth estimation network, we propose a new network structure with fewer convolution layers in the decoder, which we call the Cocktail Glass Network (CGN). To make this structure possible, we also propose a new feature transformation method, Channel to Space Remapping (CSR), which directly moves the data accumulated along the channel dimension onto the image plane. With this method, low-resolution data with many channels can be converted into high-resolution data with few channels in a single layer.
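As a rough illustration of moving channel data onto the image plane, the sketch below uses the standard depth-to-space (pixel-shuffle-style) rearrangement in NumPy. The function name `channel_to_space`, the upsampling factor `r`, and the channel ordering are assumptions made for this example; the exact mapping used by CSR may differ.

```python
import numpy as np

def channel_to_space(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r) using reshapes only.

    Hypothetical helper: assumes each group of r*r channels holds the
    r x r block of output pixels for one input location.
    """
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)   # split channels into (c_out, dy, dx)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c_out, h, dy, w, dx)
    return x.reshape(c_out, h * r, w * r)

# A "thick" low-resolution feature map: 16 channels at 2x3.
low_res = np.arange(16 * 2 * 3, dtype=np.float32).reshape(16, 2, 3)
# One step at factor 4 gives a "thin" high-resolution map: 1 channel at 8x12.
high_res = channel_to_space(low_res, r=4)
print(high_res.shape)
```

No values are computed or discarded here: the output holds exactly the input elements in a new layout, so the step itself requires no multiply-accumulate operations.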
The proposed method can be implemented with simple reshaping operations and is therefore well suited to reducing the computational cost of depth estimation networks. In experiments on the NYU V2 and KITTI datasets, we demonstrate that the proposed method reduces the amount of computation in the decoder by half while maintaining the same level of accuracy, and that it can be used in both lightweight and large-capacity networks.
In the latter part of the paper, we show that the proposed method is particularly well suited to depth estimation networks, and we further propose a way to improve performance by adding a multilayer perceptron (MLP) to CSR. We also suggest that CSR can be used to reduce the amount of computation not only in depth estimation networks but also in super-resolution networks.