This dissertation addresses depth estimation with lightweight encoder-decoder networks. The research finds that, unlike in other lightweight computer vision methods, local texture information remains very important even in the last layer of a lightweight depth estimation network. It also finds that long-range shape information is important for improving network performance.
Based on these findings, this thesis designs RRNet, whose RR blocks allow the number of layers to be increased, and long-range shape information to be captured, without additional layer parameter cost. In addition, we propose the Condensed Dense Connection (CDC), which preserves local texture information through dense connections while remaining lightweight, and which reduces the weight of the decoder by a factor of 16 relative to the base model. Moreover, CDC acts as a regularizer when training the parameter-shared RR blocks. The network also runs well on the TX2, a mobile GPU. Compared with other comparable networks, it requires significantly less computation and significantly fewer parameters, and it is fast in terms of computation speed. On a CPU, the proposed RRNet can run as fast as the network without depthwise convolution.
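The effect of parameter sharing on depth and parameter count can be illustrated with a minimal sketch. It is not the RR block itself, whose internal structure is not described here: it assumes a hypothetical 3-tap 1D convolution with made-up weights and no nonlinearity, and only shows that re-applying the same kernel widens the receptive field while the number of stored parameters stays fixed.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D cross-correlation with an odd-length kernel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def shared_block_stack(signal, kernel, repeats):
    """Apply the SAME kernel `repeats` times (parameter sharing)."""
    for _ in range(repeats):
        signal = conv1d_same(signal, kernel)
    return signal


def receptive_field(kernel_size, repeats):
    """Receptive field of `repeats` stacked stride-1 convolutions."""
    return 1 + repeats * (kernel_size - 1)


kernel = [1.0, 1.0, 1.0]                 # 3 shared parameters (hypothetical values)
impulse = [0.0] * 5 + [1.0] + [0.0] * 5  # a single non-zero input sample

for repeats in (1, 2, 4):
    response = shared_block_stack(impulse, kernel, repeats)
    support = sum(1 for v in response if v != 0.0)  # width reached by the impulse
    shared_params = len(kernel)                     # does not grow with depth
    unshared_params = len(kernel) * repeats         # grows linearly with depth
    print(repeats, support, shared_params, unshared_params)
```

In this toy setting the impulse spreads over 3, 5, and 9 samples for 1, 2, and 4 applications, while the shared parameter count stays at 3; an unshared stack of the same depth would need 3, 6, and 12 parameters.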
Recent depth estimation methods have come to use encoders pretrained on ImageNet classification. Following this trend, the second proposed method is a lightweight decoder that can be applied to various encoders, so that its performance improves incrementally as encoders are enhanced. The proposed lightweight decoder uses axial attention~\cite{wang2020axial}, a self-attention approach known to capture long-range shape information. However, replacing all convolutions with axial attention destroys local texture. In~\cite{wang2020axial}, axial attention is applied to all layers for image segmentation and classification, where performance improves because these applications do not deal with local texture at the end of the network. To overcome this texture vanishing problem, this study places the axial attention layer at the front end of the decoder, motivated by StyleGAN, in which the generator acquires shape features in its first and second layers. In addition, to achieve the same effect as applying axial attention to multiple layers without losing local shape information, axial attention is applied as few times as possible: upsampling is performed by a factor of 8 in a single step, and the upsampled values are taken from the axial attention output.
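As a rough illustration of why axial attention captures long-range structure cheaply, the sketch below applies self-attention along the width axis and then along the height axis of a small feature map. It is a simplified, single-head version written under explicit assumptions: queries, keys, and values are the input features themselves, and the learned projections, multiple heads, and positional terms that practical implementations typically use are omitted. The function names are hypothetical and do not describe the decoder proposed in this thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_1d(seq):
    """Self-attention over a sequence of feature vectors.

    Assumption: queries = keys = values = the input vectors (no learned weights).
    """
    dim = len(seq[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in seq:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in seq])
        out.append([sum(w * v[c] for w, v in zip(weights, seq)) for c in range(dim)])
    return out


def axial_attention(fmap):
    """fmap[h][w] is a feature vector; attend along width, then along height."""
    height, width = len(fmap), len(fmap[0])
    # Width axis: every row is treated as an independent sequence.
    rows = [attend_1d(row) for row in fmap]
    # Height axis: every column of the row-attended map is a sequence.
    cols = [attend_1d([rows[h][w] for h in range(height)]) for w in range(width)]
    # cols is indexed [w][h]; transpose back to [h][w].
    return [[cols[w][h] for w in range(width)] for h in range(height)]


# Toy 2 x 3 feature map with 2 channels (made-up values).
fmap = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]],
]
out = axial_attention(fmap)
print(len(out), len(out[0]), len(out[0][0]))
```

After the two passes every output position has been influenced by every input position, yet each position only attends over W entries and then H entries rather than over all H x W positions at once.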
In this way, this thesis proposes a lightweight decoder network that preserves both long-range shape information and local texture well. The proposed method is evaluated on the NYU v2 and KITTI datasets, and its performance improves greatly on KITTI. This result indicates that the proposed method preserves long-range shape information well, because KITTI contains homogeneous objects with long-range shapes, such as streets and walls.
Finally, the lightweight depth estimation network is expected to have high utility in manufacturing environments, so dimension measurement using depth estimation is also studied. Dimension estimates in a manufacturing environment are discontinuous, whereas depth estimation is, in general, a regression problem. In addition, it is difficult to measure exact dimensions from texture alone, because the textures of manufactured objects are much more homogeneous than those in other settings. To overcome this problem, this thesis proposes a magnifier loss that amplifies minute changes in texture so that dimensions can be measured accurately.