The advancement of Deep Neural Networks (DNNs) has significantly transformed our daily lives through various Computer Vision (CV) applications. Tasks such as image classification, object recognition, and motion detection, previously handled by distinct specialized algorithms, have been unified under DNN-based approaches with superior performance. Moreover, advances in DNN architectures, such as the Vision Transformer (ViT), have continued to raise the performance of computer vision applications.
However, the substantial computational requirements and data volumes of DNN-based image processing hinder the commercialization of such applications. This is especially true for real-time interactive computer vision applications, which typically run on resource-constrained edge devices such as mobile and IoT devices. For instance, the energy consumed loading data from memory shortens device battery life, and execution time grows as numerous multiplication operations run on limited computational resources.
Therefore, this thesis proposes a hardware-algorithm co-optimization technique to reduce the energy consumption and execution time of DNN-based computer vision applications. First, to reduce energy, the goal is to share data, namely model weights and feature values, among multiple computer vision tasks. To share weights between tasks, a transfer learning technique is introduced that keeps the backbone network's weights frozen while training the model for a specific task, so a single backbone can serve all tasks. Additionally, a feature value sharing technique that exploits image characteristics reduces the memory required to store feature values. To maximize the benefit of these algorithmic techniques, a hardware architecture for per-task weight and feature value processing is proposed, along with a dataflow that enables data sharing between tasks, yielding significant energy savings.
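The weight-sharing idea can be illustrated with a minimal sketch: a backbone whose weights are trained once and then frozen, with each task attaching only a small task-specific head. All names here (`make_layer`, `head_cls`, `head_det`) are hypothetical stand-ins, not the thesis's actual model; the point is only that the backbone weights and its output features exist once in memory and are reused by every task.

```python
import random

def make_layer(n_in, n_out, rng):
    # Dense layer weights as a list of rows (toy stand-in for a backbone block).
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def forward(layer, x):
    # Plain matrix-vector product followed by ReLU.
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in layer]

rng = random.Random(0)
backbone = make_layer(4, 8, rng)     # trained once, then frozen

# Each task adds only a small head; the backbone weights are never copied
# or retrained, so every task reads the same weight buffer from memory.
head_cls = make_layer(8, 3, rng)     # e.g. a classification head
head_det = make_layer(8, 2, rng)     # e.g. a detection head

x = [0.5, -0.2, 0.1, 0.9]
features = forward(backbone, x)      # computed once, shared across tasks
out_cls = forward(head_cls, features)
out_det = forward(head_det, features)
```

With N tasks, memory holds one backbone plus N small heads instead of N full networks, which is the source of the energy savings the text describes.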
Second, to reduce the execution time of computer vision applications, methods for reducing the number of feature values are presented. A Token Merging technique, adapted and optimized for computer vision applications, addresses the limitations of Token Pruning, which is commonly used to compress transformer models but discards token information outright. Furthermore, a hardware architecture is proposed to efficiently process the resulting lightweight Vision Transformer: unit designs for Token Merging, together with a new pipeline architecture that minimizes the associated overheads, significantly reduce the overall execution time of deep neural network models.
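The contrast with pruning can be sketched as follows. This is a simplified illustration, not the thesis's actual merging scheme or the hardware dataflow: it merges the single most similar token pair (by cosine similarity) into their average, so the token count drops without any token being discarded outright.

```python
import math

def cosine(a, b):
    # Cosine similarity between two token vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_most_similar(tokens):
    """Merge the single most similar token pair by averaging.

    Unlike pruning, no token is dropped: the merged token retains
    information from both inputs, which is the key difference the
    text cites against Token Pruning.
    """
    best, pair = -2.0, (0, 1)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if s > best:
                best, pair = s, (i, j)
    i, j = pair
    merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in pair] + [merged]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
tokens = merge_most_similar(tokens)   # 4 tokens -> 3 tokens
```

Applying such a step at each transformer block shrinks the token sequence, and attention cost falls roughly quadratically with token count, which is where the execution-time reduction comes from.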
These strategies collectively contribute to mitigating the challenges posed by the computational demands and data sizes of DNNs, making them more practical and efficient for various computer vision applications.