With the advent of the Artificial Intelligence (AI) and Big Data era, the volume of data that users need to process has increased exponentially. Under these circumstances, traditional computing architectures, particularly the von Neumann architecture, suffer significant performance degradation due to the data-transfer bottleneck between memory and compute units. To address this issue, the processing-in-memory (PIM) architecture, which integrates compute units into memory, has emerged and gained prominence as a next-generation computing architecture. However, in current computer systems, various types of memory exist hierarchically, each with different interconnect, capacity, and access-speed characteristics. As a result, these diverse memory characteristics demand different considerations at the circuit, architecture, and system levels when implementing a PIM architecture.
In this paper, we introduce PIM architecture research across different levels of the memory hierarchy. Through this research, the paper addresses the essential considerations for implementing a PIM architecture at each memory level. It is primarily divided into two main sections: one on near-memory-processing-based PIM architectures at the storage level, and the other on in-memory-processing-based PIM architectures at the cache/SRAM level. At the storage level, we present near-memory-processing-based PIM architectures for data-intensive applications, such as large-scale graph-based nearest neighbor search and advanced data analytics. Each study introduces a novel hardware acceleration platform for its target application, leveraging a computational storage device that can directly access storage data. More specifically, we provide a hardware architecture that exploits the bandwidth benefit of the computational storage device, along with a detailed microarchitecture for accelerating the target computations. In addition, we develop a software stack that enables seamless integration of the computational storage device into the new acceleration platform. At the cache/SRAM level, we present in-memory-processing-based PIM architectures for deep learning applications. Because the memory and the processor reside on the same side without an intervening interconnect, near-memory processing offers little bandwidth gain at this level. Therefore, unlike at the storage level, we focus on developing an in-memory-processing-based PIM architecture and a memory cell structure capable of performing deep learning operations with low power consumption and high performance. Building on these results, we finally present a novel reconfigurable architecture that leverages the strengths of both the PIM and traditional von Neumann architectures.
Based on this, we present an energy-efficient multi-DNN hardware accelerator and also provide new scheduling and compilation techniques for its efficient processing. In conclusion, this paper describes the utility of the PIM architecture for various applications in the machine learning and deep learning fields, offering insights to the hardware architecture community.