U-PIM: a unified processing-in-memory architecture with multiple dataflows for machine learning inference and training

As artificial intelligence and machine learning technology revolutionize our daily lives, many accelerator architectures have been proposed for faster and more energy-efficient processing of these workloads. However, their system performance is often limited by the von Neumann bottleneck: the unavoidable data-bandwidth problem caused by separating the computation and memory units. The processing-in-memory paradigm, which merges logic into memory, has become increasingly popular as a way to address this bottleneck. In this paper, we propose a unified processing-in-memory (U-PIM) architecture that supports both inference and training for various deep learning models, including MLPs, CNNs, RNNs, and transformers. U-PIM comprises an array of SRAM-based PIM macros and an embedded DRAM, where the macros work on tiled workloads and the eDRAM provides a global memory space. U-PIM allows various dataflows based on the proposed tile scheduling algorithms, covering forward propagation, error propagation, and weight update for end-to-end on-chip training. It also supports variable bit precision ranging from 1-bit to 16-bit for inference with quantized models. Throughout the entire processing, U-PIM efficiently handles sparsity for better performance and energy efficiency. To validate the U-PIM architecture, we implement a U-PIM macro that contains an 8T-cell-based 3-way processing memory and a 6T-cell-based weight update memory, along with bit-serial accumulation logic, in a compact footprint of 0.315mm$^2$ in a 28nm process. With 64 macros in an 8×8 array, U-PIM achieves 0.31-18.18 TOPS inference performance on several layers from popular models. Finally, we demonstrate that U-PIM can successfully train the VGG16 model on the CIFAR100 dataset with a negligible loss in accuracy.
As a result, it achieves 1.29 TOPS/W power efficiency and 7.65 GOPS/mm$^2$ area efficiency in training, which is 186.24 times more power-efficient and 2.8 times more area-efficient than the Nvidia TITAN X GPU.
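The bit-serial accumulation mentioned above, which underlies U-PIM's variable 1-bit to 16-bit precision, can be illustrated with a short software sketch. This is a minimal model of the bit-serial principle only, not the actual macro logic; the function name and the two's-complement weight handling are assumptions for illustration:

```python
def bit_serial_dot(x, w, bits=8):
    """Compute dot(x, w) by processing one weight bit-plane per step,
    mimicking bit-serial accumulation with two's-complement weights."""
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # Extract bit-plane b of every weight (each element is 0 or 1).
        plane = [((wi & mask) >> b) & 1 for wi in w]
        # One plane's worth of 1-bit multiply-accumulates.
        partial = sum(xi * pi for xi, pi in zip(x, plane))
        # The MSB plane carries negative weight in two's complement.
        acc += (-partial if b == bits - 1 else partial) << b
    return acc

# Example: 1*4 + 2*(-5) + 3*6 = 12, computed one bit-plane at a time.
print(bit_serial_dot([1, 2, 3], [4, -5, 6], bits=8))
```

Because each weight bit-plane takes one pass through the loop, lowering the weight precision directly shortens the computation, which is how a bit-serial design trades accuracy for throughput across the 1-bit to 16-bit range.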
Advisors
Kim, Joo-Young
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Master's thesis - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2022.2, [iv, 28 p.]

URI
http://hdl.handle.net/10203/309858
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=997248&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
