Hardware and software systems for accelerating large-scale deep learning recommendation models

Deep learning-based recommendation models (DLRMs) are widely used for personalized recommendation. They employ learnable vector parameters, known as embeddings, that represent the individualized characteristics of users and of recommended items such as media content, products, and ads. A unique characteristic of DLRMs is that, due to the embedding layer, the model size scales in proportion to the size of the online service. Consequently, DLRMs reach the terabyte scale for massive online services like Facebook, far exceeding the capacity of bandwidth-optimized accelerator memory. The memory capacity and bandwidth demands of these enlarged embedding layers pose new system-level challenges in training and deploying large-scale recommendation models. This dissertation addresses the bottlenecks of large-scale deep learning recommendation models by proposing novel hardware and software systems.

The dissertation first identifies the enlarged embedding layers as the major performance challenge in DLRMs. It characterizes the computational behavior of such layers and proposes near-memory processing (NMP) based accelerator hardware that efficiently stores and processes these embeddings. The proposed vertically integrated hardware/software co-design encompasses the required microarchitecture, instruction set architecture (ISA), system architecture, software stack, and a workload parallelization algorithm. To extend NMP-based embedding acceleration to the training context, the dissertation further presents an algorithm-architecture co-design that establishes a theoretical foundation for hardware accelerator design for the embedding layer.

While such specialized hardware-based acceleration systems can fundamentally address the challenges posed by large embedding layers, developing and maintaining them incurs non-trivial costs. As a cost-effective alternative, this dissertation also presents software optimization techniques. Exploiting the highly sparse and skewed access patterns of the embedding layers, it presents a software-managed caching system that uses high-bandwidth GPU memory to cache frequently accessed embedding entries. The proposed software system leverages a unique characteristic of recommendation model training to perfectly prefetch soon-to-be-accessed embedding entries, boosting training speed. Lastly, the dissertation analyzes the challenges of building software systems that exploit the locality of the embedding layer during inference, and proposes a new caching technique for the embedding layer. The proposed caching mechanism leverages the massively parallel address translation hardware in the accelerator to eliminate the bottlenecks of a software-managed embedding cache, making it highly effective for recommendation inference acceleration.
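As a rough illustration of the embedding-layer behavior described above, the following minimal Python/PyTorch sketch shows (1) the sparse, memory-bound gather-reduce that embedding lookups perform, and (2) a toy software-managed cache that prefetches the rows an upcoming batch will touch. It assumes that future training inputs are known ahead of time (the premise behind "perfect" prefetching); the table sizes, the EmbeddingCache class, and its capacity are hypothetical choices for illustration, not the dissertation's actual implementation.

    # Minimal sketch (hypothetical, not the dissertation's implementation).
    import torch

    NUM_ROWS, DIM = 1_000_000, 64       # toy sizes; real tables reach TB scale
    table = torch.nn.EmbeddingBag(NUM_ROWS, DIM, mode="sum")

    # (1) One batch: 512 samples, each pooling 8 sparse lookups. Almost no
    # arithmetic per byte moved -- the memory-bandwidth bottleneck that
    # motivates near-memory processing of embedding layers.
    indices = torch.randint(0, NUM_ROWS, (512 * 8,))
    offsets = torch.arange(0, 512 * 8, 8)
    pooled = table(indices, offsets)     # shape: (512, 64)

    # (2) Toy software-managed cache: hot rows staged in fast (GPU-like)
    # memory, while the full table stays in large, slow memory.
    class EmbeddingCache:
        def __init__(self, weights, capacity):
            self.weights = weights       # full embedding table (slow memory)
            self.capacity = capacity
            self.cache = {}              # row id -> vector (fast memory)

        def prefetch(self, next_batch_ids):
            # If the next batch is known in advance, every row it needs can
            # be staged before it is accessed, hiding slow-memory latency.
            for rid in next_batch_ids.unique().tolist():
                if rid not in self.cache:
                    if len(self.cache) >= self.capacity:
                        self.cache.pop(next(iter(self.cache)))  # naive eviction
                    self.cache[rid] = self.weights[rid]

        def lookup(self, ids):
            # Serve from the cache on a hit; fall back to the full table.
            return torch.stack([self.cache[i] if i in self.cache
                                else self.weights[i]
                                for i in ids.tolist()])

    cache = EmbeddingCache(table.weight.data, capacity=4096)
    next_batch = torch.randint(0, NUM_ROWS, (64,))
    cache.prefetch(next_batch)           # stage rows before the batch runs
    vectors = cache.lookup(next_batch)   # all hits: served from fast memory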
Advisors
Minsoo Rhu (유민수)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2024.2, [vi, 79 p.]

Keywords

Deep learning; Recommendation system; Computer architecture; Memory-centric architecture; Accelerated computing; Embedding

URI
http://hdl.handle.net/10203/322136
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1100036&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
