Accelerating text generation by minimizing memory transfer in attention mechanism

Text generation models based on autoregressive transformers have been instrumental in advancing applications such as chatbot systems and virtual assistants. When the model generates text for multiple batched requests, the key/value pairs used in the attention mechanism cannot be shared across them, which prolongs execution time. Because the attention mechanism is memory-bound, off-chip memory accesses must be minimized for faster execution. Although previous methods reduced off-chip memory accesses for unimportant tokens, they fall short of selectively removing the negligible tokens in each instance. Instead, this dissertation estimates the attention weights using bit chunks of the key (K) vectors, effectively removing the memory accesses for low-weight tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, this dissertation presents a consecutive bit-chunk request scheme that prevents the underutilization of Processing Elements (PEs) induced by on-demand DRAM access. Finally, dedicated hardware equipped with PEs and auxiliary modules is designed to support the proposed methods. As a result, the design reduces memory accesses by 2.6x, leading to an average 2.3x speedup and 2.4x higher energy efficiency.
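To illustrate the general idea described in the abstract, the following is a minimal sketch (not the thesis's actual design) of bit-chunk-based attention pruning: approximate scores are computed from only the high-order bit chunk of quantized K vectors, and full K/V rows are fetched only for tokens whose estimated weight survives. The function names, the 8-bit/4-bit split, and the keep ratio are illustrative assumptions.

```python
# Minimal sketch of bit-chunk-based attention pruning (illustrative only).
import numpy as np

def quantize(K, bits=8):
    """Symmetric quantization of K to signed integers (assumed scheme)."""
    scale = np.abs(K).max() / (2 ** (bits - 1) - 1)
    return np.round(K / scale).astype(np.int32), scale

def attention_with_bit_chunk_pruning(q, K, V, keep_ratio=0.1, bits=8, msb_bits=4):
    # Estimate q.k scores from the most-significant bit chunk of K only.
    K_q, scale = quantize(K, bits)
    K_msb = (K_q >> (bits - msb_bits)) << (bits - msb_bits)  # keep MSB chunk
    approx_scores = (K_msb * scale) @ q                      # cheap estimate

    # Keep only the tokens with the largest estimated weights.
    n_keep = max(1, int(len(K) * keep_ratio))
    kept = np.argsort(approx_scores)[-n_keep:]

    # Exact attention restricted to the surviving tokens
    # (only these K/V rows would be read from off-chip memory).
    scores = K[kept] @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[kept]

# Toy usage: 1024 cached tokens, head dimension 64.
rng = np.random.default_rng(0)
q = rng.standard_normal(64).astype(np.float32)
K = rng.standard_normal((1024, 64)).astype(np.float32)
V = rng.standard_normal((1024, 64)).astype(np.float32)
out = attention_with_bit_chunk_pruning(q, K, V)
print(out.shape)  # (64,)
```

In this sketch the pruning decision is made per query instance, which mirrors the abstract's point that tokens negligible for one instance need not be fetched at all; how weights are actually estimated and scheduled in hardware is specified only in the dissertation itself.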
Advisors
김이섭
Description
Korea Advanced Institute of Science and Technology: School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2024.2, [iv, 38 p.]

Keywords

Transformer architecture; text generation; attention mechanism; AI accelerator design; out-of-order processing

URI
http://hdl.handle.net/10203/321643
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1097215&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's Theses)
Files in This Item
There are no files associated with this item.
