Let it reuse: a multi-mode sparse attention inference accelerator with a unified multi-precision datapath

DC Field: Value
dc.contributor.advisor: Kim, Lee-Sup
dc.contributor.advisor: Kim, Lee-Sup (김이섭)
dc.contributor.author: Yeo, Unhak
dc.date.accessioned: 2023-06-26T19:33:39Z
dc.date.available: 2023-06-26T19:33:39Z
dc.date.issued: 2022
dc.identifier.uri: http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1008352&flag=dissertation
dc.identifier.uri: http://hdl.handle.net/10203/309836
dc.description: Master's thesis - KAIST (Korea Advanced Institute of Science and Technology): School of Electrical Engineering, 2022.8, [iv, 34 p.]
dc.description.abstract: Transformer-based models are rapidly emerging across many fields of deep neural networks. Accordingly, accelerators for the self-attention mechanism, the bottleneck of the Transformer, are actively studied today. However, real-world accelerators require not only high performance but also generality and flexibility. First, because the precision and datatype required by each task differ, accelerators should support multiple precisions. Second, because the required accuracy, energy, and latency change with the deployment scenario, accelerators should flexibly support multiple modes without severe hardware underutilization. Real-world accelerators must deliver high performance even under these constraints. This thesis shows that the prior design framework has reached its limit in terms of computational savings, and presents an interpretable design framework called "Let It Reuse." To utilize this framework effectively while satisfying real-world constraints, it takes a co-optimization approach spanning the algorithm, architecture, and microarchitecture. In detail, this thesis proposes a multi-mode-aware pipeline with a unified multi-precision datapath and explores reusability according to the datatype. In experiments on a question-answering task, the Let It Reuse accelerator improves the geomean speedup by 24x over a GPU (an up-to-date NVIDIA Ampere architecture) and by 4x over Sanger, a state-of-the-art attention accelerator.
dc.language: eng
dc.publisher: KAIST (한국과학기술원)
dc.subject: Transformer; Self-attention Mechanism; Sparse; Multi-mode; Multi-precision; Co-optimization
dc.title: Let it reuse
dc.title.alternative: A multi-mode sparse attention inference accelerator with a unified multi-precision datapath (통합된 다중 정밀도 데이터 연산을 통한 다중 모드 희소 어탠션 추론 가속기)
dc.type: Thesis (Master)
dc.identifier.CNRN: 325007
dc.description.department: KAIST: School of Electrical Engineering
dc.contributor.alternativeauthor: Yeo, Unhak (여운학)
dc.title.subtitle: a multi-mode sparse attention inference accelerator with a unified multi-precision datapath
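The abstract centers on accelerating sparse self-attention, where score entries pruned by a sparsity pattern are skipped to save computation. A minimal software sketch of that idea is below; the band-shaped mask and all function names are illustrative assumptions, not the thesis's actual method or pipeline.

```python
import numpy as np

def sparse_attention(Q, K, V, mask):
    """Self-attention in which score entries disallowed by `mask`
    (a boolean n x n matrix) are zeroed out of the softmax -- a
    software stand-in for the computation a sparse-attention
    accelerator would skip entirely."""
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)            # (n, n) attention scores
    scores = np.where(mask, scores, -np.inf)   # prune masked pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Illustrative sparsity pattern: keep only a local band of width 2.
mask = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= 2
out = sparse_attention(Q, K, V, mask)
print(out.shape)  # (8, 16)
```

In hardware, the payoff comes from never computing the masked score entries at all; the multi-precision angle of the thesis would additionally let Q, K, and V use different bit widths per task, which this float-only sketch does not model.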
Appears in Collection
EE-Theses_Master(석사논문)
Files in This Item
There are no files associated with this item.
