Super Floating-Point (SuFP): Multi-region piecewise quantization with scalable bias

Deep Neural Networks (DNNs) are transforming numerous fields, but as they do so, the size of these models and their computational requirements are growing at an exponential rate. In response to these challenges, various quantization techniques have emerged as highly effective solutions. However, quantization methods that use conventional data types, such as integer or floating-point, face limitations in balancing accuracy loss against computational benefit. With the advent of hardware accelerator design for AI processing, quantization research has entered a new phase: custom data types and specialized hardware have emerged as innovative alternatives. In particular, piecewise quantization and block floating-point quantization show notable performance and efficiency improvements, but they still struggle to handle outliers with large dynamic ranges. To solve this issue, we introduce Super Floating-Point (SuFP), a breakthrough data type and quantization method that improves both memory footprint and logic efficiency without compromising model accuracy. The key idea of SuFP is multi-region piecewise quantization with a tensor-wise scalable bias: each region can be configured with an optimized precision, capturing both the dense near-zero data and the outliers. In addition, the scalable bias adapts flexibly to diverse data distributions while requiring only a single addition operation at the tensor level. Furthermore, the hardware tailored for SuFP employs only integer arithmetic units and shifters, enabling a highly compact realization. Our experimental results show that SuFP quantization achieves accuracy on par with, and in some cases exceeding, full-precision floating-point (FP32) across vision, language, and generative model benchmarks. SuFP improves computational capability by 9.00× and energy efficiency by 17.04× over FP32 implementations. These improvements are notable compared with the state-of-the-art MSFP and BSFP, which achieve up to 7.20× and up to 8.27×, respectively.
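
For intuition, the sketch below illustrates the general idea of multi-region piecewise quantization with a tensor-wise scalable bias as described in the abstract. It is a minimal NumPy toy: the region boundaries, per-region bit budgets, and the `sufp_like_quantize` helper are assumptions made for illustration and do not reproduce the thesis's actual SuFP format or its integer-and-shifter hardware datapath.

```python
# Minimal illustrative sketch, NOT the thesis's actual SuFP encoding: the
# region boundaries, bit budgets, and helper names below are assumptions.
import numpy as np

def sufp_like_quantize(tensor, region_edges=(0.125, 0.5), region_bits=(6, 4, 2)):
    """Toy multi-region piecewise quantizer with a tensor-wise scalable bias.

    The whole tensor shares one power-of-two bias (the abstract notes the
    real scalable bias costs only a single addition at the tensor level).
    After rescaling, each element is quantized with a step size chosen by
    the magnitude region it falls into: fine steps for the dense near-zero
    region, coarser steps for the outlier region.
    """
    # Tensor-wise scalable bias: one shared power-of-two scale.
    bias = np.floor(np.log2(np.max(np.abs(tensor)) + 1e-12))
    scaled = tensor / (2.0 ** bias)

    # Assign each element to a magnitude region (boundaries are assumed).
    region = np.digitize(np.abs(scaled), region_edges)

    # Per-region quantization step from the assumed fractional bit budget.
    step = np.array([2.0 ** -b for b in region_bits])[region]
    quantized = np.round(scaled / step) * step

    return quantized * (2.0 ** bias), bias

# Example: mostly dense near-zero values plus a few large outliers.
x = np.concatenate([0.05 * np.random.randn(1000), np.array([25.0, -40.0])])
xq, shared_bias = sufp_like_quantize(x)
```

Under these assumptions, a single power-of-two bias shared across the tensor adapts the quantizer to each tensor's dynamic range, while the per-element work reduces to a region lookup and a rounding step.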
Advisors
김주영
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Master's thesis - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2024.2, [iv, 34 p.]

Keywords

Post-training quantization; Piecewise quantization; Block floating-point quantization; Hardware-friendly data type

URI
http://hdl.handle.net/10203/321597
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1097169&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
