(A) software framework for estimating training time of trillion-parameter scale distributed machine learning
Korean title (translated): Software framework for predicting the training time of large-scale distributed machine learning

As deep neural network (DNN) models rapidly grow in size to improve performance, the compute resources required for DNN training are increasing exponentially. Such large-scale training is performed on distributed systems using various parallelism techniques, and training performance on these systems varies drastically with the DNN model architecture, the network topology, and the combination of parallelism techniques. However, finding the optimal training configuration is immensely expensive, which prevents compute resources from being used effectively in large-scale training. To address this, this thesis proposes a simulation framework that predicts the training iteration time of distributed training. The proposed framework accurately predicts the training iteration time for various configurations, with a mean absolute error of 12.80%, facilitating efficient exploration of the optimal training configuration.
Advisors
Rhu, Minsoo (유민수)
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2023
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2023.2, [iv, 24 p.]

Keywords

Distributed training; Deep neural networks; Simulation; Parallelization; GPU

URI
http://hdl.handle.net/10203/309962
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1033105&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
