(A) batch orchestration algorithm for straggler mitigation of synchronous SGD in heterogeneous GPU cluster

Training a deep learning model is time consuming, so extensive research has been conducted on accelerating training through distributed processing. Data parallelism is one of the most widely used distributed training schemes, and various algorithms for it have been studied. However, since most of these studies assume a homogeneous computing environment, they do not consider heterogeneous-performance graphics processing unit (GPU) clusters, which arise from the rapid generational change of GPU hardware. In such clusters, workers differ in performance, which leads to differences in per-iteration computation time under synchronous data parallelism, where the global mini-batch is usually divided equally among workers. Because of this difference, fast workers must wait for the slowest worker at each iteration, and this straggler problem slows down training. In this thesis, we propose a batch-orchestrating algorithm (BOA) that reduces training time by improving hardware efficiency in heterogeneous-performance GPU clusters. The proposed algorithm coordinates the local mini-batch sizes of all workers to reduce the time of one training iteration. Additionally, we perform performance tuning by searching for a better set of GPU workers. We confirmed that the proposed algorithm improves performance by 23% over synchronous SGD with one backup worker when training ResNet-194 on 8 GPUs of three different types: GTX 1080, GTX 1060, and Quadro M2000. The proposed BOA resolves the problem caused by the performance differences between GPU workers and accelerates the convergence of training.
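The core idea of coordinating local mini-batch sizes can be illustrated with a minimal sketch: split the global mini-batch in proportion to each worker's measured throughput rather than equally, so that all workers finish an iteration at roughly the same time. This is an illustrative sketch only, not the thesis implementation; the function name, worker ids, and throughput numbers below are assumptions.

    # Minimal sketch of throughput-proportional batch orchestration for
    # synchronous data parallelism. Not the thesis implementation; the
    # worker ids and throughput values below are illustrative assumptions.

    def orchestrate_batches(global_batch, throughputs):
        """Split global_batch among workers in proportion to throughput.

        throughputs: dict mapping worker id -> samples/sec measured in a
        profiling run. Faster workers receive larger local mini-batches,
        which shortens the wait on the slowest worker each iteration.
        """
        total = sum(throughputs.values())
        # Initial proportional allocation, rounded down.
        batches = {w: int(global_batch * t / total)
                   for w, t in throughputs.items()}
        # Hand the rounding remainder to the fastest workers so the
        # local sizes still sum exactly to the global mini-batch size.
        remainder = global_batch - sum(batches.values())
        for w in sorted(throughputs, key=throughputs.get, reverse=True)[:remainder]:
            batches[w] += 1
        return batches

    # Example with three heterogeneous GPU types (throughputs made up):
    throughputs = {"gtx1080": 420.0, "gtx1060": 250.0, "quadro_m2000": 130.0}
    print(orchestrate_batches(256, throughputs))
    # {'gtx1080': 135, 'gtx1060': 80, 'quadro_m2000': 41}

Under this allocation, the per-iteration time is governed by the combined throughput of the cluster rather than by the slowest worker's equal share, which is the hardware-efficiency gain the abstract describes.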
Advisors
Youn, Chan-Hyun (윤찬현)
Description
Korea Advanced Institute of Science and Technology : School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2018
Identifier
325007
Language
eng
Description

Master's thesis - Korea Advanced Institute of Science and Technology : School of Electrical Engineering, 2018.2, [iii, 46 p.]

Keywords

deep learning; distributed training; synchronous SGD; straggler problem; mini-batch; batch orchestration

URI
http://hdl.handle.net/10203/266871
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=734028&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's Theses)
Files in This Item
There are no files associated with this item.
