Optimizing the aggregate throughput of concurrent deep learning jobs on a shared cluster

The explosive popularity of deep learning (DL) has driven the evolution of deep learning frameworks. Unfortunately, despite the need to run multiple deep learning jobs on a shared GPU cluster, current cloud schedulers often fail to schedule them efficiently. Managing resources for deep learning models without sufficient information or expertise leads to poor scalability and degrades overall cluster performance. In this paper, we present Max-Speedup, a scheduling policy for multi-tenant deep learning jobs on a shared GPU cluster. We address two main challenges: 1) precise estimation of training throughput to analyze the resource-performance trade-off of a deep learning model, and 2) an efficient scheduling policy for multi-tenant deep learning jobs on a shared GPU cluster. We tackle these problems by estimating the finish time of parameter synchronization and by maximizing aggregate speedup through the performance-resource trade-offs of DL jobs. Our evaluation shows that Max-Speedup improves average job completion time by 3x over SRTF (shortest remaining time first) while reducing makespan by up to 26.9x.
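
A minimal sketch of the idea behind such a policy, assuming each job has a profiled throughput estimate per GPU count: GPUs are handed out greedily to whichever job gains the most speedup from one more GPU, so the cluster-wide aggregate speedup grows fastest. All names (Job, allocate) and the throughput numbers below are illustrative assumptions, not the thesis's actual implementation.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    # Estimated training throughput (e.g. samples/sec) for each GPU count,
    # hypothetically obtained by profiling parameter-synchronization times.
    throughput: Dict[int, float]
    gpus: int = 0

    def speedup(self, g: int) -> float:
        # Speedup relative to one GPU; zero GPUs means the job is idle.
        return 0.0 if g == 0 else self.throughput[g] / self.throughput[1]


def allocate(jobs: List[Job], total_gpus: int) -> None:
    # Hand out GPUs one at a time to the job whose marginal speedup gain
    # is largest, maximizing aggregate speedup across all jobs.
    for _ in range(total_gpus):
        best, best_gain = None, 0.0
        for job in jobs:
            nxt = job.gpus + 1
            if nxt not in job.throughput:
                continue  # no throughput estimate beyond this point
            gain = job.speedup(nxt) - job.speedup(job.gpus)
            if gain > best_gain:
                best, best_gain = job, gain
        if best is None:
            break  # no remaining job benefits from another GPU
        best.gpus += 1


jobs = [
    Job("resnet50", {1: 100.0, 2: 190.0, 3: 260.0, 4: 300.0}),
    Job("lstm",     {1: 80.0,  2: 110.0, 3: 120.0, 4: 122.0}),
]
allocate(jobs, 4)
print({j.name: j.gpus for j in jobs})  # {'resnet50': 3, 'lstm': 1}

Because the job that scales well shows larger marginal gains, it receives more GPUs than the communication-bound one; a full scheduler would additionally weigh remaining work, as the SRTF baseline does.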
Advisors
Park, Kyoung Soo (박경수)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2018
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering (Semiconductor Interdisciplinary Program), 2018.8, [3, 27 p.]

Keywords

Job scheduler; deep learning; performance estimation; GPU cluster; resource management

URI
http://hdl.handle.net/10203/266783
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=828576&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
