Managing heterogeneous SLO-aware machine learning inference tasks for GPU-accelerated servers

As machine learning is applied to a wide range of applications, high-throughput machine learning (ML) inference servers accelerated by GPUs have become critical for online services. Such ML inference servers pose two challenges: first, they must provide bounded latency for each request to meet consistent service-level objectives (SLOs), and second, they must serve multiple heterogeneous ML models in one system, since certain tasks involve invoking multiple models and consolidating models improves system utilization. This dissertation addresses these two requirements of ML inference servers by proposing a new ML inference scheduling paradigm based on hardware-assisted resource partitioning of GPUs.

The first part of this dissertation shows how partitioning can improve the GPU utilization of ML inference tasks and thereby enhance throughput. Conventional SLO-oriented schedulers that rely on time-sharing and batching to schedule multiple models cannot fully utilize GPUs, mainly because SLO constraints limit the achievable batch size. Hence, this dissertation proposes a new key mechanism that exploits hardware support for spatial partitioning of GPU resources. The partitioning mechanism creates a new abstraction layer over GPU resources with configurable resource amounts, and the scheduler assigns requests to virtual GPUs, called gpulets, with the most effective amount of resources. Our prototype implementation shows that spatial partitioning significantly improves overall throughput by utilizing GPUs better while satisfying SLOs, compared to conventional time-sharing schedulers. Unlike prior work, the scheduler efficiently explores a three-dimensional search space over batch sizes, temporal sharing, and spatial sharing.

The second part of this dissertation investigates a remedy for the interference that arises when two ML tasks run concurrently on a GPU. When two ML tasks run on the same GPU, they compete for shared GPU resources, which leads to interference. We develop an analytical model that predicts the latency overhead for a given pair of tasks based on the profiled characteristics of each ML inference task. We evaluate the proposed interference model by checking whether it can verify a given scheduling result; our experimental results show that it helps detect faulty scheduling decisions that would cause SLO violations.

Lastly, we explore how gpulets can be extended to schedule tasks across multiple heterogeneous GPUs in a distributed environment. Unlike training tasks, ML inference tasks cannot fully utilize a GPU and do not require multiple homogeneous GPUs per task. Providing a cost-effective ML serving system that still satisfies SLOs therefore requires an SLO-aware scheduler that can efficiently utilize heterogeneous GPUs. To address this challenge, we propose a scheduling scheme that extends the gpulet concept from the first part of this dissertation. In addition, our prototype framework auto-scales the number of GPUs required for a given workload, minimizing the cost of cloud-based inference servers.
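
To make the gpulet idea more concrete, the minimal Python sketch below shows one way a scheduler could search two of the three dimensions mentioned in the abstract (batch size and spatial partition size); temporal sharing is omitted for brevity. This is an illustrative assumption, not the dissertation's implementation: the profile_estimate function, the latency model inside it, the candidate GPU fractions, and the scoring heuristic are all hypothetical placeholders for offline profiling data.

from dataclasses import dataclass
from itertools import product

@dataclass
class Candidate:
    batch_size: int
    gpu_fraction: int      # percentage of GPU resources assigned to the gpulet
    throughput: float      # estimated requests per second
    latency_ms: float      # estimated per-batch latency

def profile_estimate(model_name, batch_size, gpu_fraction):
    # Placeholder for a lookup into offline profiling data. A real system would
    # interpolate measured (batch size, GPU fraction) -> latency points instead
    # of using this made-up analytical form.
    fixed_ms, per_item_ms = {"resnet50": (3.0, 1.5), "bert": (5.0, 4.0)}[model_name]
    effective = (gpu_fraction / 100.0) ** 0.8      # diminishing returns from larger partitions
    latency_ms = (fixed_ms + per_item_ms * batch_size) / effective
    throughput = batch_size / (latency_ms / 1000.0)
    return Candidate(batch_size, gpu_fraction, throughput, latency_ms)

def best_gpulet_config(model_name, slo_ms,
                       batch_sizes=(1, 2, 4, 8, 16, 32),
                       fractions=(20, 40, 50, 60, 80, 100)):
    # Enumerate the (batch size, GPU fraction) plane, keep SLO-feasible points,
    # and prefer the candidate with the best throughput per unit of GPU resource.
    candidates = [profile_estimate(model_name, b, f)
                  for b, f in product(batch_sizes, fractions)]
    feasible = [c for c in candidates if c.latency_ms <= slo_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.throughput / c.gpu_fraction)

print(best_gpulet_config("resnet50", slo_ms=50.0))

Under these made-up profiles the search tends to select a partial GPU fraction with a moderate batch size, which reflects the intuition behind gpulets: an SLO-bounded batch rarely needs a whole GPU, so carving the GPU into right-sized pieces raises aggregate throughput.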
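The interference model from the second part can likewise be illustrated with a heavily simplified sketch. The model form below (a linear latency penalty on memory-bandwidth oversubscription) and every name and number in it are assumptions for illustration only; the dissertation's actual analytical model is derived from profiled task characteristics and is not reproduced here.

from dataclasses import dataclass

@dataclass
class TaskProfile:
    name: str
    solo_latency_ms: float   # latency when running alone on its partition
    mem_bw_util: float       # profiled fraction of GPU memory bandwidth used (0..1)
    slo_ms: float

def predicted_latency(task: TaskProfile, neighbor: TaskProfile, alpha: float = 1.0) -> float:
    # alpha is a hypothetical fitted sensitivity coefficient. If the combined
    # bandwidth demand exceeds what the GPU can supply, assume latency grows in
    # proportion to the oversubscription.
    oversubscription = max(0.0, task.mem_bw_util + neighbor.mem_bw_util - 1.0)
    return task.solo_latency_ms * (1.0 + alpha * oversubscription)

def violates_slo(pair):
    a, b = pair
    return predicted_latency(a, b) > a.slo_ms or predicted_latency(b, a) > b.slo_ms

resnet = TaskProfile("resnet50", solo_latency_ms=30.0, mem_bw_util=0.6, slo_ms=50.0)
bert   = TaskProfile("bert",     solo_latency_ms=40.0, mem_bw_util=0.7, slo_ms=60.0)
print(violates_slo((resnet, bert)))   # False: the predicted overhead stays within both SLOs here

In this toy setting the pair fits within both SLOs; a scheduler could run such a check before co-locating two gpulets on one GPU, which mirrors the verification role the abstract describes for the interference model.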
Advisors
Huh, Jaehyuk (허재혁)
Description
Korea Advanced Institute of Science and Technology (KAIST), School of Computing
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST) : School of Computing, 2022.8, [v, 67 p.]

Keywords

Machine learning; GPU; Inference; Multi-model execution; SLO

URI
http://hdl.handle.net/10203/309277
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1007883&flag=dissertation
Appears in Collection
CS-Theses_Ph.D. (Doctoral theses)
Files in This Item
There are no files associated with this item.
