Accelerated resource scaling mechanisms for energy efficient deep learning cluster with power budget constraint

Abstract

Demand from heterogeneous applications (web service processing, web streaming, image processing, big data analytics) with diverse resource usage patterns (processor-intensive, memory-intensive, and I/O-intensive) has grown rapidly in high performance computing (HPC) clusters and data centers. Large-scale computing infrastructures are being built to accelerate parallel tasks that would otherwise require days of computation over massive data sets. However, as the computing performance of such clusters has improved, their power consumption has also exploded. Data centers alone currently account for about 1.3% of the world's total energy consumption, or 270 TWh, and this share is expected to rise to 8% by 2020. This energy usage is a major obstacle to expanding data center infrastructure while satisfying user service quality. The energy cost of cluster (data center) operation has therefore become one of the most important resource management issues from an economic perspective.

In modern HPC clusters, deep learning (DL) tasks for artificial intelligence services are emerging as major workloads alongside existing HPC tasks and web service processing. Because of the high computational complexity of DL tasks, there is growing interest in GPU-enabled server racks for HPC clusters dedicated to fast, data-rich task processing. Thanks to the parallel processing capability of the thousands of streaming processor (SP) cores on a single GPU chip, GPU devices train deep neural network (DNN) models, which require repetitive data processing, far faster than conventional CPU devices. Despite improvements in their FLOPS-per-watt efficiency, however, GPU-based clusters still consume non-negligible power compared to CPU-based clusters, so GPU power consumption is a key component in managing the energy efficiency of the entire cluster.

In this dissertation, we propose an elaborate and scalable power control algorithm and an associated modeling approach for energy-efficient GPU-enabled DL clusters. The first objective is sophisticated, highly scalable power control for GPU-enabled DL clusters under a limited power budget. In particular, we present a theoretical model transformation that decomposes the overall control problem so that optimal control decisions can be made in real time for large-scale clusters under a limited power budget. The second objective is to ensure the service level agreements (SLAs) required by each DL service user while taking into account dynamically fluctuating electricity prices and renewable power capacity. The proposed algorithm guarantees stable energy consumption and acceptable service latency for large-scale clusters even under uncertainty in these environmental variables.

The technical contributions of this dissertation consist of two parts. The first is an accelerated power control algorithm for energy-efficient deep learning processing in GPU-enabled data centers. We present a GPU-architecture-agnostic statistical modeling and energy-cost minimization technique for GPU-enabled DL clusters: we model the DL processing power consumption and processing performance of heterogeneous GPU servers without depending on the details of a particular GPU architecture, and we estimate the model parameters online in real time by adopting a recursive least squares (RLS) approach in the proposed system.
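The abstract does not specify the concrete form of the power and performance models, so the following is only a minimal illustration of the general idea: a hypothetical linear-in-parameters power model for one GPU server whose coefficients are tracked online with a standard recursive least squares update with a forgetting factor. The feature names (SM clock, utilization, batch size) and all numbers are illustrative assumptions, not the dissertation's actual regressors.

```python
# Minimal sketch: online RLS tracking of a linear-in-parameters GPU power model.
import numpy as np

class RLSPowerModel:
    """Online estimator of P_hat = theta^T x for one GPU server (illustrative)."""

    def __init__(self, n_features, forgetting=0.99, delta=100.0):
        self.theta = np.zeros(n_features)      # model parameters
        self.P = delta * np.eye(n_features)    # inverse correlation matrix
        self.lam = forgetting                  # forgetting factor in (0, 1]

    def update(self, x, power_measured):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)           # gain vector
        err = power_measured - self.theta @ x  # a-priori prediction error
        self.theta = self.theta + k * err      # parameter update
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return err

    def predict(self, x):
        return float(self.theta @ np.asarray(x, dtype=float))

# Hypothetical usage: features = [bias, SM clock (GHz), GPU utilization, batch size]
model = RLSPowerModel(n_features=4)
for clock, util, batch, watts in [(1.5, 0.9, 64, 180.0), (1.2, 0.7, 32, 120.0)]:
    model.update([1.0, clock, util, batch], watts)
print(model.predict([1.0, 1.4, 0.8, 64]))
```

Because each update is only a few vector operations, such an estimator could run once per control interval on every server without noticeable overhead.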
Moreover, we propose a highly scalable GPU power control algorithm based on dual acceleration. We use the Lagrangian dual decomposition technique to divide the large-scale power control problem into small subproblems, enabling distributed and parallel computation of the optimal control decisions. In particular, we exploit Lipschitz continuity to maximize the theoretical acceleration of the dual iteration toward the optimal control solution. The resulting distributed control architecture supports run-time control optimization within a few seconds for large-scale GPU-enabled clusters containing hundreds of servers, simply by adding local power controllers.
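The exact control formulation is not given in the abstract, so the sketch below only illustrates the named ingredients on a toy problem: a single dual price for the shared cluster power budget, closed-form local subproblems that the servers can solve in parallel, and a Nesterov-accelerated projected dual update whose step size 1/L is derived from a Lipschitz bound on the dual gradient. The per-server utility w_i·log(1 + p_i), the bounds, and all numbers are hypothetical.

```python
# Minimal sketch: accelerated dual decomposition of a shared power-budget problem.
import numpy as np

def local_power(lam, w, p_max):
    """Each server maximizes w*log(1 + p) - lam*p over [0, p_max] (closed form)."""
    p = w / max(lam, 1e-9) - 1.0               # interior stationary point
    return np.clip(p, 0.0, p_max)

def accelerated_dual_power_control(w, budget, p_max=250.0, iters=500):
    # Lipschitz bound on the dual gradient: each local response p_i*(lam) changes
    # by at most (1 + p_max)^2 / w_i per unit change of the price lam.
    L = np.sum((1.0 + p_max) ** 2 / w)
    lam = lam_prev = 1.0
    t = 1.0
    for _ in range(iters):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = lam + (t - 1.0) / t_next * (lam - lam_prev)  # Nesterov extrapolation
        p = local_power(max(y, 0.0), w, p_max)           # parallel local solves
        grad = budget - p.sum()                          # gradient of the dual
        lam_prev, lam = lam, max(0.0, y - grad / L)      # projected dual step
        t = t_next
    return local_power(lam, w, p_max), lam

# Hypothetical example: 200 heterogeneous servers sharing a 30 kW (30,000 W) budget
rng = np.random.default_rng(0)
alloc, price = accelerated_dual_power_control(rng.uniform(50.0, 150.0, 200), 30_000.0)
print(f"allocated {alloc.sum():.0f} W at dual price {price:.4f}")
```

Because the coupling constraint enters only through the scalar price, each iteration is embarrassingly parallel across the local controllers, which is what lets this style of control scale to hundreds of servers.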
The second contribution is a MACRO and MICRO time-scale (MAMI) based resource management scheme for energy-efficient deep learning services. We present hierarchical time-scale management for energy-efficient data centers with low dynamic resource scaling (DRS) switching overheads. The MAMI approach integrates DRS and frequency scaling (FS) to efficiently reduce the energy consumption of idle and active servers: under MAMI-based resource management, fine-grained FS can be applied to data centers while DRS switching overheads are minimized in response to environmental variables such as the electricity market price, renewable power capacity, and the given service quality requirements. Moreover, we propose a stochastic joint optimization technique for managing the risk of prediction errors. A multi-scenario stochastic joint optimization method mitigates the risk costs (undesirable energy consumption and unacceptable service latency) caused by uncertainty in environmental parameters such as the electricity market price and renewable power capacity; by generating multiple scenarios, we derive stable resource management decisions that cope efficiently with the possible real-world realizations. We also present a logarithm-based model transformation that converts the non-convex control optimization problem into a convex one, so that it can be solved with a conventional optimizer.

To evaluate these contributions, we deployed various DNN models, including AlexNet, ResNet, VGGNet, and GoogLeNet, on a lab-scale testbed consisting of multiple GPU servers based on the NVIDIA Pascal architecture (GTX 1060/1080), multiple local power controllers, and a coordinator. The proposed statistical DL power and performance modeling method yields accurate control decisions for the invoked heterogeneous DL tasks: it minimizes deadline violations of each invoked DL task while satisfying the dynamic power budget constraint. The Lipschitz-continuity-based dual acceleration obtains a run-time optimal solution within a few seconds using multiple local power controllers for large-scale clusters with more than 200 GPU servers. We then evaluate the proposed MAMI resource management and stochastic joint optimization approach for energy-efficient large-scale data centers using real trace data retrieved from the Measurement and Instrumentation Data Center (MIDC) and the Federal Energy Regulatory Commission (FERC). The proposed system was implemented with the Keras deep learning framework. The MAMI/stochastic approach achieves 25% energy cost savings over existing meta-heuristics such as a constrained genetic algorithm (GA) and reference-point-based power control methods, while satisfying quality-of-service requirements ranging from tight to loose service latency.
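As a closing illustration of the MAMI idea from the second contribution, the toy sketch below fixes the MACRO decision (number of active servers) across all scenarios while letting the MICRO decision (a frequency level) adapt per scenario, and minimizes the expected cost over a few price/renewable scenarios by enumeration. The scenario values, the cubic frequency-power relation, and the cost terms are all hypothetical stand-ins for the dissertation's models.

```python
# Minimal sketch: two-time-scale (MACRO/MICRO) decision under price/renewable scenarios.

# Hypothetical scenarios: (probability, electricity price in $/kWh, renewable kW)
scenarios = [(0.5, 0.10, 5.0), (0.3, 0.15, 2.0), (0.2, 0.25, 0.5)]

servers_max = 20
freq_levels = [0.6, 0.8, 1.0]                    # normalized FS candidates
workload, deadline = 400.0, 1.0                  # task units to drain, hours
power_per_server, rate_per_server = 0.3, 30.0    # kW at full frequency, units/hour
switch_cost, latency_penalty = 0.2, 50.0         # $/activated server, $/hour late

def scenario_cost(n, f, price, renewable):
    """Energy-plus-SLA cost of running n servers at frequency f in one scenario."""
    rate = n * rate_per_server * f                       # processing rate (units/h)
    latency = workload / rate                            # hours to drain the queue
    grid_kw = max(n * power_per_server * f ** 3 - renewable, 0.0)  # assumed cubic law
    return price * grid_kw * latency + latency_penalty * max(latency - deadline, 0.0)

best = None
for n in range(1, servers_max + 1):                      # MACRO decision: DRS level
    expected = switch_cost * n                           # rough activation overhead
    for prob, price, renewable in scenarios:
        # MICRO decision: pick the best frequency for this scenario given n
        expected += prob * min(scenario_cost(n, f, price, renewable) for f in freq_levels)
    if best is None or expected < best[0]:
        best = (expected, n)

print(f"expected cost ${best[0]:.2f} with {best[1]} active servers")
```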
Advisors
Youn, Chan-Hyun (윤찬현)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2019
Identifier
325007
Language
eng
Description

Doctoral thesis - Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering, 2019.8, [ix, 148 p.]

Keywords

GPU power control; deep learning cluster; distributed model predictive control; Lipschitz continuity; sequential quadratic programming method; renewable power generation; unsupervised learning; long short-term memory

URI
http://hdl.handle.net/10203/283273
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=871447&flag=dissertation
Appears in Collection
EE-Theses_Ph.D.(박사논문)