Cooperative Distributed GPU Power Capping for Deep Learning Clusters

Cited 5 times in Web of Science · Cited 0 times in Scopus
  • Hit : 1480
  • Download : 0
DC Field | Value | Language
dc.contributor.author | Kang, Dong-Ki | ko
dc.contributor.author | Ha, Yungi | ko
dc.contributor.author | Peng, Limei | ko
dc.contributor.author | Youn, Chan-Hyun | ko
dc.date.accessioned | 2022-02-25T06:41:08Z | -
dc.date.available | 2022-02-25T06:41:08Z | -
dc.date.created | 2021-09-09 | -
dc.date.issued | 2022-07 | -
dc.identifier.citation | IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, v.69, no.7, pp.7244 - 7254 | -
dc.identifier.issn | 1557-9948 | -
dc.identifier.uri | http://hdl.handle.net/10203/292394 | -
dc.description.abstract | deep neural network (DNN) models, and high computational complexity. Thus, the traditional power capping methods for CPU-based clusters or small-scale GPU devices do not apply to GPU-based clusters handling DL tasks. This paper develops a cooperative distributed GPU power capping (CD-GPC) system for GPU-based clusters, aiming to minimize the training completion time of invoked DL tasks without exceeding the limited power budget. Specifically, we first design a frequency scaling (FS) approach using online model estimation based on the recursive least squares (RLS) method. This approach achieves accurate tuning of DL task training time and GPU power usage without requiring offline profiling. Then, we formulate the proposed FS problem as a Lagrangian dual decomposition-based economic model predictive control (EMPC) problem for large-scale heterogeneous GPU clusters. We conduct both lab-scale experiments on real NVIDIA GPUs and simulation experiments driven by real job traces for performance evaluation. Experimental results validate that the proposed system improves power capping accuracy to a mean absolute error below 1%, and reduces the deadline violation ratio of invoked DL tasks by 21.5% compared with other recent counterparts. | -
dc.language | English | -
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | -
dc.title | Cooperative Distributed GPU Power Capping for Deep Learning Clusters | -
dc.type | Article | -
dc.identifier.wosid | 000753527500074 | -
dc.identifier.scopusid | 2-s2.0-85110848368 | -
dc.type.rims | ART | -
dc.citation.volume | 69 | -
dc.citation.issue | 7 | -
dc.citation.beginningpage | 7244 | -
dc.citation.endingpage | 7254 | -
dc.citation.publicationname | IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS | -
dc.identifier.doi | 10.1109/TIE.2021.3095790 | -
dc.contributor.localauthor | Youn, Chan-Hyun | -
dc.contributor.nonIdAuthor | Kang, Dong-Ki | -
dc.contributor.nonIdAuthor | Peng, Limei | -
dc.description.isOpenAccess | N | -
dc.subject.keywordAuthor | Deep learning (DL) cluster | -
dc.subject.keywordAuthor | Economic model predictive control (EMPC) | -
dc.subject.keywordAuthor | GPU power capping | -
dc.subject.keywordAuthor | Lagrangian dual decomposition | -
dc.subject.keywordAuthor | Lipschitz continuity | -
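The abstract above describes online model estimation based on recursive least squares (RLS) for tuning GPU frequency against training time and power usage. As a rough illustration only, and not the authors' implementation, the following Python sketch shows a standard RLS update with a forgetting factor, fitting a hypothetical affine power-versus-frequency model; the class name, feature choice, and numeric values are assumptions made for the example.

import numpy as np

class RLSEstimator:
    """Recursive least squares (RLS) with a forgetting factor (illustrative sketch)."""
    def __init__(self, dim, forgetting=0.98, delta=1e3):
        self.theta = np.zeros(dim)        # estimated model parameters
        self.P = delta * np.eye(dim)      # inverse correlation matrix
        self.forgetting = forgetting      # forgetting factor, 0 < value <= 1

    def update(self, x, y):
        # Fold in one observation: feature vector x, measured output y.
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.forgetting + x @ Px)   # Kalman-style gain vector
        error = y - self.theta @ x               # prediction error on this sample
        self.theta = self.theta + gain * error   # parameter update
        self.P = (self.P - np.outer(gain, Px)) / self.forgetting
        return self.theta

# Hypothetical usage: fit GPU power (W) as an affine function of SM frequency (MHz).
estimator = RLSEstimator(dim=2)
for freq_mhz, power_w in [(1200, 180.0), (1350, 205.0), (1500, 232.0)]:
    estimator.update([freq_mhz, 1.0], power_w)
print(estimator.theta)   # roughly [slope in W/MHz, intercept in W]

With such an online estimate in hand, a controller of the kind the abstract describes could solve the EMPC problem over the predicted power and time models at each step; that optimization layer is beyond this sketch.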
Appears in Collection
EE-Journal Papers (Journal Papers)
Files in This Item
There are no files associated with this item.