ARK: GPU-driven Code Execution for Distributed Deep Learning

DC Field | Value | Language
dc.contributor.author | Hwang, Changho | ko
dc.contributor.author | Park, KyoungSoo | ko
dc.contributor.author | Shu, Ran | ko
dc.contributor.author | Qu, Xinyuan | ko
dc.contributor.author | Cheng, Peng | ko
dc.contributor.author | Xiong, Yongqiang | ko
dc.date.accessioned | 2023-11-21T06:01:19Z | -
dc.date.available | 2023-11-21T06:01:19Z | -
dc.date.created | 2023-11-21 | -
dc.date.issued | 2023-04-17 | -
dc.identifier.citation | 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, pp. 87-101 | -
dc.identifier.uri | http://hdl.handle.net/10203/314925 | -
dc.description.abstract | Modern state-of-the-art deep learning (DL) applications tend to scale out to a large number of parallel GPUs. Unfortunately, we observe that the collective communication overhead across GPUs is often the key limiting factor of performance for distributed DL. It under-utilizes network bandwidth with frequent transfers of small data chunks, which also incur substantial I/O overhead on the GPU that interferes with its computation. The root cause lies in the inefficiency of CPU-based communication event handling as well as the inability to control the GPU's internal DMA engine with GPU threads. To address the problem, we propose a GPU-driven code execution system that leverages a GPU-controlled hardware DMA engine for I/O offloading. Our custom DMA engine pipelines multiple DMA requests to support efficient small data transfers while eliminating the I/O overhead on GPU cores. Unlike existing GPU DMA engines that are initiated only by the CPU, we let GPU threads directly control DMA operations, which leads to a highly efficient system where GPUs drive their own execution flow and handle communication events autonomously without CPU intervention. Our prototype DMA engine achieves line rate starting from a message size as small as 8 KB (3.9x better throughput) with only 4.3 µs of communication latency (9.1x faster), while incurring little interference with GPU computation, achieving 1.8x higher all-reduce throughput in a real training workload. | -
dc.language | English | -
dc.publisher | USENIX Association | -
dc.title | ARK: GPU-driven Code Execution for Distributed Deep Learning | -
dc.type | Conference | -
dc.identifier.wosid | 001066630000006 | -
dc.identifier.scopusid | 2-s2.0-85159281407 | -
dc.type.rims | CONF | -
dc.citation.beginningpage | 87 | -
dc.citation.endingpage | 101 | -
dc.citation.publicationname | 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023 | -
dc.identifier.conferencecountry | US | -
dc.identifier.conferencelocation | Boston, MA | -
dc.contributor.localauthor | Park, KyoungSoo | -
dc.contributor.nonIdAuthor | Shu, Ran | -
dc.contributor.nonIdAuthor | Qu, Xinyuan | -
dc.contributor.nonIdAuthor | Cheng, Peng | -
dc.contributor.nonIdAuthor | Xiong, Yongqiang | -
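
The abstract above describes GPU threads posting DMA requests themselves rather than waiting on the CPU. As a rough illustration of that submission pattern only, and not ARK's actual interface, the CUDA sketch below has a kernel enqueue hypothetical DmaDesc descriptors into a small ring buffer and bump a doorbell counter; the host then drains the queue in place of a real DMA engine. All names (DmaDesc, producer, the doorbell layout) are assumptions made for this example.

```cuda
// Minimal sketch (not ARK's actual API): a GPU kernel posts hypothetical DMA
// descriptors and "rings a doorbell"; the host stands in for the DMA engine.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

struct DmaDesc {            // hypothetical descriptor: copy `bytes` from src to dst
    void*  src;
    void*  dst;
    size_t bytes;
};

constexpr int kQueueSize = 16;

__global__ void producer(DmaDesc* queue, unsigned long long* doorbell,
                         void* src, void* dst, size_t chunk, int nchunks) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int i = 0; i < nchunks; ++i) {
            int slot = i % kQueueSize;
            queue[slot] = DmaDesc{ (char*)src + i * chunk,
                                   (char*)dst + i * chunk, chunk };
            __threadfence_system();      // make the descriptor visible outside the GPU
            atomicAdd(doorbell, 1ULL);   // "ring the doorbell": one more request posted
        }
    }
}

int main() {
    DmaDesc* queue;  unsigned long long* doorbell;
    char *src, *dst;
    size_t chunk = 8 * 1024;  int nchunks = 8;   // 8 KB chunks, matching the reported message size

    cudaMallocManaged(&queue, kQueueSize * sizeof(DmaDesc));
    cudaMallocManaged(&doorbell, sizeof(unsigned long long));
    cudaMallocManaged(&src, chunk * nchunks);
    cudaMallocManaged(&dst, chunk * nchunks);
    *doorbell = 0;

    producer<<<1, 32>>>(queue, doorbell, src, dst, chunk, nchunks);
    cudaDeviceSynchronize();             // sketch only: drain after the kernel finishes

    // Host plays the role of the DMA engine: consume every posted descriptor.
    for (unsigned long long i = 0; i < *doorbell; ++i) {
        DmaDesc d = queue[i % kQueueSize];
        memcpy(d.dst, d.src, d.bytes);   // a real engine would DMA src -> dst here
    }
    printf("consumed %llu descriptors\n", *doorbell);

    cudaFree(queue); cudaFree(doorbell); cudaFree(src); cudaFree(dst);
    return 0;
}
```

In the system described by the abstract, descriptors would be consumed by a hardware DMA engine concurrently with the running kernel, which is what keeps the CPU off the critical path; the host-side drain here only makes the sketch self-contained.
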
Appears in Collection
EE-Conference Papers (Conference Papers)
Files in This Item
There are no files associated with this item.
