ARK: GPU-driven Code Execution for Distributed Deep Learning

DC Field | Value | Language
dc.contributor.author | Hwang, Changho | ko
dc.contributor.author | Park, KyoungSoo | ko
dc.contributor.author | Shu, Ran | ko
dc.contributor.author | Qu, Xinyuan | ko
dc.contributor.author | Cheng, Peng | ko
dc.contributor.author | Xiong, Yongqiang | ko
dc.date.accessioned | 2023-11-21T06:01:19Z | -
dc.date.available | 2023-11-21T06:01:19Z | -
dc.date.created | 2023-11-21 | -
dc.date.issued | 2023-04-17 | -
dc.identifier.citation | 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, pp. 87-101 | -
dc.identifier.uri | http://hdl.handle.net/10203/314925 | -
dc.description.abstract | Modern state-of-the-art deep learning (DL) applications tend to scale out to a large number of parallel GPUs. Unfortunately, we observe that the collective communication overhead across GPUs is often the key limiting factor of performance for distributed DL. It under-utilizes network bandwidth with frequent transfers of small data chunks, which also incur substantial I/O overhead on the GPU that interferes with its computation. The root cause lies in the inefficiency of CPU-based communication event handling as well as the inability to control the GPU's internal DMA engine with GPU threads. To address the problem, we propose a GPU-driven code execution system that leverages a GPU-controlled hardware DMA engine for I/O offloading. Our custom DMA engine pipelines multiple DMA requests to support efficient small data transfers while eliminating the I/O overhead on GPU cores. Unlike existing GPU DMA engines that are initiated only by the CPU, we let GPU threads directly control DMA operations, which leads to a highly efficient system where GPUs drive their own execution flow and handle communication events autonomously without CPU intervention. Our prototype DMA engine achieves line rate starting from a message size as small as 8 KB (3.9x better throughput) with only 4.3 µs of communication latency (9.1x faster), while incurring little interference with GPU computation, achieving 1.8x higher all-reduce throughput in a real training workload. | -
dc.language | English | -
dc.publisher | USENIX Association | -
dc.title | ARK: GPU-driven Code Execution for Distributed Deep Learning | -
dc.type | Conference | -
dc.identifier.wosid | 001066630000006 | -
dc.identifier.scopusid | 2-s2.0-85159281407 | -
dc.type.rims | CONF | -
dc.citation.beginningpage | 87 | -
dc.citation.endingpage | 101 | -
dc.citation.publicationname | 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023 | -
dc.identifier.conferencecountry | US | -
dc.identifier.conferencelocation | Boston, MA | -
dc.contributor.localauthor | Park, KyoungSoo | -
dc.contributor.nonIdAuthor | Shu, Ran | -
dc.contributor.nonIdAuthor | Qu, Xinyuan | -
dc.contributor.nonIdAuthor | Cheng, Peng | -
dc.contributor.nonIdAuthor | Xiong, Yongqiang | -
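
The abstract above describes GPU threads posting DMA requests themselves rather than waiting on the CPU. As a rough illustration of that submission pattern only, and not ARK's actual interface, the CUDA sketch below has a kernel enqueue hypothetical DmaDesc descriptors into a small ring buffer and bump a doorbell counter; the host then drains the queue in place of a real DMA engine. All names (DmaDesc, producer, the doorbell layout) are assumptions made for this example.

```cuda
// Minimal sketch (not ARK's actual API): a GPU kernel posts hypothetical DMA
// descriptors and "rings a doorbell"; the host stands in for the DMA engine.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

struct DmaDesc {            // hypothetical descriptor: copy `bytes` from src to dst
    void*  src;
    void*  dst;
    size_t bytes;
};

constexpr int kQueueSize = 16;

__global__ void producer(DmaDesc* queue, unsigned long long* doorbell,
                         void* src, void* dst, size_t chunk, int nchunks) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int i = 0; i < nchunks; ++i) {
            int slot = i % kQueueSize;
            queue[slot] = DmaDesc{ (char*)src + i * chunk,
                                   (char*)dst + i * chunk, chunk };
            __threadfence_system();      // make the descriptor visible outside the GPU
            atomicAdd(doorbell, 1ULL);   // "ring the doorbell": one more request posted
        }
    }
}

int main() {
    DmaDesc* queue;  unsigned long long* doorbell;
    char *src, *dst;
    size_t chunk = 8 * 1024;  int nchunks = 8;   // 8 KB chunks, matching the reported message size

    cudaMallocManaged(&queue, kQueueSize * sizeof(DmaDesc));
    cudaMallocManaged(&doorbell, sizeof(unsigned long long));
    cudaMallocManaged(&src, chunk * nchunks);
    cudaMallocManaged(&dst, chunk * nchunks);
    *doorbell = 0;

    producer<<<1, 32>>>(queue, doorbell, src, dst, chunk, nchunks);
    cudaDeviceSynchronize();             // sketch only: drain after the kernel finishes

    // Host plays the role of the DMA engine: consume every posted descriptor.
    for (unsigned long long i = 0; i < *doorbell; ++i) {
        DmaDesc d = queue[i % kQueueSize];
        memcpy(d.dst, d.src, d.bytes);   // a real engine would DMA src -> dst here
    }
    printf("consumed %llu descriptors\n", *doorbell);

    cudaFree(queue); cudaFree(doorbell); cudaFree(src); cudaFree(dst);
    return 0;
}
```

In the system described by the abstract, descriptors would be consumed by a hardware DMA engine concurrently with the running kernel, which is what keeps the CPU off the critical path; the host-side drain here only makes the sketch self-contained.
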
Appears in Collection
EE-Conference Papers (Conference Papers)
Files in This Item
There are no files associated with this item.
