DSpace at KOASAS: Towards GPU-driven Code Execution for Distributed Deep Learning

Towards GPU-driven Code Execution for Distributed Deep Learning

DC Field	Value	Language
dc.contributor.author	Hwang, Changho	ko
dc.contributor.author	Park, Kyoung-Soo	ko
dc.contributor.author	Shu, Ran	ko
dc.contributor.author	Qu, Xinyuan	ko
dc.contributor.author	Cheng, Peng	ko
dc.contributor.author	Xiong, Yongqiang	ko
dc.date.accessioned	2022-11-18T03:04:10Z	-
dc.date.available	2022-11-18T03:04:10Z	-
dc.date.created	2022-07-15	-
dc.date.issued	2022-06-19	-
dc.identifier.citation	Machine Learning for Computer Architecture and Systems (MLArchSys'22)	-
dc.identifier.uri	http://hdl.handle.net/10203/299943	-
dc.description.abstract	Modern state-of-the-art deep learning (DL) applications tend to scale out to a large number of parallel GPUs. Unfortunately, we observe that the collective communication overhead across GPUs is often the key limiting factor of performance for distributed DL. This communication under-utilizes the network bandwidth through frequent transfers of small data chunks, which also incur a substantial I/O overhead on the GPU that interferes with GPU computation. The root cause lies in the inefficiency of CPU-based communication event handling as well as the inability to control the GPU's internal DMA engine with GPU threads. To address the problem, we propose a GPU-driven code execution system that leverages a GPU-controlled hardware DMA engine for I/O offloading. Our custom DMA engine pipelines multiple DMA requests to support efficient small data transfers while eliminating the I/O overhead on GPU cores. Unlike existing GPU DMA engines, which can be initiated only by the CPU, we let GPU threads directly control DMA operations. This leads to a highly efficient system in which GPUs drive their own execution flow and handle communication events autonomously without CPU intervention. Our prototype DMA engine achieves line rate with messages as small as 8KB (3.87x better throughput) with only 4.32µs of communication latency (9.1x faster) while incurring little interference with computation on the GPU, achieving 1.82x higher all-reduce throughput in a real training workload.	-
dc.language	English	-
dc.publisher	ACM/IEEE	-
dc.title	Towards GPU-driven Code Execution for Distributed Deep Learning	-
dc.type	Conference	-
dc.type.rims	CONF	-
dc.citation.publicationname	Machine Learning for Computer Architecture and Systems (MLArchSys'22)	-
dc.identifier.conferencecountry	US	-
dc.identifier.conferencelocation	New York City	-
dc.contributor.localauthor	Park, Kyoung-Soo	-
dc.contributor.nonIdAuthor	Shu, Ran	-
dc.contributor.nonIdAuthor	Qu, Xinyuan	-
dc.contributor.nonIdAuthor	Cheng, Peng	-
dc.contributor.nonIdAuthor	Xiong, Yongqiang	-

Appears in Collection: EE-Conference Papers

Files in This Item: There are no files associated with this item.

Knowledge Service Development Team, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. T. 82-42-350-4493 Email. koasas@kaist.ac.kr
Copyright © 2016. Korea Advanced Institute of Science and Technology. All Rights Reserved.
