DC Field | Value | Language |
---|---|---|
dc.contributor.author | Hwang, Changho | ko |
dc.contributor.author | Park, Kyoung-Soo | ko |
dc.contributor.author | Shu, Ran | ko |
dc.contributor.author | Qu, Xinyuan | ko |
dc.contributor.author | Cheng, Peng | ko |
dc.contributor.author | Xiong, Yongqiang | ko |
dc.date.accessioned | 2022-11-18T03:04:10Z | - |
dc.date.available | 2022-11-18T03:04:10Z | - |
dc.date.created | 2022-07-15 | - |
dc.date.issued | 2022-06-19 | - |
dc.identifier.citation | Machine Learning for Computer Architecture and Systems (MLArchSys'22) | - |
dc.identifier.uri | http://hdl.handle.net/10203/299943 | - |
dc.description.abstract | Modern state-of-the-art deep learning (DL) applications tend to scale out to a large number of parallel GPUs. Unfortunately, we observe that the collective communication overhead across GPUs is often the key limiting factor of performance for distributed DL. It under-utilizes the network bandwidth with frequent transfers of small data chunks, which also incurs substantial I/O overhead on the GPU that interferes with GPU computation. The root cause lies in the inefficiency of CPU-based communication event handling as well as the inability to control the GPU's internal DMA engine with GPU threads. To address the problem, we propose a GPU-driven code execution system that leverages a GPU-controlled hardware DMA engine for I/O offloading. Our custom DMA engine pipelines multiple DMA requests to support efficient small data transfers while eliminating the I/O overhead on GPU cores. Unlike existing GPU DMA engines that are initiated only by the CPU, we let GPU threads directly control DMA operations, which leads to a highly efficient system where GPUs drive their own execution flow and handle communication events autonomously without CPU intervention. Our prototype DMA engine achieves line rate from a message size as small as 8 KB (3.87x better throughput) with only 4.32 µs of communication latency (9.1x faster), while incurring little interference with computation on the GPU and achieving 1.82x higher all-reduce throughput in a real training workload. | - |
dc.language | English | - |
dc.publisher | ACM/IEEE | - |
dc.title | Towards GPU-driven Code Execution for Distributed Deep Learning | - |
dc.type | Conference | - |
dc.type.rims | CONF | - |
dc.citation.publicationname | Machine Learning for Computer Architecture and Systems (MLArchSys'22) | - |
dc.identifier.conferencecountry | US | - |
dc.identifier.conferencelocation | New York City | - |
dc.contributor.localauthor | Park, Kyoung-Soo | - |
dc.contributor.nonIdAuthor | Shu, Ran | - |
dc.contributor.nonIdAuthor | Qu, Xinyuan | - |
dc.contributor.nonIdAuthor | Cheng, Peng | - |
dc.contributor.nonIdAuthor | Xiong, Yongqiang | - |