Logical/Physical Topology-Aware Collective Communication in Deep Learning Training.

Training is a necessary step before deep learning network models can be deployed. To scale training, multiple GPUs are commonly used with data parallelism to exploit the additional GPU compute and memory capacity. However, one challenge in scalability is the collective communication between GPUs. In this work, we propose to accelerate the AllReduce collective. AllReduce communication is often based on a logical topology (e.g., ring or tree algorithms) that is mapped to a physical topology, i.e., the physical connectivity between the nodes. We propose a logical/physical topology-aware collective communication that we refer to as the C-Cube architecture (Chaining Collective Communication with Computation). C-Cube exploits the opportunity to overlap, or chain, the different phases of collective communication, as well as forward computation, in a tree-algorithm AllReduce. We exploit the communication pattern in a logical tree topology to overlap the different phases of communication. Since ordering is maintained in the tree collective algorithm, we propose gradient queuing to enable chaining of communication with forward computation, which accelerates overall performance while having no impact on training accuracy. We also exploit the characteristics of the physical topology to further improve performance: we propose detour connections for collective communication and leverage the additional connectivity to enable a double-tree C-Cube implementation. We implement a C-Cube proof-of-concept on a real system (8-GPU NVIDIA DGX-1) and show that C-Cube improves both communication performance and overall performance compared to non-overlapped tree algorithms.
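For readers unfamiliar with the baseline, the following is a minimal sketch of a non-overlapped tree AllReduce, simulated on a single process. It is not the C-Cube implementation. The function name `tree_allreduce` and the implicit binary-tree layout (rank 0 as root, rank r with children 2r+1 and 2r+2) are assumptions chosen for illustration only. It shows the two phases, reduce toward the root and then broadcast back down, which the abstract describes as candidates for overlapping.

```python
from typing import List


def tree_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Baseline (non-overlapped) tree AllReduce over simulated ranks.

    Assumed layout (hypothetical, for illustration): rank 0 is the root and
    rank r has children 2r+1 and 2r+2 whenever those ranks exist.
    """
    n = len(grads)
    buf = [list(g) for g in grads]  # copy so the caller's inputs stay intact

    # Phase 1 (reduce): visit ranks from the highest down to 1, so every
    # rank has already absorbed its children before it sends to its parent.
    for r in range(n - 1, 0, -1):
        parent = (r - 1) // 2
        buf[parent] = [a + b for a, b in zip(buf[parent], buf[r])]

    # Phase 2 (broadcast): visit ranks from 1 upward, so every parent holds
    # the final sum before its children copy it.
    for r in range(1, n):
        parent = (r - 1) // 2
        buf[r] = list(buf[parent])

    return buf


if __name__ == "__main__":
    # Eight simulated ranks, each holding a two-element gradient.
    per_rank = [[float(r), 1.0] for r in range(8)]
    print(tree_allreduce(per_rank)[0])
```

In this baseline the broadcast phase starts only after the reduce phase has finished for the whole buffer; the abstract's proposal concerns overlapping such phases, which this sketch deliberately does not attempt.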
Publisher
IEEE Computer Society
Issue Date
2023-02-27
Language
English
Citation

29th Annual IEEE International Symposium on High Performance Computer Architecture, HPCA 2023, pp.56 - 68

ISSN
1530-0897
DOI
10.1109/HPCA56546.2023.10071117
URI
http://hdl.handle.net/10203/315770
Appears in Collection
EE-Conference Papers