Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Cited 7 times in Web of Science; cited 0 times in Scopus
Accelerating neural network training is critical for exploring the design space of neural networks. Data parallelism is commonly used to accelerate training of Convolutional Neural Networks (CNNs), where the input batch is distributed across multiple workers; however, the communication time for weight gradients can limit scalability even for moderate batch sizes. In this work, we propose multi-dimensional parallel training (MPT) of convolution layers by exploiting both data parallelism and the intra-tile parallelism available in Winograd-transformed convolution. Workers are organized across two dimensions: one dimension exploits intra-tile parallelism while the other exploits data parallelism. MPT reduces the amount of communication necessary for weight gradients, since weight gradients are only communicated within the data-parallelism dimension. However, the Winograd transform fundamentally requires more data accesses, and the proposed MPT architecture also introduces a new type of communication, which we refer to as tile transfer: the gather/scatter of Winograd-domain feature maps (tiles). We propose a scalable near-data processing (NDP) architecture that minimizes the cost of data accesses through 3D stacked memory while leveraging a memory-centric network organization to provide high connectivity between the workers and accelerate tile transfer. To minimize the tile-gathering communication overhead, we exploit prediction of the activation of spatial-domain neurons in order to remove the communication of tiles that would be transformed into non-activated neurons. We also propose dynamic clustering of the memory-centric network architecture, which reconfigures the interconnect topology between the workers for each convolution layer to balance the communication required for weight gradients and tile transfer. Our evaluations show that the proposed MPT with the NDP architecture accelerates training by up to 2.7x and 21x compared to data-parallel training on the NDP architecture and on a multi-GPU system, respectively.
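To make the intra-tile parallelism concrete, below is a minimal NumPy sketch of a single Winograd-transformed convolution tile. It assumes the F(2x2, 3x3) Winograd variant with the standard Lavin & Gray transform matrices; the tile size, matrices, and variable names are illustrative choices, not the paper's NDP implementation. The point it demonstrates is that the Winograd-domain element-wise products are mutually independent, which is what allows MPT to partition them across workers along the intra-tile dimension while confining weight-gradient communication to the data-parallel dimension.

```python
# Minimal sketch (assumptions: NumPy, F(2x2, 3x3), Lavin & Gray matrices);
# illustrative only, not the paper's MPT/NDP implementation.
import numpy as np

# Winograd F(2x2, 3x3) transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)   # input transform (B^T)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)  # filter transform (G)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)    # output transform (A^T)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # one spatial-domain input tile
g = rng.standard_normal((3, 3))   # one 3x3 filter

U = G @ g @ G.T      # filter in the Winograd domain (4x4)
V = BT @ d @ BT.T    # input tile in the Winograd domain (4x4)

# Intra-tile parallelism: the 16 Winograd-domain positions are independent
# element-wise products, so workers along the intra-tile dimension can each
# own a disjoint subset of positions; the weight-gradient all-reduce is then
# needed only along the data-parallel dimension.
M = U * V

Y = AT @ M @ AT.T    # inverse transform: 2x2 spatial-domain output

# Sanity check against direct 3x3 correlation over the same 4x4 tile.
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref)
```

Gathering the per-worker subsets of M (and scattering V) corresponds to the tile transfer the abstract describes, which is why the memory-centric network and activation prediction target exactly this communication.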
Publisher
IEEE/ACM
Issue Date
2018-10-23
Language
English
Citation
The 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 682 - 695
DOI
10.1109/MICRO.2018.00061
URI
http://hdl.handle.net/10203/247303
Appears in Collection
EE-Conference Papers (Conference Papers)