Designing On-Chip Networks for Throughput Accelerators

DC Field | Value | Language
dc.contributor.author | Bakhoda, Ali | ko
dc.contributor.author | Kim, John Dongjun | ko
dc.contributor.author | Aamodt, Tor M. | ko
dc.date.accessioned | 2019-04-15T14:52:39Z | -
dc.date.available | 2019-04-15T14:52:39Z | -
dc.date.created | 2013-10-22 | -
dc.date.issued | 2013-09 | -
dc.identifier.citation | ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, v.10, no.3 | -
dc.identifier.issn | 1544-3566 | -
dc.identifier.uri | http://hdl.handle.net/10203/254488 | -
dc.description.abstract | As the number of cores and threads in throughput accelerators such as Graphics Processing Units (GPU) increases, so does the importance of on-chip interconnection network design. This article explores throughput-effective Network-on-Chips (NoC) for future compute accelerators that employ Bulk-Synchronous Parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is "throughput effective" if it improves parallel application-level performance per unit chip area. We evaluate performance of future looking workloads using detailed closed-loop simulations modeling compute nodes, NoC, and the DRAM memory system. We start from a mesh design with bisection bandwidth balanced to off-chip demand. Accelerator workloads tend to demand high off-chip memory bandwidth which results in a many-to-few traffic pattern when coupled with expected technology constraints of slow growth in pins-per-chip. Leveraging these observations we reduce NoC area by proposing a "checkerboard" NoC which alternates between conventional full routers and half routers with limited connectivity. Next, we show that increasing network terminal bandwidth at the nodes connected to DRAM controllers alleviates a significant fraction of the remaining imbalance resulting from the many-to-few traffic pattern. Furthermore, we propose a "double checkerboard inverted" NoC organization which takes advantage of channel slicing to reduce area while maintaining the performance improvements of the aforementioned techniques. This organization also has a simpler routing mechanism and improves average application throughput per unit area by 24.3%. | -
dc.language | English | -
dc.publisher | ASSOC COMPUTING MACHINERY | -
dc.subject | PROCESSOR | -
dc.subject | CMOS | -
dc.subject | ROUTER | -
dc.subject | MODEL | -
dc.title | Designing On-Chip Networks for Throughput Accelerators | -
dc.type | Article | -
dc.identifier.wosid | 000324488500012 | -
dc.identifier.scopusid | 2-s2.0-84884521459 | -
dc.type.rims | ART | -
dc.citation.volume | 10 | -
dc.citation.issue | 3 | -
dc.citation.publicationname | ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION | -
dc.identifier.doi | 10.1145/2512429 | -
dc.contributor.localauthor | Kim, John Dongjun | -
dc.contributor.nonIdAuthor | Bakhoda, Ali | -
dc.contributor.nonIdAuthor | Aamodt, Tor M. | -
dc.type.journalArticle | Article | -
dc.subject.keywordAuthor | Design | -
dc.subject.keywordAuthor | Performance | -
dc.subject.keywordAuthor | Bulk-synchronous parallel | -
dc.subject.keywordAuthor | throughput accelerator | -
dc.subject.keywordAuthor | GPGPU | -
dc.subject.keywordAuthor | NoC | -
dc.subject.keywordPlus | MEMORY MODEL | -
dc.subject.keywordPlus | PROCESSOR | -
dc.subject.keywordPlus | CMOS | -
dc.subject.keywordPlus | ROUTER | -
dc.subject.keywordPlus | CMPS | -
dc.subject.keywordPlus | FLOW | -
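
The abstract in this record defines a "throughput effective" optimization as one that improves application-level performance per unit chip area, and describes a "checkerboard" NoC that alternates full and half routers. The Python sketch below illustrates only those two ideas. It is not the authors' implementation: the function names, the parity rule that decides which mesh node gets which router type, and the idea of comparing two designs by a simple ratio are all assumptions made for illustration.

```python
from typing import List


def checkerboard_layout(rows: int, cols: int) -> List[List[str]]:
    """Mark each node of a rows x cols mesh as a full ('F') or half ('H') router.

    Assumption: router type alternates with the parity of (row + col),
    with node (0, 0) chosen arbitrarily as a full router.
    """
    return [
        ["F" if (r + c) % 2 == 0 else "H" for c in range(cols)]
        for r in range(rows)
    ]


def throughput_per_area(throughput: float, area: float) -> float:
    """Application-level throughput divided by chip area (hypothetical units)."""
    if area <= 0:
        raise ValueError("area must be positive")
    return throughput / area


def relative_gain(baseline: float, proposed: float) -> float:
    """Fractional improvement of the proposed metric over the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return proposed / baseline - 1.0
```

Under these assumptions, `checkerboard_layout(4, 4)` yields a grid in which no two half routers are horizontal or vertical neighbors, and `relative_gain` expresses a comparison such as the abstract's reported improvement as a fraction of the baseline value.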
Appears in Collection
EE-Journal Papers (Journal Papers)