DC Field | Value | Language |
---|---|---|
dc.contributor.author | Bakhoda, Ali | ko |
dc.contributor.author | Kim, John Dongjun | ko |
dc.contributor.author | Aamodt, Tor M. | ko |
dc.date.accessioned | 2019-04-15T14:52:39Z | - |
dc.date.available | 2019-04-15T14:52:39Z | - |
dc.date.created | 2013-10-22 | - |
dc.date.issued | 2013-09 | - |
dc.identifier.citation | ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, v.10, no.3 | - |
dc.identifier.issn | 1544-3566 | - |
dc.identifier.uri | http://hdl.handle.net/10203/254488 | - |
dc.description.abstract | As the number of cores and threads in throughput accelerators such as Graphics Processing Units (GPUs) increases, so does the importance of on-chip interconnection network design. This article explores throughput-effective Networks-on-Chip (NoCs) for future compute accelerators that employ Bulk-Synchronous Parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is "throughput effective" if it improves parallel application-level performance per unit chip area. We evaluate the performance of future-looking workloads using detailed closed-loop simulations modeling compute nodes, the NoC, and the DRAM memory system. We start from a mesh design with bisection bandwidth balanced to off-chip demand. Accelerator workloads tend to demand high off-chip memory bandwidth, which results in a many-to-few traffic pattern when coupled with the expected technology constraint of slow growth in pins per chip. Leveraging these observations, we reduce NoC area by proposing a "checkerboard" NoC, which alternates between conventional full routers and half routers with limited connectivity. Next, we show that increasing network terminal bandwidth at the nodes connected to DRAM controllers alleviates a significant fraction of the remaining imbalance resulting from the many-to-few traffic pattern. Furthermore, we propose a "double checkerboard inverted" NoC organization, which takes advantage of channel slicing to reduce area while maintaining the performance improvements of the aforementioned techniques. This organization also has a simpler routing mechanism and improves average application throughput per unit area by 24.3%. | - |
dc.language | English | - |
dc.publisher | ASSOC COMPUTING MACHINERY | - |
dc.subject | PROCESSOR | - |
dc.subject | CMOS | - |
dc.subject | ROUTER | - |
dc.subject | MODEL | - |
dc.title | Designing On-Chip Networks for Throughput Accelerators | - |
dc.type | Article | - |
dc.identifier.wosid | 000324488500012 | - |
dc.identifier.scopusid | 2-s2.0-84884521459 | - |
dc.type.rims | ART | - |
dc.citation.volume | 10 | - |
dc.citation.issue | 3 | - |
dc.citation.publicationname | ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION | - |
dc.identifier.doi | 10.1145/2512429 | - |
dc.contributor.localauthor | Kim, John Dongjun | - |
dc.contributor.nonIdAuthor | Bakhoda, Ali | - |
dc.contributor.nonIdAuthor | Aamodt, Tor M. | - |
dc.type.journalArticle | Article | - |
dc.subject.keywordAuthor | Design | - |
dc.subject.keywordAuthor | Performance | - |
dc.subject.keywordAuthor | Bulk-synchronous parallel | - |
dc.subject.keywordAuthor | throughput accelerator | - |
dc.subject.keywordAuthor | GPGPU | - |
dc.subject.keywordAuthor | NoC | - |
dc.subject.keywordPlus | MEMORY MODEL | - |
dc.subject.keywordPlus | PROCESSOR | - |
dc.subject.keywordPlus | CMOS | - |
dc.subject.keywordPlus | ROUTER | - |
dc.subject.keywordPlus | CMPS | - |
dc.subject.keywordPlus | FLOW | - |