Enabling Artificial Intelligence Supercomputers With Domain-Specific Networks

Abstract
Systems designed for artificial intelligence (AI) training and inference exhibit characteristics of both capacity and capability systems: they require tight coupling and strong scaling for model parallelism as well as weak scaling for data parallelism in distributed systems. In addition, managing enormous, 100-billion-parameter language models and trillion-token datasets introduces formidable computational challenges for today's supercomputing infrastructure. Communication and computation are intertwined aspects of parallel computing, and AI domain-specific supercomputers are no exception; this article explores the vital role of interconnection networks in large-scale systems. This work argues that domain-specific networks are a critical enabling technology for AI supercomputers. In particular, we advocate for flexible, low-latency interconnects capable of delivering high throughput across massive scales with tens of thousands of endpoints. Additionally, we stress the importance of reliability and resilience in handling long-duration training workloads and the demanding inference needs of domain-specific workloads.
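
The scale figures in the abstract (a 100-billion-parameter model, a trillion-token dataset, tens of thousands of endpoints) can be made concrete with a back-of-envelope estimate. The Python sketch below is not taken from the article; it uses the common "6 x parameters x tokens" training-FLOPs approximation and a ring all-reduce traffic estimate, and the endpoint count, gradient precision, and sustained-throughput constants are assumptions chosen only for illustration.

    # Back-of-envelope sketch (not from the article): rough scale of the training
    # compute and gradient-synchronization traffic implied by the abstract's figures.
    # Constants marked "assumption" are illustrative, not values from the paper.

    PARAMS = 100e9              # 100-billion-parameter model (from the abstract)
    TOKENS = 1e12               # trillion-token dataset (from the abstract)
    ENDPOINTS = 32_768          # "tens of thousands of endpoints" (assumption: 32K)
    BYTES_PER_GRADIENT = 2      # assumption: 16-bit gradients
    SUSTAINED_FLOP_RATE = 1e18  # assumption: 1 exaFLOP/s aggregate sustained

    # Common approximation: training cost ~ 6 * parameters * tokens FLOPs.
    train_flops = 6 * PARAMS * TOKENS
    train_days = train_flops / SUSTAINED_FLOP_RATE / 86_400

    # Ring all-reduce moves ~2 * (p - 1) / p of the gradient volume per endpoint
    # each time data-parallel replicas synchronize.
    allreduce_gb = 2 * (ENDPOINTS - 1) / ENDPOINTS * PARAMS * BYTES_PER_GRADIENT / 1e9

    print(f"Training compute: {train_flops:.1e} FLOPs "
          f"(~{train_days:.0f} days at {SUSTAINED_FLOP_RATE:.0e} FLOP/s sustained)")
    print(f"All-reduce traffic per endpoint per step: ~{allreduce_gb:.0f} GB")

Even under these assumptions, a single data-parallel gradient synchronization moves hundreds of gigabytes per endpoint and the full run occupies the machine for days, which is why the abstract ties low-latency, high-throughput, resilient interconnects to feasible training at this scale.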
Publisher
IEEE COMPUTER SOC
Issue Date
2024-03
Language
English
Article Type
Article
Citation

IEEE MICRO, v.44, no.2, pp.41 - 49

ISSN
0272-1732
DOI
10.1109/MM.2023.3330079
URI
http://hdl.handle.net/10203/322532
Appears in Collection
EE-Journal Papers(저널논문)
Files in This Item
There are no files associated with this item.