DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kwon, Miryeong | ko |
dc.contributor.author | Jang, Junhyeok | ko |
dc.contributor.author | Choi, Hanjin | ko |
dc.contributor.author | Lee, Sangwon | ko |
dc.contributor.author | Jung, Myoungsoo | ko |
dc.date.accessioned | 2023-05-10T13:00:20Z | - |
dc.date.available | 2023-05-10T13:00:20Z | - |
dc.date.created | 2023-04-05 | - |
dc.date.issued | 2023-02-26 | - |
dc.identifier.citation | Heterogeneous and Composable Memory Workshop at HPCA, 2023 | - |
dc.identifier.uri | http://hdl.handle.net/10203/306698 | - |
dc.description.abstract | Deep learning-based recommendation systems are resource-intensive and require large amounts of memory space to achieve high accuracy. To meet these demands, hyperscalers have scaled up their recommendation models (RMs) to consume tens of terabytes of memory space. Additionally, these models must be fault-tolerant and trained for long periods without accuracy degradation. In this talk, we present TrainingCXL, an innovative solution that leverages CXL 3.0 to efficiently process large-scale RMs in disaggregated memory while ensuring training is failure-tolerant with low overhead. By integrating persistent memory (PMEM) and GPU as Type-2 devices in a cache-coherent domain, we enable direct access to PMEM without software intervention. TrainingCXL employs computing and checkpointing logic near the CXL controller to manage persistency actively and efficiently. To ensure fault tolerance, we use the unique characteristics of RMs to take checkpointing off the critical path of their training. We also employ an advanced checkpointing technique that relaxes the updating sequence of embeddings across training batches. The evaluation shows that TrainingCXL achieves significant performance improvements, including a 5.2x speedup and 72.6% energy savings compared to modern PMEM-based recommendation systems. | - |
dc.language | English | - |
dc.publisher | IEEE | - |
dc.title | Training Resilience with Persistent Memory Pooling using CXL Technology | - |
dc.type | Conference | - |
dc.type.rims | CONF | - |
dc.citation.publicationname | Heterogeneous and Composable Memory Workshop at HPCA, 2023 | - |
dc.identifier.conferencecountry | CA | - |
dc.identifier.conferencelocation | Montreal, QC | - |
dc.contributor.localauthor | Jung, Myoungsoo | - |
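The abstract's key fault-tolerance idea (persisting only the embedding rows each batch touches, off the training critical path, with a relaxed ordering across batches) can be sketched as follows. This is an illustrative sketch only, not TrainingCXL's actual near-CXL hardware logic; the `RelaxedCheckpointer` class, its method names, and the in-memory `persisted` dictionary (standing in for PMEM-backed storage) are all hypothetical.

```python
import copy
import queue
import threading

class RelaxedCheckpointer:
    """Illustrative sketch of relaxed, off-critical-path checkpointing.

    Exploits the RM characteristic that each training batch updates only
    a small subset of embedding rows: the trainer hands off just those
    rows and continues, while a background worker persists them.
    """

    def __init__(self):
        self.pending = queue.Queue()
        self.persisted = {}  # hypothetical stand-in for PMEM-backed storage
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def after_batch(self, embeddings, touched_rows):
        # Snapshot only the rows this batch updated; training resumes
        # immediately while persistence proceeds in the background.
        delta = {r: copy.deepcopy(embeddings[r]) for r in touched_rows}
        self.pending.put(delta)

    def _drain(self):
        while True:
            delta = self.pending.get()
            # Relaxed ordering: deltas from successive batches need not be
            # persisted in lockstep with training; each row just eventually
            # reflects its latest handed-off update.
            self.persisted.update(delta)
            self.pending.task_done()
```

A caller would invoke `after_batch` once per training batch with the rows that batch touched, and only block (e.g. via `pending.join()`) when a consistent on-PMEM snapshot is explicitly required.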