Training Resilience with Persistent Memory Pooling using CXL Technology

Deep learning-based recommendation models (RMs) are resource-intensive and require large amounts of memory to achieve high accuracy. To meet these demands, hyperscalers have scaled their RMs up to tens of terabytes of memory. These models must also be fault-tolerant and able to train for long periods without accuracy degradation. In this talk, we present TrainingCXL, a solution that leverages CXL 3.0 to efficiently process large-scale RMs in disaggregated memory while making training failure-tolerant with low overhead. By integrating persistent memory (PMEM) and the GPU as Type-2 devices in a single cache-coherent domain, we enable direct access to PMEM without software intervention. TrainingCXL places computing and checkpointing logic near the CXL controller to manage persistency actively and efficiently. For fault tolerance, we exploit the unique characteristics of RMs to take checkpointing off the critical path of training, and we employ an advanced checkpointing technique that relaxes the update sequence of embeddings across training batches. Our evaluation shows that TrainingCXL achieves a 5.2x speedup and 72.6% energy savings compared to modern PMEM-based recommendation systems.
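The checkpointing idea described in the abstract, persisting only recently updated embedding rows and deferring that work off the training critical path, can be illustrated with a small host-side sketch. The code below is not the paper's implementation: the names (EmbeddingCheckpointer, record_update, maybe_checkpoint) are hypothetical, a background thread stands in for the near-CXL-controller checkpointing logic, and an in-memory dictionary stands in for the PMEM-resident copy.

```python
# Illustrative sketch only: a host-side analogue of off-critical-path,
# batch-relaxed embedding checkpointing. In TrainingCXL this logic sits
# near the CXL controller and writes to PMEM over the cache-coherent path.
import queue
import threading
import numpy as np


class EmbeddingCheckpointer:
    def __init__(self, table: np.ndarray, relax_batches: int = 4):
        self.table = table                  # in-memory embedding table
        self.relax_batches = relax_batches  # batches to accumulate before flushing
        self.dirty_rows = set()             # rows updated since the last flush
        self.pending = queue.Queue()        # snapshots awaiting persistence
        self.persisted = {}                 # stand-in for the PMEM-resident copy
        self._worker = threading.Thread(target=self._flush_worker, daemon=True)
        self._worker.start()

    def record_update(self, row_ids):
        # Training-path cost is minimal: just remember which rows changed.
        self.dirty_rows.update(int(r) for r in row_ids)

    def maybe_checkpoint(self, batch_idx: int):
        # Relaxed policy: snapshot dirty rows only every `relax_batches`
        # batches and hand them to the background worker, so persistence
        # latency never blocks the training loop.
        if (batch_idx + 1) % self.relax_batches != 0 or not self.dirty_rows:
            return
        snapshot = {r: self.table[r].copy() for r in self.dirty_rows}
        self.dirty_rows.clear()
        self.pending.put(snapshot)

    def _flush_worker(self):
        while True:
            snapshot = self.pending.get()
            # Durable write happens here, off the critical path.
            self.persisted.update(snapshot)
            self.pending.task_done()


if __name__ == "__main__":
    # Toy training loop: update a few embedding rows per batch and let the
    # checkpointer persist them lazily.
    rng = np.random.default_rng(0)
    table = rng.standard_normal((1000, 16)).astype(np.float32)
    ckpt = EmbeddingCheckpointer(table, relax_batches=4)

    for batch in range(16):
        rows = rng.integers(0, 1000, size=32)
        table[rows] += 0.01 * rng.standard_normal((32, 16)).astype(np.float32)
        ckpt.record_update(rows)
        ckpt.maybe_checkpoint(batch)

    ckpt.pending.join()
    print(f"embedding rows persisted so far: {len(ckpt.persisted)}")
```

The design point this sketch tries to capture is that embedding updates are sparse and tolerant of slightly stale checkpoints, so persistence can lag the optimizer by a few batches without harming recovery accuracy.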
Publisher
IEEE
Issue Date
2023-02-26
Language
English
Citation
Workshop for Heterogeneous and Composable Memory, HCM 2023
URI
http://hdl.handle.net/10203/311378
Appears in Collection
RIMS Conference Papers