Training Resilience with Persistent Memory Pooling using CXL Technology

Deep learning-based recommendation models (RMs) are resource-intensive and require large amounts of memory to achieve high accuracy. To meet these demands, hyperscalers have scaled their RMs up to tens of terabytes of memory. These models must also be fault-tolerant and able to train for long periods without accuracy degradation. In this talk, we present TrainingCXL, a solution that leverages CXL 3.0 to efficiently process large-scale RMs in disaggregated memory while making training failure-tolerant with low overhead. By integrating persistent memory (PMEM) and the GPU as Type-2 devices in a single cache-coherent domain, we enable direct access to PMEM without software intervention. TrainingCXL places computing and checkpointing logic near the CXL controller to manage persistency actively and efficiently. For fault tolerance, we exploit the unique characteristics of RMs to take checkpointing off the critical path of training, and we employ an advanced checkpointing technique that relaxes the update sequence of embeddings across training batches. Our evaluation shows that TrainingCXL achieves a 5.2x speedup and 72.6% energy savings compared to modern PMEM-based recommendation systems.
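The checkpointing idea described in the abstract, persisting only recently updated embedding rows and deferring that work off the training critical path, can be illustrated with a small host-side sketch. The code below is not the paper's implementation: the names (EmbeddingCheckpointer, record_update, maybe_checkpoint) are hypothetical, a background thread stands in for the near-CXL-controller checkpointing logic, and an in-memory dictionary stands in for the PMEM-resident copy.

```python
# Illustrative sketch only: a host-side analogue of off-critical-path,
# batch-relaxed embedding checkpointing. In TrainingCXL this logic sits
# near the CXL controller and writes to PMEM over the cache-coherent path.
import queue
import threading
import numpy as np


class EmbeddingCheckpointer:
    def __init__(self, table: np.ndarray, relax_batches: int = 4):
        self.table = table                  # in-memory embedding table
        self.relax_batches = relax_batches  # batches to accumulate before flushing
        self.dirty_rows = set()             # rows updated since the last flush
        self.pending = queue.Queue()        # snapshots awaiting persistence
        self.persisted = {}                 # stand-in for the PMEM-resident copy
        self._worker = threading.Thread(target=self._flush_worker, daemon=True)
        self._worker.start()

    def record_update(self, row_ids):
        # Training-path cost is minimal: just remember which rows changed.
        self.dirty_rows.update(int(r) for r in row_ids)

    def maybe_checkpoint(self, batch_idx: int):
        # Relaxed policy: snapshot dirty rows only every `relax_batches`
        # batches and hand them to the background worker, so persistence
        # latency never blocks the training loop.
        if (batch_idx + 1) % self.relax_batches != 0 or not self.dirty_rows:
            return
        snapshot = {r: self.table[r].copy() for r in self.dirty_rows}
        self.dirty_rows.clear()
        self.pending.put(snapshot)

    def _flush_worker(self):
        while True:
            snapshot = self.pending.get()
            # Durable write happens here, off the critical path.
            self.persisted.update(snapshot)
            self.pending.task_done()


if __name__ == "__main__":
    # Toy training loop: update a few embedding rows per batch and let the
    # checkpointer persist them lazily.
    rng = np.random.default_rng(0)
    table = rng.standard_normal((1000, 16)).astype(np.float32)
    ckpt = EmbeddingCheckpointer(table, relax_batches=4)

    for batch in range(16):
        rows = rng.integers(0, 1000, size=32)
        table[rows] += 0.01 * rng.standard_normal((32, 16)).astype(np.float32)
        ckpt.record_update(rows)
        ckpt.maybe_checkpoint(batch)

    ckpt.pending.join()
    print(f"embedding rows persisted so far: {len(ckpt.persisted)}")
```

The design point this sketch tries to capture is that embedding updates are sparse and tolerant of slightly stale checkpoints, so persistence can lag the optimizer by a few batches without harming recovery accuracy.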
Publisher
IEEE
Issue Date
2023-02-26
Language
English
Citation
Workshop for Heterogeneous and Composable Memory, HCM 2023
URI
http://hdl.handle.net/10203/311378
Appears in Collection
RIMS Conference Papers