Deep learning-based recommendation models (RMs) are resource-intensive and require large amounts of memory to achieve high accuracy. To meet these demands, hyperscalers have scaled up their RMs to consume tens of terabytes of memory. These models must also be fault-tolerant, since they are trained for long periods without accuracy degradation. In this talk, we present TrainingCXL, an innovative solution that leverages CXL 3.0 to efficiently process large-scale RMs in disaggregated memory while making training fault-tolerant with low overhead. By integrating persistent memory (PMEM) and GPUs as CXL Type 2 devices in a single cache-coherent domain, we enable direct access to PMEM without software intervention. TrainingCXL places computing and checkpointing logic near the CXL controller to manage persistence actively and efficiently. For fault tolerance, we exploit the unique characteristics of RMs to take checkpointing off the critical path of training, and we employ an advanced checkpointing technique that relaxes the update sequence of embeddings across training batches. Our evaluation shows that TrainingCXL achieves significant performance improvements, including a 5.2x speedup and 72.6% energy savings over modern PMEM-based recommendation systems.
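To illustrate the idea of taking checkpointing off the training critical path, here is a minimal sketch in plain Python. It is a hypothetical toy, not TrainingCXL's implementation: the trainer merely enqueues the indices of embeddings touched by each batch, and a background thread persists them lazily, so checkpointed values may reflect a later batch than the one that enqueued them (a relaxed update sequence across batches).

```python
import threading
import queue

# Toy embedding table and its last persisted snapshot.
embeddings = {i: float(i) for i in range(8)}
checkpoint = {}
dirty = queue.Queue()  # indices awaiting persistence (hypothetical mechanism)

def train_batch(batch_id, touched):
    """Apply a toy update to the touched embeddings, then hand the
    indices to the checkpointer without blocking training."""
    for i in touched:
        embeddings[i] += 0.1 * batch_id
    dirty.put(list(touched))  # training proceeds; persistence is async

def checkpointer():
    """Background thread: persist touched embeddings at its own pace.
    It may capture a value updated by a later batch -- the relaxation."""
    while True:
        touched = dirty.get()
        if touched is None:  # shutdown sentinel
            break
        for i in touched:
            checkpoint[i] = embeddings[i]

ck = threading.Thread(target=checkpointer)
ck.start()
for batch_id, touched in enumerate([(0, 1), (1, 2, 3), (5,)], start=1):
    train_batch(batch_id, touched)
dirty.put(None)
ck.join()
print(sorted(checkpoint))  # only embeddings touched by some batch are persisted
```

In the real system this persistence logic sits near the CXL controller rather than in a host thread, but the sketch captures why the trainer never stalls on checkpoint I/O.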