Failure Tolerant Training with Persistent Memory Disaggregation over CXL

Kwon, Miryeong / Jang, Junhyeok / Choi, Hanjin / Lee, Sangwon / Jung, Myoungsoo

This article proposes TrainingCXL that can efficiently process large-scale recommendation datasets in the pool of disaggregated memory while making training fault tolerant with low overhead. To this end, we integrate persistent memory (PMEM) and graphics processing unit (GPU) into a cache-coherent domain as type 2. Enabling Compute Express Link (CXL) allows PMEM to be directly placed in GPU's memory hierarchy, such that GPU can access PMEM without software intervention. TrainingCXL introduces computing and checkpointing logic near the CXL controller, thereby training data and managing persistency in an active manner. Considering PMEM's vulnerability, we utilize the unique characteristics of recommendation models and take the checkpointing overhead off the critical path of their training. Finally, TrainingCXL employs an advanced checkpointing technique that relaxes the updating sequence of model parameters and embeddings across training batches. The evaluation shows that TrainingCXL achieves 5.2× training performance improvement and 76% energy savings, compared to the modern PMEM-based recommendation systems. © 1981-2012 IEEE.

Publisher: IEEE COMPUTER SOC

Issue Date: 2023-03

Language: English

Article Type: Article

Citation: IEEE MICRO, v.43, no.2, pp.66 - 75

ISSN: 0272-1732

DOI: 10.1109/MM.2023.3237548

URI: http://hdl.handle.net/10203/305853

Appears in Collection: EE-Journal Papers(저널논문)

Files in This Item: There are no files associated with this item.


Knowledge Service Development Team, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. T. 82-42-350-4493 Email. koasas@kaist.ac.kr
Copyright © 2016. Korea Advanced Institute of Science and Technology. All Rights Reserved.
