SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

Jan 1, 2026
Jin Lee, Zhonghao Chen, Xuhang He, Robert Underwood, Bogdan Nicolae, Franck Cappello, Xiaoyi Lu, Sheng Di, Zheng Zhang
Publication: In Proceedings of the 43rd International Conference on Machine Learning

SPARe studies fault-tolerant LLM pretraining at extreme scale, combining stacked parallelism with adaptive reordering to improve resilience and efficiency for 100k+ GPU systems.