SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs
Jan 1, 2026
Jin Lee
Zhonghao Chen
Xuhang He
Robert Underwood
Bogdan Nicolae
Franck Cappello
Xiaoyi Lu
Sheng Di
Zheng Zhang
Type
Publication
In Proceedings of the 43rd International Conference on Machine Learning
SPARe studies fault-tolerant LLM pretraining at extreme scale, combining stacked parallelism with adaptive reordering to improve resilience and efficiency for 100k+ GPU systems.