SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

Jan 1, 2026

Jin Lee

Zhonghao Chen

Xuhang He

Robert Underwood

Bogdan Nicolae

Franck Cappello

Xiaoyi Lu

Sheng Di

Zheng Zhang

Type

Conference paper

Publication

In Proceedings of the 43rd International Conference on Machine Learning

SPARe studies fault-tolerant LLM pretraining at extreme scale, combining stacked parallelism with adaptive reordering to improve resilience and efficiency for 100k+ GPU systems.

Last updated on Jan 1, 2026

High-Performance Computing, Large Language Models, Fault Tolerance, HPC-AI Convergence

Authors

Zhonghao Chen

Ph.D. Student in ECE
