SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

Thu, 01 Jan 2026 00:00:00 +0000

SPARe studies fault-tolerant LLM pretraining at extreme scale, combining stacked parallelism with adaptive reordering to improve resilience and efficiency for 100k+ GPU systems.

Fault Tolerance | Zhonghao Chen
