Fault Tolerance

A fault-tolerant LLM pretraining system for clusters of 100k+ GPUs, using stacked parallelism and adaptive reordering.

Jan 1, 2026