Large Language Models | Zhonghao Chen

SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

Thu, 01 Jan 2026 00:00:00 +0000

SPARe studies fault-tolerant LLM pretraining at extreme scale, combining stacked parallelism with adaptive reordering to improve resilience and efficiency for 100k+ GPU systems.

HPC-AI Convergence

Wed, 01 Jan 2025 00:00:00 +0000

This project targets HPC-AI convergence for efficient large-scale machine learning, spanning scheduling, optimization, performance characterization, and fault-tolerant training systems. The project includes:

HPC-R1, a characterization of inference and distillation performance for large reasoning models on HPC-scale GPU clusters and interconnects.
SPARe, a fault-tolerant LLM pretraining system for 100k+ GPU scale using stacked parallelism and adaptive reordering.

Related publications:

HPC-R1: Characterizing R1-like Large Reasoning Models on HPC

Wed, 01 Jan 2025 00:00:00 +0000

HPC-R1 characterizes inference and distillation performance of R1-like reasoning models on HPC platforms, identifying system bottlenecks and scalable deployment strategies.