HPC-AI Convergence

Jan 1, 2025

This project studies HPC-AI convergence for efficient large-scale machine learning, covering scheduling, optimization, performance characterization, and fault-tolerant training systems. It includes:

  • HPC-R1, a characterization of inference and distillation performance for large reasoning models on HPC-scale GPU clusters and interconnects.
  • SPARe, a fault-tolerant LLM pretraining system for training at 100k+ GPU scale using stacked parallelism and adaptive reordering.

Related publications: