Distributed Systems | Zhonghao Chen

FedMECA: Scalable Federated Learning via Memory-Efficient and Concurrent Aggregation

Thu, 01 Jan 2026 00:00:00 +0000

FedMECA improves the scalability of federated learning by making aggregation more memory-efficient and concurrent, targeting complex FL workflows with large models and many clients.

HPC-AI Convergence

Wed, 01 Jan 2025 00:00:00 +0000

This project targets HPC-AI convergence for efficient large-scale machine learning, including scheduling, optimization, characterization, and fault-tolerant training systems. The project includes:

HPC-R1, a characterization of inference and distillation performance for large reasoning models on HPC-scale GPU clusters and interconnects.
SPARe, a fault-tolerant LLM pretraining system for the scale of 100k+ GPUs, using stacked parallelism and adaptive reordering.

Scalable, Resilient Federated Learning

Wed, 01 Jan 2025 00:00:00 +0000

SRFL targets scalable and resilient federated learning systems across heterogeneous compute and network environments. The project includes:

FedDES, a discrete-event-based performance simulation framework for federated learning systems.
FedMECA, a memory-efficient and concurrent aggregation approach for scalable federated learning.
Long-haul RDMA studies for geo-distributed federated learning, including simulation, modeling, and real-world testbed validation.

Related publications: