Projects

Scalable, Resilient Federated Learning

SRFL targets scalable and resilient federated learning systems across heterogeneous compute and network environments. The project includes FedDES, a discrete-event-based performance simulation framework for federated learning systems; FedMECA, a memory-efficient and concurrent aggregation approach for scalable federated learning; and long-haul RDMA studies for geo-distributed federated learning, covering simulation, modeling, and real-world testbed validation.

Jan 1, 2025

Long-Haul RDMA

This project investigates long-haul RDMA for geo-distributed machine learning systems. It includes characterization, modeling, and verbs-level emulation of long-haul RDMA behavior, as well as an evaluation of whether long-haul RDMA can improve geo-distributed federated learning, using both simulation and validation on a real-world testbed.

Jan 1, 2025

HPC-AI Convergence

This project targets HPC-AI convergence for efficient large-scale machine learning, including scheduling, optimization, characterization, and fault-tolerant training systems. The project includes HPC-R1, a characterization of inference and distillation performance for large reasoning models on HPC-scale GPU clusters and interconnects; and SPARe, a fault-tolerant LLM pretraining system for 100k+ GPU scale that uses stacked parallelism and adaptive reordering.

Jan 1, 2025

Edge–Cloud Scheduling for Neuro-Electrophysiological Signals

This project covers the analysis and application of deep learning techniques for EEG-based brain-computer interfaces (BCIs), together with algorithm and system design for deep-learning-based edge–cloud scheduling targeting neuro-electrophysiological signal workloads.

Jun 1, 2023