Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale

Conference
Authors

Weihao Cui, Ji Zhang, Han Zhao, Chao Liu, Wenhao Zhang, Jian Sha, Bingsheng He, Minyi Guo, Quan Chen.

Published

4 May 2026

Publication details

NSDI

Links