XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale

Authors

Weihao Cui*, Ji Zhang*, Han Zhao, Chao Liu, Wenhao Zhang, Jian Sha, Quan Chen, Bingsheng He, Minyi Guo.

Published

8 February 2025

Publication details

Working paper

Links