Publications

* indicates co-first author, † indicates corresponding author


2026

  1. NSDI ’26
    Flare: Anomaly diagnostics for divergent LLM training in GPU clusters of thousand-plus scale
    Weihao Cui, Ji Zhang, Han Zhao, Chao Liu, Wenhao Zhang, Jian Sha, Bingsheng He, Minyi Guo, and Quan Chen
    In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026
  2. NSDI ’26
    MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing
    Chunyu Xue, Yi Pan, Weihao Cui, Quan Chen, Shulai Zhang, Bingsheng He, and Minyi Guo
    In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026
  3. HPCA ’26
    LEGO: Supporting LLM-enhanced Games with One Gaming GPU
    Han Zhao*, Weihao Cui*, Zeshen Zhang, Wenhao Zhang, Jiangtong Li, Quan Chen, Pu Pang, Zijun Li, Zhenhua Han, Yuqing Yang, and 1 more author
    In 2026 IEEE International Symposium on High Performance Computer Architecture, 2026

2025

  1. arXiv
    Optimizing SLO-oriented LLM Serving with PD-Multiplexing
    Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Yangjie Zhou, Shixuan Sun, and Minyi Guo
    arXiv preprint arXiv:2504.14489, 2025
  2. arXiv
    Efficient Function-as-a-Service for Large Language Models with TIDAL
    Weihao Cui, Ziyi Xu, Han Zhao, Quan Chen, Zijun Li, Bingsheng He, and Minyi Guo
    arXiv preprint arXiv:2503.06421, 2025
  3. arXiv
    Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution
    Shulai Zhang, Ao Xu, Quan Chen, Han Zhao, Weihao Cui, Ningxin Zheng, Haibin Lin, Xin Liu, and Minyi Guo
    arXiv preprint arXiv:2509.09560, 2025
  4. arXiv
    Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms
    Ao Xu, Han Zhao, Weihao Cui, Quan Chen, Yukang Chen, Shulai Zhang, Shuang Chen, Jiemin Jiang, Zhibin Yu, and Minyi Guo
    arXiv preprint, 2025
  5. ASPLOS ’25
    Voyager: Input-Adaptive Algebraic Transformations for High-Performance Graph Neural Networks
    Yangjie Zhou, Wenting Shen, Jingwen Leng, Shuwen Lu, Zihan Liu, Weihao Cui, Zhendong Zhang, Wencong Xiao, Baole Ai, Yong Li, and 6 more authors
    In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2025
  6. SC ’25
    A Sample-Free Compilation Framework for Efficient Dynamic Tensor Computation
    Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Peng Chen, Mohamed Wahib, Cong Guo, Siyuan Feng, Jintao Meng, and 6 more authors
    In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025
  7. MLSys ’25
    Comet: Fine-grained computation-communication overlapping for mixture-of-experts
    Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, and 2 more authors
    In Proceedings of the 8th Annual Conference on Machine Learning and Systems (MLSys), 2025
  8. EuroSys ’25
    Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing
    Shulai Zhang, Quan Chen, Weihao Cui, Han Zhao, Chunyu Xue, Zhen Zheng, Wei Lin, and Minyi Guo
    In Proceedings of the Twentieth European Conference on Computer Systems, 2025
  9. ATC ’25
    Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception
    Shulai Zhang, Ao Xu, Quan Chen, Han Zhao, Weihao Cui, Zhen Wang, Yan Li, Limin Xiao, and Minyi Guo
    In 2025 USENIX Annual Technical Conference, 2025
  10. HPCA ’25
    VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
    Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, and 4 more authors
    In 2025 IEEE International Symposium on High Performance Computer Architecture, 2025
  11. TACO ’25
    EDAS: Enabling Fast Data Loading for GPU Serverless Computing
    Han Zhao*, Weihao Cui*, Quan Chen, Zijun Li, Zhenhua Han, Nan Wang, Yu Feng, Jieru Zhao, Chen Chen, Jingwen Leng, and 1 more author
    ACM Transactions on Architecture and Code Optimization, 2025
  12. TACO ’25
    Taming Flexible Job Packing in Deep Learning Training Clusters
    Pengyu Yang*, Weihao Cui*, Chunyu Xue, Han Zhao, Chen Chen, Quan Chen, Jing Yang, and Minyi Guo
    ACM Transactions on Architecture and Code Optimization, 2025
  13. TACO ’25
    ARACHNE: Optimizing distributed parallel applications with reduced inter-process communication
    Yifu He, Han Zhao, Weihao Cui, Shulai Zhang, Quan Chen, and Minyi Guo
    ACM Transactions on Architecture and Code Optimization, 2025
  14. TACO ’25
    Ares: Fair and Efficient Scheduling of Deep Learning Jobs with Elastic Fair Queuing
    Yifei Liu, Chen Chen, Qiang Wang, Yu Feng, Weihao Cui, Quan Chen, and Minyi Guo
    ACM Transactions on Architecture and Code Optimization, 2025
  15. APPT ’25
    DACO: Unlocking Latent Dataflow Opportunities in Edge-Side SIMT Accelerators
    Han Zhao, Yiying Xiang, Yu Liu, Xiaochun Ye, Deze Zeng, Jing Yang, Weihao Cui, Quan Chen, Jingwen Leng, and Minyi Guo
    In International Symposium on Advanced Parallel Processing Technologies, 2025

2024

  1. arXiv
    A codesign of scheduling and parallelization for large model training in heterogeneous clusters
    Chunyu Xue, Weihao Cui, Han Zhao, Quan Chen, Shulai Zhang, Pengyu Yang, Jing Yang, Shaobo Li, and Minyi Guo
    arXiv preprint arXiv:2403.16125, 2024
  2. TC ’24
    Accelerating sparse DNNs based on tiled GEMM
    Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, and Minyi Guo
    IEEE Transactions on Computers, 2024
  3. arXiv
    The CAP principle for LLM serving: A survey of long-context large language model serving
    Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, and Yizhou Shan
    arXiv preprint arXiv:2405.11299, 2024
  4. TC ’24
    Adaptive Kernel Fusion for Improving the GPU Utilization While Ensuring QoS
    Han Zhao, Junxiao Deng, Weihao Cui, Quan Chen, Youtao Zhang, Deze Zeng, and Minyi Guo
    IEEE Transactions on Computers, 2024

2023

  1. OSDI ’23
    Optimizing dynamic neural networks with Brainstorm
    Weihao Cui, Zhenhua Han, Lingji Ouyang, Yichuan Wang, Ningxin Zheng, Lingxiao Ma, Yuqing Yang, Fan Yang, Jilong Xue, Lili Qiu, and 4 more authors
    In 17th USENIX Symposium on Operating Systems Design and Implementation, 2023
  2. TC ’23
    Improving Cluster Utilization Through Adaptive Resource Management for Deep Neural Network and CPU Jobs Colocation
    Han Zhao, Weihao Cui, Quan Chen, Jingwen Leng, Deze Zeng, and Minyi Guo
    IEEE Transactions on Computers, 2023
  3. SoCC ’23
    Maximizing the utilization of GPUs used by cloud gaming through adaptive co-location with Combo
    Binghao Chen, Han Zhao, Weihao Cui, Yifu He, Shulai Zhang, Quan Chen, Zijun Li, and Minyi Guo
    In Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023
  4. ICPADS ’23
    Microless: Cost-efficient hybrid deployment of microservices on IaaS VMs and serverless
    Jiagan Cheng, Yilong Zhao, Zijun Li, Quan Chen, Weihao Cui, and Minyi Guo
    In 2023 IEEE 29th International Conference on Parallel and Distributed Systems, 2023
  5. CF ’23
    AdaptGear: Accelerating GNN Training via Adaptive Subgraph-Level Kernels on GPUs
    Yangjie Zhou, Yaoxu Song, Jingwen Leng, Zihan Liu, Weihao Cui, Zhendong Zhang, Cong Guo, Quan Chen, Li Li, and Minyi Guo
    In Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023

2022

  1. ATC ’22
    DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for efficient processing of DNN services on GPUs
    Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo
    In 2022 USENIX Annual Technical Conference, 2022
  2. HPCA ’22
    Tacker: Tensor-CUDA core kernel fusion for improving the GPU utilization while ensuring QoS
    Han Zhao, Weihao Cui, Quan Chen, Youtao Zhang, Yanchao Lu, Chao Li, Jingwen Leng, and Minyi Guo
    In 2022 IEEE International Symposium on High-Performance Computer Architecture, 2022
  3. TC ’22
    ISPA: Exploiting intra-SM parallelism in GPUs via fine-grained resource management
    Han Zhao, Weihao Cui, Quan Chen, and Minyi Guo
    IEEE Transactions on Computers, 2022
  4. ICS ’22
    PAME: Precision-aware multi-exit DNN serving for reducing latencies of batched inferences
    Shulai Zhang, Weihao Cui, Quan Chen, Zhengnian Zhang, Yue Guan, Jingwen Leng, Chao Li, and Minyi Guo
    In Proceedings of the 36th ACM International Conference on Supercomputing, 2022

2021

  1. TC ’21
    Toward QoS-awareness and improved utilization of spatial multitasking GPUs
    Wei Zhang, Quan Chen, Ningxin Zheng, Weihao Cui, Kaihua Fu, and Minyi Guo
    IEEE Transactions on Computers, 2021
  2. ICCD ’21
    Exploiting intra-SM parallelism in GPUs via persistent and elastic blocks
    Han Zhao, Weihao Cui, Quan Chen, Jieru Zhao, Jingwen Leng, and Minyi Guo
    In 2021 IEEE 39th International Conference on Computer Design, 2021
  3. SC ’21
    Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction
    Weihao Cui, Han Zhao, Quan Chen, Ningxin Zheng, Jingwen Leng, Jieru Zhao, Zhuo Song, Tao Ma, Yong Yang, Chao Li, and 1 more author
    In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021

2020

  1. TPDS ’20
    E²bird: Enhanced elastic batch for improving responsiveness and throughput of deep learning services
    Weihao Cui, Quan Chen, Han Zhao, Mengze Wei, Xiaoxin Tang, and Minyi Guo
    IEEE Transactions on Parallel and Distributed Systems, 2020
  2. ICDCS ’20
    CODA: Improving resource utilization by slimming and co-locating DNN and CPU jobs
    Han Zhao, Weihao Cui, Quan Chen, Jingwen Leng, Kai Yu, Deze Zeng, Chao Li, and Minyi Guo
    In 2020 IEEE 40th International Conference on Distributed Computing Systems, 2020

2019

  1. ICCD ’19
    Ebird: Elastic batch for improving responsiveness and throughput of deep learning services
    Weihao Cui, Mengze Wei, Quan Chen, Xiaoxin Tang, Jingwen Leng, Li Li, and Minyi Guo
    In 2019 IEEE 37th International Conference on Computer Design, 2019
  2. ICS ’19
    Laius: Towards latency awareness and improved utilization of spatial multitasking accelerators in datacenters
    Wei Zhang, Weihao Cui, Kaihua Fu, Quan Chen, Daniel Edward Mawhirter, Bo Wu, Chao Li, and Minyi Guo
    In Proceedings of the ACM International Conference on Supercomputing, 2019