Publications
* indicates co-first author, † indicates corresponding author
2026
- [NSDI ’26] Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026.
- [NSDI ’26] MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026.
- [HPCA ’26] LEGO: Supporting LLM-Enhanced Games with One Gaming GPU. In 2026 IEEE International Symposium on High-Performance Computer Architecture, 2026.
2025
- [arXiv] Optimizing SLO-Oriented LLM Serving with PD-Multiplexing. arXiv preprint arXiv:2504.14489, 2025.
- [arXiv] Efficient Function-as-a-Service for Large Language Models with TIDAL. arXiv preprint arXiv:2503.06421, 2025.
- [arXiv] Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution. arXiv preprint arXiv:2509.09560, 2025.
- [arXiv] Harli: SLO-Aware Co-location of LLM Inference and PEFT-Based Finetuning on Model-as-a-Service Platforms. arXiv preprint, 2025.
- [ASPLOS ’25] Voyager: Input-Adaptive Algebraic Transformations for High-Performance Graph Neural Networks. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2025.
- [SC ’25] A Sample-Free Compilation Framework for Efficient Dynamic Tensor Computation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025.
- [MLSys ’25] Comet: Fine-Grained Computation-Communication Overlapping for Mixture-of-Experts. In Proceedings of the 8th Annual Conference on Machine Learning and Systems (MLSys), 2025.
- [EuroSys ’25] Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing. In Proceedings of the Twentieth European Conference on Computer Systems, 2025.
- [ATC ’25] Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception. In 2025 USENIX Annual Technical Conference, 2025.
- [HPCA ’25] VQ-LLM: High-Performance Code Generation for Vector Quantization Augmented LLM Inference. In 2025 IEEE International Symposium on High-Performance Computer Architecture, 2025.
- [TACO ’25] EDAS: Enabling Fast Data Loading for GPU Serverless Computing. ACM Transactions on Architecture and Code Optimization, 2025.
- [TACO ’25] Taming Flexible Job Packing in Deep Learning Training Clusters. ACM Transactions on Architecture and Code Optimization, 2025.
- [TACO ’25] ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication. ACM Transactions on Architecture and Code Optimization, 2025.
- [TACO ’25] Ares: Fair and Efficient Scheduling of Deep Learning Jobs with Elastic Fair Queuing. ACM Transactions on Architecture and Code Optimization, 2025.
- [APPT ’25] DACO: Unlocking Latent Dataflow Opportunities in Edge-Side SIMT Accelerators. In International Symposium on Advanced Parallel Processing Technologies, 2025.
2024
- [arXiv] A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters. arXiv preprint arXiv:2403.16125, 2024.
- [TC ’24] Accelerating Sparse DNNs Based on Tiled GEMM. IEEE Transactions on Computers, 2024.
- [arXiv] The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving. arXiv preprint arXiv:2405.11299, 2024.
- [TC ’24] Adaptive Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. IEEE Transactions on Computers, 2024.
2023
- [OSDI ’23] Optimizing Dynamic Neural Networks with Brainstorm. In 17th USENIX Symposium on Operating Systems Design and Implementation, 2023.
- [TC ’23] Improving Cluster Utilization Through Adaptive Resource Management for Deep Neural Network and CPU Jobs Colocation. IEEE Transactions on Computers, 2023.
- [SoCC ’23] Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. In Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023.
- [ICPADS ’23] Microless: Cost-Efficient Hybrid Deployment of Microservices on IaaS VMs and Serverless. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems, 2023.
- [CF ’23] AdaptGear: Accelerating GNN Training via Adaptive Subgraph-Level Kernels on GPUs. In Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023.
2022
- [ATC ’22] DVABatch: Diversity-Aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference, 2022.
- [HPCA ’22] Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. In 2022 IEEE International Symposium on High-Performance Computer Architecture, 2022.
- [TC ’22] ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management. IEEE Transactions on Computers, 2022.
- [ICS ’22] PAME: Precision-Aware Multi-Exit DNN Serving for Reducing Latencies of Batched Inferences. In Proceedings of the 36th ACM International Conference on Supercomputing, 2022.
2021
- [TC ’21] Toward QoS-Awareness and Improved Utilization of Spatial Multitasking GPUs. IEEE Transactions on Computers, 2021.
- [ICCD ’21] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks. In 2021 IEEE 39th International Conference on Computer Design, 2021.
- [SC ’21] Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
2020
- [TPDS ’20] E²bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. IEEE Transactions on Parallel and Distributed Systems, 2020.
- [ICDCS ’20] CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs. In 2020 IEEE 40th International Conference on Distributed Computing Systems, 2020.
2019
- [ICCD ’19] Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. In 2019 IEEE 37th International Conference on Computer Design, 2019.
- [ICS ’19] Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters. In Proceedings of the ACM International Conference on Supercomputing, 2019.