* indicates co-first author, † indicates corresponding author
Full List of Papers
Preprint Papers
- Optimizing SLO-oriented LLM Serving with PD-Multiplexing. arXiv
- Efficient Function-as-a-Service for Large Language Models with TIDAL. arXiv
- XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale. arXiv
- Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization. arXiv
- The CAP Principle for LLM Serving. arXiv
- A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters. arXiv
Conference Papers
- Voyager: Input-Adaptive Algebraic Transformations for High-Performance Graph Neural Networks. (ASPLOS 2026). CCF-A
- Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception. (ATC 2025). CCF-A
- Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing. (EuroSys 2025). pdf, CCF-A
- VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference. (HPCA 2025). CCF-A
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. (MLSys 2025).
- Microless: Cost-Efficient Hybrid Deployment of Microservices on IaaS VMs and Serverless. (ICPADS 2023). pdf
- Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. (SoCC 2023). pdf, CCF-B
- Optimizing Dynamic Neural Networks with Brainstorm. (OSDI 2023). pdf, CCF-A
- AdaptGear: Accelerating GNN Training via Adaptive Subgraph-Level Kernels on GPUs. (CF 2023). pdf
- DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. (ATC 2022). pdf, CCF-A
- PAME: Precision-Aware Multi-Exit DNN Serving for Reducing Latencies of Batched Inferences. (ICS 2022). pdf, CCF-B
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS. (HPCA 2022). pdf, CCF-A
- Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction. (SC 2021). pdf, CCF-A
- Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks. (ICCD 2021). pdf, CCF-B
- CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs. (ICDCS 2020). pdf, CCF-B
- Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. (ICCD 2019). pdf, CCF-B
- Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters. (ICS 2019). pdf, CCF-B
Journal Papers
- Towards Fast Setup and High Throughput of GPU Serverless Computing. (TACO 2025). CCF-A
- ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication. (TACO 2025). CCF-A
- Taming Flexible Job Packing in Deep Learning Training Clusters. (TACO 2025). CCF-A
- Adaptive Kernel Fusion for Improving the GPU Utilization while Ensuring QoS. (TC 2024). pdf, CCF-A
- Accelerating Sparse DNNs based on Tiled Gemm. (TC 2024). pdf, CCF-A
- Improving Cluster Utilization through Adaptive Resource Management for DNN and CPU Jobs Co-location. (TC 2023). pdf, CCF-A
- ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-grained Resource Management. (TC 2022). pdf, CCF-A
- Toward QoS-Awareness and Improved Utilization of Spatial Multitasking GPUs. (TC 2021). pdf, CCF-A
- E2bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. (TPDS 2020). pdf, CCF-A