Publications
* indicates co-first author, † indicates corresponding author
2026
- [NSDI ’26] Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026.
- [NSDI ’26] MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation, 2026.
- [HPCA ’26] LEGO: Supporting LLM-Enhanced Games with One Gaming GPU. In 2026 IEEE International Symposium on High-Performance Computer Architecture, 2026.
2025
- [arXiv] Optimizing SLO-Oriented LLM Serving with PD-Multiplexing. arXiv preprint arXiv:2504.14489, 2025.
- [arXiv] Efficient Function-as-a-Service for Large Language Models with TIDAL. arXiv preprint arXiv:2503.06421, 2025.
- [arXiv] Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution. arXiv preprint arXiv:2509.09560, 2025.
- [arXiv] Harli: SLO-Aware Co-location of LLM Inference and PEFT-Based Finetuning on Model-as-a-Service Platforms. arXiv preprint, 2025.
- [ASPLOS ’25] Voyager: Input-Adaptive Algebraic Transformations for High-Performance Graph Neural Networks. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2025.
- [SC ’25] A Sample-Free Compilation Framework for Efficient Dynamic Tensor Computation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025.
- [MLSys ’25] Comet: Fine-Grained Computation-Communication Overlapping for Mixture-of-Experts. In Proceedings of the 8th Annual Conference on Machine Learning and Systems (MLSys), 2025.
- [EuroSys ’25] Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing. In Proceedings of the Twentieth European Conference on Computer Systems, 2025.
- [ATC ’25] Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception. In 2025 USENIX Annual Technical Conference, 2025.
- [HPCA ’25] VQ-LLM: High-Performance Code Generation for Vector Quantization Augmented LLM Inference. In 2025 IEEE International Symposium on High-Performance Computer Architecture, 2025.
- [TACO ’25] EDAS: Enabling Fast Data Loading for GPU Serverless Computing. ACM Transactions on Architecture and Code Optimization, 2025.
- [TACO ’25] Taming Flexible Job Packing in Deep Learning Training Clusters. ACM Transactions on Architecture and Code Optimization, 2025.
- [TACO ’25] ARACHNE: Optimizing Distributed Parallel Applications with Reduced Inter-Process Communication. ACM Transactions on Architecture and Code Optimization, 2025.
- [TACO ’25] Ares: Fair and Efficient Scheduling of Deep Learning Jobs with Elastic Fair Queuing. ACM Transactions on Architecture and Code Optimization, 2025.
- [APPT ’25] DACO: Unlocking Latent Dataflow Opportunities in Edge-Side SIMT Accelerators. In International Symposium on Advanced Parallel Processing Technologies, 2025.
2024
- [arXiv] A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters. arXiv preprint arXiv:2403.16125, 2024.
- [TC ’24] Accelerating Sparse DNNs Based on Tiled GEMM. IEEE Transactions on Computers, 2024.
- [arXiv] The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving. arXiv preprint arXiv:2405.11299, 2024.
- [TC ’24] Adaptive Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. IEEE Transactions on Computers, 2024.
2023
- [OSDI ’23] Optimizing Dynamic Neural Networks with Brainstorm. In 17th USENIX Symposium on Operating Systems Design and Implementation, 2023.
- [TC ’23] Improving Cluster Utilization Through Adaptive Resource Management for Deep Neural Network and CPU Jobs Colocation. IEEE Transactions on Computers, 2023.
- [SoCC ’23] Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. In Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023.
- [ICPADS ’23] Microless: Cost-Efficient Hybrid Deployment of Microservices on IaaS VMs and Serverless. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems, 2023.
- [CF ’23] AdaptGear: Accelerating GNN Training via Adaptive Subgraph-Level Kernels on GPUs. In Proceedings of the 20th ACM International Conference on Computing Frontiers, 2023.
2022
- [ATC ’22] DVABatch: Diversity-Aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference, 2022.
- [HPCA ’22] Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. In 2022 IEEE International Symposium on High-Performance Computer Architecture, 2022.
- [TC ’22] ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management. IEEE Transactions on Computers, 2022.
- [ICS ’22] PAME: Precision-Aware Multi-Exit DNN Serving for Reducing Latencies of Batched Inferences. In Proceedings of the 36th ACM International Conference on Supercomputing, 2022.
2021
- [TC ’21] Toward QoS-Awareness and Improved Utilization of Spatial Multitasking GPUs. IEEE Transactions on Computers, 2021.
- [ICCD ’21] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks. In 2021 IEEE 39th International Conference on Computer Design, 2021.
- [SC ’21] Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
2020
- [TPDS ’20] E²bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. IEEE Transactions on Parallel and Distributed Systems, 2020.
- [ICDCS ’20] CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs. In 2020 IEEE 40th International Conference on Distributed Computing Systems, 2020.
2019
- [ICCD ’19] Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services. In 2019 IEEE 37th International Conference on Computer Design, 2019.
- [ICS ’19] Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters. In Proceedings of the ACM International Conference on Supercomputing, 2019.