- CVPR 2026 Content-Aware Dynamic Patchification for Efficient Video Diffusion [paper]
- arXiv Chunk-Level KV Cache Reuse for Efficient RAG Serving [paper]
- arXiv Accelerating 3D Gaussian Splatting with Tensor Cores [paper]
- arXiv Swap-Free Quantum LDPC Code Mapping on Near-Term Local Architecture [paper]
- arXiv EdgeOL: Efficient in-situ Online Learning on Edge Devices [paper]
- arXiv Non-Clifford Fusion: T-Gate Optimization for Quantum Simulation [paper]
- arXiv Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration [paper]
- ICCAD 2025 STMC: Small-Tile Multiple-Copy Compilation for Reliable Measurement-Based Quantum Computing [paper]
- ICML 2025 MemFreezing: A Novel Adversarial Attack on Temporal Graph Neural Networks under Limited Future Knowledge [paper]
- ICS 2025 CIExplorer: Microarchitecture-Aware Exploration for Tightly Integrated Custom Instruction [paper]
- ISCA 2025 Reinforcement Learning-Guided Graph State Generation in Photonic Quantum Computers [paper]
- MLSys 2025 FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference [paper]
- ASPLOS 2025 Cascade: A Dependency-aware Efficient Training Framework for Temporal Graph Neural Network [paper]
- ASPLOS 2025 Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning [paper]
- ICLR 2025 Mutual Effort for Efficiency: A Similarity-based Token Pruning for Vision Transformers in Self-Supervised Learning [paper]
- HPCA 2025 OASIS: Object-Aware Page Management for Multi-GPU Systems [paper]
- MICRO 2024 STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU [paper]
- ASPLOS 2024 FMCC: Flexible Measurement-based Quantum Computation over Cluster State [paper]
- ASPLOS 2024 QRCC: Evaluating Large Quantum Circuits on Small Quantum Computers through Integrated Qubit Reuse and Circuit Cutting [paper]
- DAC 2024 FCM: Wire Cutting For Fusion Reduction in Measurement-based Quantum Computing [paper]
- DAC 2024 LOTUS: Learning-Based Online Thermal and Latency Variation Management for Two-Stage Detectors on Edge Devices [paper]
- ICLR 2024 Waxing-and-Waning: A Generic Similarity-based Framework for Efficient Self-Supervised Learning [paper]
- HPCA 2024 GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement [paper]
- ICCD 2023 FlexGM: An Adaptive Runtime System to Accelerate Graph Matching Networks on GPUs [paper]
- MICRO 2023 IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE Invalidations [paper]
- MICRO 2023 SupeRBNN: Randomized Binary Neural Network Using Adiabatic Superconductor Josephson Devices [paper]
- DAC 2023 Orchestrated Scheduling and Partitioning for Improved Address Translation in GPUs [paper]
- DAC 2023 Orchestrating Measurement-Based Quantum Computation over Photonic Quantum Processors [paper]
- DAC 2023 EP-ORAM: Efficient NVM-Friendly Path Eviction for Ring ORAM in Hybrid Memory [paper]
- ICLR 2023 Spotlight SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing [paper]
- HPCA 2023 Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding [paper]
- HPCA 2023 CEGMA: Coordinated Elastic Graph Matching Acceleration for Graph Matching Networks [paper]
- HPCA 2023 AB-ORAM: Constructing Adjustable Buckets for Space Reduction in Ring ORAM [paper]
- HPCA 2022 Q-GPU: A Recipe of Optimizations for Quantum Circuit Simulation Using GPUs [paper]
- NeurIPS 2022 Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training [paper]
- ECCV 2022 You Already Have It: A Generator-Free Low-Precision DNN Training Framework using Stochastic Rounding [paper]
- ICCAD 2022 Fine-Granular Computation and Data Layout Reorganization for Improving Locality [paper]
- ICCD 2022 Enhancing GPU Performance via Neighboring Directory Table Based Inter-TLB Sharing [paper]
- TECS 2022 Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration [paper]
- IEEE Micro 2022 Sustainable AI Processing at the Edge [paper]
- WWW 2022 Workshop Optimizing Data Layout for Training Deep Neural Networks [paper]
- EuroSys 2022 Poster Rethinking Latency-aware DNN Design with GPU Tail Effect Analysis [paper]
- CCF THPC 2022 An Efficient Segmented Quantization for Graph Neural Networks [paper]
- arXiv 2022 Demystifying Arch-hints for Model Extraction: An Attack in Unified Memory System [paper]
- MICRO 2021 Improving Address Translation in Multi-GPUs via Sharing and Spilling Aware TLB Design [paper]
- ICCAD 2021 ScaleDNN: Data Movement Aware DNN Training on Multi-GPU [paper]
- ICCAD 2021 Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU [paper]
- SIGMETRICS 2021 Mix and Match: Reorganizing Tasks for Enhancing Data Locality [paper]
- PLDI 2021 Distance-in-Time versus Distance-in-Space [paper]
- PLDI 2021 Fluid: A Framework for Approximate Concurrency via Controlled Dependency Relaxation [paper]
- PPoPP 2021 Compiler Support for Near Data Computing [paper]
- AAAI 2021 YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design [paper]
- AAAI 2021 A Compression-Compilation Co-Design Framework Towards Real-Time Object Detection on Mobile Devices [paper]
- CODES+ISSS 2021 Algorithm-Hardware Co-design of Attention Mechanism on FPGA Devices [paper]
- RTAS 2021 Work in Progress: Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework [paper]
- WWW 2021 Workshop Parallelizing DNN Training on GPUs: Challenges and Opportunities [paper]
- ATS 2021 Towards a Secure Integrated Heterogeneous Platform via Cooperative CPU/GPU Encryption [paper]
- NAS 2021 Characterizing AI Model Inference Applications Running in SGX Environment [paper]
- PACT 2020 Enhancing Address Translations in Throughput Processors via Compression [paper]
- TCAD 2020 Enabling Latency-aware Data Initialization for Integrated CPU/GPU Heterogeneous Platform [paper]
- NeurIPS 2020 Workshop YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design [paper]
- ISCA 2019 Opportunistic Computing in GPU Architectures [paper]
- PLDI 2019 Co-Optimizing Memory-Level Parallelism and Cache-Level Parallelism [paper]
- SIGMETRICS 2019 Quantifying Data Locality in Dynamic Parallelism in GPUs [paper]
- SIGMETRICS 2019 Computing with Near Data [paper]
- SIGMETRICS 2019 Architecture-Aware Approximate Computing [paper]
- HiPC 2019 Architecture-Centric Bottleneck Analysis for Deep Neural Network Applications [paper]
- PLDI 2018 Enhancing Computation-to-Core Assignment with Physical Location Information [paper]
- MASCOTS 2018 Quantifying and Optimizing Data Access Parallelism on Manycores [paper]
- GPGPU 11 @ PPoPP 2018 Oversubscribed Command Queues in GPUs [paper]
- MICRO 2017 Data Movement Aware Computation Partitioning [paper]
- HPCA 2017 Controlled Kernel Launch for Dynamic Parallelism in GPUs [paper]
- MASCOTS 2017 DEMM: A Dynamic Energy-Saving Mechanism for Multicore [paper]
- PACT 2017 Poster Location-Aware Computation Mapping for Manycore Processors [paper]
- MICRO 2016 Improving Bank-Level Parallelism for Irregular Applications [paper]
- PACT 2016 Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities [paper]
- PACT 2016 µC-States: Fine-grained GPU Datapath Power Management [paper]
- PLDI 2015 Optimizing Off-Chip Accesses in Manycores [paper]
- SIGMETRICS 2015 Memory Row Reuse Distance and its Role in Optimizing Application Performance [paper]
- SNPD 2013 A Video Coding Benchmark Suite for Evaluation of Processor Capability [paper]
- PPoPP 2012 Poster FlexBFS: A Parallelism-aware Implementation of Breadth-First Search on GPU [paper]