Research Landscape

Scalable GPU & AI Hardware Systems

Multi-GPU memory systems, GNN acceleration, hybrid DNN parallelism, LLM serving, KV-cache reuse, and scheduling.

Edge AI Systems

Adaptive continual learning, self-supervised training, sparse computation, and real-time inference under tight resource budgets.

AI for Computer Vision

Dynamic patchification, token pruning, visual representation learning, 3D Gaussian splatting, and efficient video generation.

Quantum Computing Systems

Fault-tolerant compilation, photonic graph-state generation, qLDPC decoding, and quantum-classical acceleration.

Scalable GPU and AI Hardware Systems

Active

Building the architecture, memory, and runtime systems for foundation-scale AI. Our work targets the systems barriers that limit multi-GPU platforms and AI workloads: distributed address translation, page placement, inter-GPU data movement, GNN irregularity, hybrid DNN parallelism, attention execution, KV-cache growth, and multi-tenant interference. Looking forward, we are developing RAG-aware serving runtimes, predictive KV reuse, prefill-decode disaggregation support, and accelerator mechanisms for speculative decoding and near-memory KV-cache lookup.
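The chunk-level KV reuse direction can be illustrated with a toy sketch: cache the prefill result of each retrieved chunk under a content hash, so chunks shared across RAG requests skip recomputation. All names here (`ChunkKVCache`, `fake_prefill`) are hypothetical stand-ins for illustration, not the system's actual API, and real KV states are per-layer attention tensors rather than lists.

```python
import hashlib

class ChunkKVCache:
    """Toy chunk-level KV reuse: cache per-chunk 'KV states' keyed by a
    hash of the chunk text, so a chunk retrieved by multiple RAG requests
    is prefilled only once. (Illustrative sketch, not the real system.)"""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, chunk: str) -> str:
        return hashlib.sha256(chunk.encode()).hexdigest()

    def get_or_compute(self, chunk: str, compute_kv):
        k = self._key(chunk)
        if k in self.store:
            self.hits += 1          # reuse cached prefill result
        else:
            self.misses += 1        # pay the prefill cost once
            self.store[k] = compute_kv(chunk)
        return self.store[k]

def fake_prefill(chunk: str):
    # Stand-in for running prefill attention over the chunk's tokens.
    return [ord(c) for c in chunk]

cache = ChunkKVCache()
for request_chunks in (["doc-A", "doc-B"], ["doc-B", "doc-C"]):
    for ch in request_chunks:
        cache.get_or_compute(ch, fake_prefill)

print(cache.hits, cache.misses)  # → 1 3  (doc-B reused on the second request)
```

The hash key makes reuse position-independent across requests, which is what distinguishes chunk-level reuse from simple prefix caching.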

GPU Architecture · GNN Acceleration · LLM Systems · KV Cache · RAG Serving · Runtime Systems

Selected Publications

  • arXiv — Chunk-Level KV Cache Reuse for Efficient RAG Serving
  • HPCA 2025 — OASIS: Object-Aware Page Management for Multi-GPU Systems
  • MLSys 2025 — FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
  • ASPLOS 2025 — Cascade: A Dependency-Aware Efficient Training Framework for Temporal GNNs
  • ICML 2025 — MemFreezing: Adversarial Attack on Temporal GNNs under Limited Future Knowledge
  • HPCA 2024 — GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement
  • MICRO 2024 — STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU
  • HPCA 2023 — CEGMA: Coordinated Elastic Graph Matching Acceleration for Graph Matching Networks

Edge AI Systems

Active

Enabling AI models to adapt continuously on mobile GPUs and IoT-class devices without exceeding strict memory, latency, energy, and thermal budgets. Our work restructures learning algorithms and system execution to reduce redundant computation, freeze or prune low-value work, and preserve responsiveness as edge environments evolve.
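The freeze-low-value-work idea can be sketched with a toy policy: stop updating layers whose parameters have effectively stopped changing, so their backward passes can be skipped on a constrained device. The threshold rule below is a hypothetical stand-in; the published SmartFRZ approach uses a learned attention-based predictor rather than a fixed cutoff.

```python
def select_frozen_layers(layer_deltas, threshold=1e-3):
    """Toy freezing policy: freeze any layer whose recent average
    parameter change fell below a threshold. layer_deltas[i] is a
    running measure of how much layer i's weights moved this epoch."""
    return [i for i, delta in enumerate(layer_deltas) if delta < threshold]

# Early layers of a fine-tuned model often converge first:
deltas = [5e-2, 8e-4, 2e-4, 3e-2]
frozen = select_frozen_layers(deltas)
print(frozen)  # → [1, 2]
```

In a real training loop, the selected layers would have gradients disabled (e.g., `requires_grad_(False)` in PyTorch), eliminating their backward computation and optimizer state updates.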

Edge Computing · On-Device Learning · Self-Supervised Learning · Hardware-Software Co-design · Efficient Inference

Selected Publications

  • arXiv — EdgeOL: Efficient In-situ Online Learning on Edge Devices
  • ICLR 2025 — Mutual Effort for Efficiency: Similarity-Based Token Pruning for Vision Transformers
  • ICLR 2024 — Waxing-and-Waning: Efficient Self-Supervised Learning
  • ICLR 2023 Spotlight — SmartFRZ: Attention-Based Layer Freezing for Efficient Training
  • DAC 2024 — LOTUS: Learning-Based Online Thermal and Latency Variation Management for Edge Devices

AI for Computer Vision

Active

Developing systems techniques for high-quality visual intelligence and generation. A central direction is adaptive video generation, where patchification and pruning co-evolve with the denoising process so computation follows the regions, frames, and temporal dynamics that matter most. We also study hardware-aware acceleration for 3D Gaussian splatting, efficient visual representation learning, and low-precision vision training.
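Similarity-based token pruning can be sketched in a few lines: drop a token whenever it is nearly collinear with one already kept, so downstream attention runs over fewer tokens. The greedy threshold policy below is a hypothetical simplification for illustration, not the published method's actual scoring.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def prune_similar_tokens(tokens, sim_threshold=0.95):
    """Toy pruning policy: keep a token only if it is sufficiently
    dissimilar from every token kept so far (greedy simplification)."""
    kept = []
    for t in tokens:
        if all(cosine(t, k) < sim_threshold for k in kept):
            kept.append(t)
    return kept

# The second token is nearly a duplicate of the first and gets pruned:
toks = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(len(prune_similar_tokens(toks)))  # → 2
```

For video diffusion, the same redundancy argument applies across frames as well as within them, which is why pruning decisions can co-evolve with the denoising schedule.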

Computer Vision · Video Diffusion · Dynamic Patchification · 3D Gaussian Splatting · Efficient Training

Selected Publications

  • CVPR 2026 — Content-Aware Dynamic Patchification for Efficient Video Diffusion
  • arXiv — Accelerating 3D Gaussian Splatting with Tensor Cores
  • ICLR 2025 — Mutual Effort for Efficiency: Similarity-Based Token Pruning for Vision Transformers
  • ICLR 2024 — Waxing-and-Waning: Efficient Self-Supervised Learning
  • ECCV 2022 — Generator-Free Low-Precision DNN Training with Stochastic Rounding

Quantum Computing Systems

Active

Building compiler, architecture, and classical-acceleration support for scalable and fault-tolerant quantum computing. Our work targets noisy and heterogeneous hardware, including photonic systems and future multi-species platforms, by mapping quantum programs into reliable execution plans, reducing re-execution overheads, and accelerating simulation, decoding, and error-correction loops with GPUs.
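The real-time decoding step can be illustrated on the smallest possible example: a lookup-table decoder for the 3-qubit repetition code. This is a deliberately tiny stand-in for qLDPC decoding, where the same syndrome-to-correction mapping must run at scale and under tight latency, which is what motivates GPU acceleration of the loop.

```python
def decode_repetition(syndrome):
    """Toy decoder for the 3-qubit repetition code. The two syndrome
    bits are the parity checks (q0 XOR q1) and (q1 XOR q2); the table
    maps each syndrome to the most likely single-qubit error, or None
    when no error is detected."""
    table = {
        (0, 0): None,  # trivial syndrome: no correction
        (1, 0): 0,     # only the first check fires -> flip on q0
        (1, 1): 1,     # both checks fire -> flip on q1
        (0, 1): 2,     # only the second check fires -> flip on q2
    }
    return table[tuple(syndrome)]

# A bit flip on the middle qubit triggers both parity checks:
print(decode_repetition([1, 1]))  # → 1
```

Real qLDPC codes replace this four-entry table with iterative belief-propagation-style decoding over a sparse check matrix, whose structure maps naturally onto GPU parallelism.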

Quantum Compilers · Photonic Computing · Fault Tolerance · qLDPC Decoding · Quantum-Classical Systems

Selected Publications

  • ICCAD 2025 — STMC: Small-Tile Multiple-Copy Compilation for Reliable Measurement-Based Quantum Computing
  • ISCA 2025 — Reinforcement Learning-Guided Graph State Generation in Photonic Quantum Computers
  • ASPLOS 2024 — FMCC: Flexible Measurement-Based Quantum Computation over Cluster State
  • ASPLOS 2024 — QRCC: Evaluating Large Quantum Circuits on Small Quantum Computers
  • HPCA 2022 — Q-GPU: Optimizations for Quantum Circuit Simulation Using GPUs