Auto-tuning framework for Triton-based kernels with dynamic shapes, integrated into vLLM and evaluated on LLaMA3.1 and Qwen3 models, achieving average end-to-end speedups of 1.37× and 1.42×, respectively, under dynamic batching workloads.
Implementing a distributed inference/training framework for dense and MoE models that overlaps computation and communication at the tile level.
GPU Kernel Optimization: Designed a Monte Carlo Tree Search (MCTS)–based agent for automated GPU kernel optimization.
Automatic GitHub Issue Resolution: Developed a task-graph–based multi-agent framework enabling precise plan execution; achieved state-of-the-art performance on SWE-bench-lite, resolving 28.33% of issues (June 4–17, 2024).
Program Synthesis for Locality Analysis: Proposed and implemented an input–output–example–driven, syntax-guided synthesis framework for program locality analysis; designed a domain-specific language (DSL) and a unification-based search algorithm to efficiently explore the program space.
4. Static Analysis for Memory Safety (2022.05–2024.12)
Explores techniques for automatically reasoning about program properties, such as sparse value-flow analysis and abstract interpretation.
Implements tools that detect memory bugs, such as null pointer dereferences and memory leaks, in large-scale industrial codebases.