GPU Kernel Optimization: Designed a Monte Carlo Tree Search (MCTS)–based agent for automated GPU kernel optimization.
Automatic GitHub Issue Resolution: Developed a task-graph–based multi-agent framework enabling precise plan execution; achieved state-of-the-art performance on SWE-bench-lite, resolving 28.33% of issues (June 4–17, 2024).
Program Synthesis for Locality Analysis: Proposed and implemented an input–output–example–driven, syntax-guided synthesis framework for program locality analysis; designed a domain-specific language (DSL) and a unification-based search algorithm to efficiently explore the program space.
2. LLM Infra (Team Leader) [DATE'26] 2025.01–2026.04
Developed a learned configuration generator for dynamic-shape Triton GEMM kernels, integrated with vLLM to deliver 1.37×/1.42× average end-to-end latency speedups on LLaMA-3.1/Qwen3 dynamic-batching workloads.
Implementing an inference/training framework for dense and MoE models that overlaps distributed computation and communication at tile granularity.
3. SLM Training (Team Leader) 2025.06–2026.04
Applied RL, fine-tuning, continued pre-training, and evaluation to Qwen3 models for summarizing endurance-running activities, using synthetic Huawei Watch-style wearable data.
Investigating structured pruning and new model architectures (DeepSeek-V3.2, Qwen3-Next, Mixture-of-Lookup-Experts) for edge devices.
4. Static Analysis for Memory Safety 2022.05–2024.12
Explored techniques for automatically reasoning about program properties (sparse value-flow analysis, abstract interpretation, etc.).
Implemented tools that identify memory bugs, such as null pointer dereferences and memory leaks, in large-scale industrial codebases.