Projects

1. SLM Training (Team Leader) 2025.06–
  • RL, fine-tuning, continual pre-training, and evaluation with Qwen3 models for running performance summarization.
  • Investigating structured pruning and new model architectures (RWKV, Mamba, mixture-of-lookup-experts) for edge devices.
2. LLM Infra (Team Leader) [DATE'26] 2025.01–
  • Auto-tuning framework for Triton-based kernels with dynamic shapes, integrated into vLLM and evaluated on LLaMA3.1 and Qwen3 models, achieving average end-to-end speedups of 1.37× and 1.42×, respectively, under dynamic batching workloads.
  • Implementing a tile-level framework that overlaps distributed computation and communication for inference and training of dense and MoE models.
3. Coding Agent and Program Synthesis [ICSE-SEIP'26], [ACL-findings'25], [LMPL'25], [TOSEM'24], [TechReport'24], [TechReport'24], [TechReport'24], [Arxiv'24], [DRAFT] 2024.01–
  • GPU Kernel Optimization: Designed a Monte Carlo Tree Search (MCTS)–based agent for automated GPU kernel optimization.
  • Automatic GitHub Issue Resolution: Developed a task-graph–based multi-agent framework enabling precise plan execution; achieved state-of-the-art performance on SWE-bench-lite, resolving 28.33% of issues (June 4–17, 2024).
  • Program Synthesis for Locality Analysis: Proposed and implemented an input–output–example–driven, syntax-guided synthesis framework for program locality analysis; designed a domain-specific language (DSL) and a unification-based search algorithm to efficiently explore the program space.
4. Static Analysis for Memory Safety 2022.05–2024.12
  • Explores techniques to reason about program properties automatically (sparse value-flow analysis, abstract interpretation, etc.).
  • Implements tools to identify memory bugs, such as null-pointer dereferences and memory leaks, in large-scale industrial codebases.
5. Compiler Leasing [TACO'22], [ISMM'21] 2019.01–2022.12
  • Proposes a framework that gives the compiler fine-grained control over data replacement in a cache.
  • Designs and implements an algorithm to derive optimal leases for each reference in a program to minimize cache misses.
6. Static Sampling for Locality Analysis [PLDI'18], [PPoPP'20-poster] 2018.05–2021.12
  • Designs and implements an LLVM compiler pass that predicts the cache performance of loop nests.
  • Specializes loops to enable static profiling of reuse intervals.
7. Write Locality [MEMSYS'16] 2016.01–2016.12
  • Designs and implements a linear-time algorithm to model cache writebacks from the memory access trace of a program.
  • Implements a scheduling algorithm that uses the writeback model to group co-running programs and minimize writebacks.
8. OpenCL Performance Portability [HPCC'13], [Euro-Par'14] 2012.01–2013.12
  • Designs a source-to-source translator based on the LLVM infrastructure.
  • Automatically transforms OpenCL kernels written for GPUs with fine-grained parallelism into vectorized code for CPUs.