LLM System Tuning & Low-Level Inference. Studying the GPU memory hierarchy (HBM vs. SRAM), OS-style paging as applied in PagedAttention, and Transformer internals (GQA/RoPE) for scalable serving.
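A minimal sketch of the OS-paging idea behind PagedAttention: the KV cache is split into fixed-size physical blocks, and each sequence keeps a block table (a per-sequence "page table") mapping logical positions to physical blocks, so sequences grow without contiguous pre-allocation. All names here (`BlockAllocator`, `block_size`, etc.) are illustrative, not vLLM's actual API.

```python
# Illustrative sketch of PagedAttention-style block tables (not vLLM's real API).
# Physical KV-cache memory is divided into fixed-size blocks; each sequence
# keeps a "page table" of the physical blocks backing its logical positions.

class BlockAllocator:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size          # tokens stored per KV block
        self.free = list(range(num_blocks))   # free physical block IDs

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted; preempt or swap a sequence")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)              # freed blocks are reusable at once


class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []      # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full, so
        # internal fragmentation is bounded by block_size - 1 tokens per sequence.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


alloc = BlockAllocator(num_blocks=8, block_size=16)
seq = Sequence(alloc)
for _ in range(40):          # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)       # physical IDs need not be contiguous or ordered
```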
Currently hacking on
Benchmarking inference servers, with a focus on compute-bound vs. memory-bound analysis and on implementing high-throughput scheduling (continuous batching) for deep learning models.
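A rough roofline-style sketch of the compute-bound vs. memory-bound split: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the hardware's ridge point (peak FLOP/s divided by peak bandwidth). The A100 peaks used here (312 TFLOP/s dense BF16, ~2 TB/s HBM) are published specs; the per-kernel byte counts are simplified estimates, not profiler measurements.

```python
# Back-of-envelope roofline check for LLM serving kernels.

PEAK_FLOPS = 312e12    # A100 dense BF16 peak, FLOP/s (published spec)
PEAK_BW    = 2.0e12    # A100 80GB HBM2e bandwidth, bytes/s (~2 TB/s)
RIDGE = PEAK_FLOPS / PEAK_BW   # ridge point: ~156 FLOPs per byte

def regime(flops: float, bytes_moved: float) -> str:
    intensity = flops / bytes_moved
    bound = "memory-bound" if intensity < RIDGE else "compute-bound"
    return f"intensity={intensity:.1f} FLOPs/B -> {bound}"

d = 8192   # hidden size (illustrative)

# Decode-step GEMV: one token against a d x d BF16 weight matrix does
# ~2*d*d FLOPs while streaming 2*d*d bytes of weights -> intensity ~1.
print("decode GEMV: ", regime(2 * d * d, 2 * d * d))

# Prefill GEMM: a batch of n tokens reuses the same weights, so intensity
# grows with n and the kernel crosses into the compute-bound regime.
n = 4096
weight_bytes = 2 * d * d          # BF16 weights
act_bytes = 2 * (n * d) * 2       # input + output activations, BF16
print("prefill GEMM:", regime(2 * n * d * d, weight_bytes + act_bytes))
```

This asymmetry is why batching matters: decode alone leaves the GPU starved for work per byte of weights read, while batched prefill (or continuous batching across requests) amortizes the same weight traffic over many tokens.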
Available for
High-performance LLM serving, MLOps, and GPU performance tuning. Say hey if you're optimizing latency, designing PagedAttention-style systems, or debugging memory-bound bottlenecks.