You can view some of my work below or on my GitHub.
I began working on CUDA in November 2025 and have since implemented and profiled 50+ CUDA kernels, using Nsight Compute to deliver measurable speedups (e.g., ~30% on Flash MHA). Prior to focusing on CUDA, I worked on Computer Vision for microscopic imaging.
In addition to CUDA, I am interested in Machine Learning & Data Centers.
The first profiling run for Sparse MoE shows fused kernels vastly outperforming unfused ones: the unfused workflow spends ~2034 ms, dominated by the WMMA up_proj/SwiGLU/down_proj kernels, while the fused baseline runs in ~54 ms and the capacity-aware version in ~37 ms. Capacity-aware per-expert buffering substantially improves the compute/memory balance (≈46% speedup vs. the fused baseline) by reducing redundant DRAM↔shared-memory copies and improving locality.
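To illustrate the routing policy behind capacity-aware per-expert buffering, here is a minimal host-side sketch. The function name and drop-overflow policy are illustrative assumptions, not the project's actual code; the point is that bounding each expert's buffer is what lets the GPU kernel stage a fixed-size tile in shared memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: each expert gets a fixed-capacity buffer and
// tokens beyond capacity are dropped (a common Sparse-MoE policy).
// Bounding per-expert work like this is what allows fixed-size
// shared-memory staging instead of repeated DRAM<->shared copies.
std::vector<std::vector<int>> route_with_capacity(
    const std::vector<int>& expert_of_token, int num_experts, size_t capacity) {
    std::vector<std::vector<int>> buffers(num_experts);
    for (int tok = 0; tok < (int)expert_of_token.size(); ++tok) {
        int e = expert_of_token[tok];
        if (buffers[e].size() < capacity)   // overflow tokens are dropped
            buffers[e].push_back(tok);
    }
    return buffers;
}
```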
CUDA benchmark suite implementing 12 parallel-reduction kernels with comprehensive Nsight Compute analysis and charts comparing latency, memory throughput, scheduler stats, and instruction/source counters. The results show bandwidth-bound behavior where the highest DRAM throughput (the atomic_global kernel) yields the best runtimes, while divergence and instruction overhead (e.g., interleaved_addr_divergent_branch) severely hurt performance.
Compared two block sizes for my custom int8 Flash Attention kernel. Top line: Br=32 runs in 9.06 ms vs 12.17 ms for Br=64, ≈25% faster. Memory throughput: ≈69% vs ≈50% of peak (~38% higher); L2 throughput ≈2.1× higher. Both compute and memory utilization rose from ≈50% to ≈69% of peak. Br=64 uses more shared memory and registers per block, reducing resident blocks per SM and amplifying load imbalance and the unavoidable per-warp accumulation barriers; Br=32 fits more blocks per SM, boosts L2 locality, and yields better throughput.
Flash Attention with 8×32×16 WMMA tiles and a d-axis warp work split enables more warps per block, raising occupancy to 100% (8→16 warps per scheduler) for Br=64. SRAM pressure, however, rules out padding, leading to shared-memory bank conflicts. Includes a detailed Nsight Compute analysis of the parallelism trade-offs.
I profiled the standard fused Flash Attention kernel against a Tensor Core–optimized version. Result: 30.7% runtime speedup (8.33 ms → 5.77 ms) using WMMA for the matmuls, where one warp owns a 16×d chunk of Q and processes a single 16×16×16 tile at a time.
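The reason Flash Attention can process K/V tiles one at a time is its streaming (online) softmax: track a running max and sum, rescaling the partial sum whenever a new tile raises the max. A minimal host-side sketch of that recurrence, with an illustrative function name, computing just the softmax denominator:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Online softmax over tiles: m is the running max, l the running sum.
// When a tile raises the max, the old partial sum is rescaled by
// exp(m_old - m_new) so the final l equals sum_i exp(s_i - max_i s_i).
float streaming_softmax_denominator(const std::vector<float>& scores,
                                    size_t tile) {
    float m = -INFINITY, l = 0.0f;
    for (size_t start = 0; start < scores.size(); start += tile) {
        size_t end = std::min(start + tile, scores.size());
        float m_new = m;
        for (size_t i = start; i < end; ++i) m_new = std::max(m_new, scores[i]);
        l *= std::exp(m - m_new);                       // rescale old sum
        for (size_t i = start; i < end; ++i) l += std::exp(scores[i] - m_new);
        m = m_new;
    }
    return l;
}
```

In the real kernel the same rescaling factor is applied to the output accumulator, which is what removes the need to materialize the full attention matrix.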
Optimized a custom radix-sort pipeline using Nsight Compute. Guided by Warp State Statistics and uncoalesced-global-access metrics, I reduced long-scoreboard stalls by over 50%, illustrating GPU memory-hierarchy and optimization trade-offs.
Whenever you interact with ChatGPT, Gemini, or any modern LLM, one of the final inference steps on the critical path for response speed is top-p (nucleus) sampling, the algorithm that selects the next token from the model's vocabulary distribution.
I've conducted CUDA kernel profiling analysis of this critical sampling operation. The results reveal a clear performance bottleneck.
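To make the profiled operation concrete, here is a minimal host-side sketch of nucleus selection (function name illustrative): sort probabilities descending and keep the smallest prefix whose cumulative mass reaches p; a sampler would then draw from these indices after renormalizing. This sort-plus-scan over a vocabulary of ~100k tokens is exactly the part worth profiling on the GPU.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Top-p (nucleus) selection: returns indices of the smallest set of
// highest-probability tokens whose cumulative probability >= p.
std::vector<int> top_p_indices(const std::vector<float>& probs, float p) {
    std::vector<int> idx(probs.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });
    std::vector<int> nucleus;
    float cum = 0.0f;
    for (int i : idx) {
        nucleus.push_back(i);
        cum += probs[i];
        if (cum >= p) break;   // nucleus is complete
    }
    return nucleus;
}
```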
Profiling analysis of unfused MHA (mma + softmax + mma) compared to a Flash Attention kernel where one warp owns four rows of Q. Nsight Compute highlights three primary bottlenecks: bank conflicts, low occupancy, and sub‑optimal grid utilization.
Optimized a convolution kernel using shared memory, constant memory, and load widening. Latency decreased from 4.4 ms to 3.0 ms (≈32%) via Nsight Compute profiling and targeted bottleneck fixes.
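A host-side reference is the usual way to validate such a kernel against ground truth. A 1D sketch with zero-padded borders (an assumed boundary policy; the 2D version follows the same pattern with nested loops):

```cpp
#include <vector>

// Reference 1D convolution with an odd-length filter and zero padding.
// On the GPU, the filter would live in constant memory and the input
// tile (plus halo) in shared memory.
std::vector<float> conv1d_ref(const std::vector<float>& in,
                              const std::vector<float>& filt) {
    int r = (int)filt.size() / 2;                 // filter radius
    std::vector<float> out(in.size(), 0.0f);
    for (int i = 0; i < (int)in.size(); ++i)
        for (int j = -r; j <= r; ++j) {
            int k = i + j;
            if (k >= 0 && k < (int)in.size())     // zero-padded border
                out[i] += in[k] * filt[j + r];
        }
    return out;
}
```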
Progressive CUDA GEMM suite implementing 13 kernels, from naive and anti-patterns to shared/register tiling, float4/vectorized kernels, autotuned tile sizes, warp-level WMMA, and cp.async double-buffered pipelines. Includes cuBLAS references. The unified runner records GFLOPS, GB/s, arithmetic intensity (AI), and occupancy across matrix sizes (1K–8K), and generates roofline plots plus Markdown performance tables.
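The core idea the suite's tiled kernels share can be sketched on the CPU as cache blocking: compute C = A·B in TILE×TILE blocks so each block of A and B is reused many times once loaded (on the GPU, loaded into shared memory). A minimal sketch, assuming row-major N×N matrices with N a multiple of TILE; names and the tile size are illustrative:

```cpp
#include <vector>

constexpr int TILE = 4;   // illustrative tile size

// Blocked GEMM: C += A * B, one TILE x TILE block of C at a time,
// accumulating over TILE-wide slabs of the k dimension.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int N) {
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE)
            for (int bk = 0; bk < N; bk += TILE)          // k-slab loop
                for (int i = bi; i < bi + TILE; ++i)
                    for (int j = bj; j < bj + TILE; ++j) {
                        float acc = C[i * N + j];
                        for (int k = bk; k < bk + TILE; ++k)
                            acc += A[i * N + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}
```

The float4/vectorized and cp.async variants in the suite change how the tiles are moved, not this loop structure.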
This project implements an AI agent for solving International Mathematical Olympiad (IMO)-level problems, inspired by the self-verification pipeline described in the paper "Gemini 2.5 Pro Capable of Winning Gold at IMO 2025". The agent uses a multi-stage iterative process with rigorous verification to generate and refine solutions. Built with LangGraph for workflow orchestration and supports the A2A protocol for multi-agent collaboration.