Matt J. Borowski

Hi there! My name is Matt and I focus on CUDA kernels.

I did my MSc in Applied Mathematics at the University of Oxford.

You can view some of my work below or on my GitHub.

I began working on CUDA in November 2025 and have since implemented and profiled 50+ CUDA kernels, using Nsight Compute to deliver measurable speedups (e.g., ~30% on Flash MHA). Prior to focusing on CUDA, I worked on Computer Vision for microscopic imaging.

In addition to CUDA I am interested in Machine Learning & Data Centers.

Feel free to reach out!

Some of my work

Capacity-aware Sparse MoE

The first profiling run for Sparse MoE shows fused kernels vastly outperforming unfused ones: the unfused workflow spends ~2034 ms dominated by WMMA up_proj/SwiGLU/down_proj kernels, while the fused baseline variant runs in ~54 ms and the capacity-aware version in ~37 ms. Capacity-aware per-expert buffering substantially improves compute/memory balance (≈46% speedup vs. baseline) by reducing redundant DRAM↔shared copies and improving locality.
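The routing idea behind capacity-aware per-expert buffering can be sketched on the CPU. This is a minimal illustration only; the function name and the drop-overflow policy are my simplification, not the kernel's exact behavior:

```python
def capacity_bucket(expert_ids, num_experts, capacity):
    """Scatter token indices into fixed-size per-expert buffers.

    Tokens beyond an expert's capacity are simply dropped here (a
    common simplification); -1 marks unused slots.
    """
    buffers = [[-1] * capacity for _ in range(num_experts)]
    counts = [0] * num_experts
    for tok, e in enumerate(expert_ids):
        if counts[e] < capacity:
            buffers[e][counts[e]] = tok
            counts[e] += 1
    return buffers, counts

# 8 tokens routed to 2 experts, capacity 3: token 5 (expert 0) and
# token 7 (expert 1) overflow their buffers and are dropped.
buffers, counts = capacity_bucket([0, 1, 0, 0, 1, 0, 1, 1], 2, 3)
```

Fixed-size buffers are what make the GPU version efficient: each expert's tokens land contiguously, so the expert GEMMs read dense blocks instead of gathering scattered rows.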

Twelve Parallel Reduction Kernels

CUDA benchmark suite implementing 12 parallel-reduction kernels with comprehensive Nsight Compute analysis and charts comparing latency, memory throughput, scheduler stats, and instruction/source counters. The results show bandwidth-bound behavior where the highest DRAM throughput (the atomic_global kernel) yields the best runtimes, while divergence and instruction overhead (e.g., interleaved_addr_divergent_branch) severely hurt performance.
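For reference, the sequential-addressing variant in the suite boils down to this CPU mirror (a sketch of the indexing pattern only, not the CUDA code itself):

```python
def tree_reduce(vals):
    """CPU mirror of a sequential-addressing block reduction:
    'thread' i adds vals[i + stride], and the stride halves each
    step (assumes len(vals) is a power of two, like a block size)."""
    vals = list(vals)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):  # one 'thread' per active lane
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]
```

Sequential addressing keeps the active lanes contiguous, which is exactly what avoids the divergence penalty the interleaved_addr_divergent_branch kernel pays.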

Int8 Flash Attention: smaller blocks = higher occupancy

Compared two block sizes for my custom int8 Flash Attention kernel. Top-line: Br=32: 9.06 ms vs Br=64: 12.17 ms → ≈25% faster. Memory throughput: ≈69% vs ≈50% → ~38% higher; L2 throughput ≈2.1× higher. Compute & memory % of peak both rose ≈50% → ≈69% (better utilization). Br=64 uses more shared memory/registers per block, reducing resident blocks/SM and amplifying load imbalance and unavoidable per-warp accumulation barriers. Br=32 fits more blocks/SM, boosts L2 locality, and yields better throughput.
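The resident-blocks arithmetic behind the Br=32 vs. Br=64 gap can be sketched like this; the per-SM limits and per-block figures below are illustrative Ampere-class numbers, not the profiled GPU's exact ones:

```python
def blocks_per_sm(smem_per_block, regs_per_thread, threads_per_block,
                  smem_per_sm=100 * 1024, regs_per_sm=65536,
                  max_blocks=32):
    """Resident blocks/SM is the min over the shared-memory limit,
    the register-file limit, and the hardware block limit
    (defaults are illustrative Ampere-class figures)."""
    by_smem = smem_per_sm // smem_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    return min(by_smem, by_regs, max_blocks)

# Doubling the per-block tile roughly halves residency:
small = blocks_per_sm(24 * 1024, 64, 256)  # smaller Br-style tile
large = blocks_per_sm(48 * 1024, 64, 256)  # larger Br-style tile
```

Fewer resident blocks means fewer warps for the scheduler to hide latency with, which is where the Br=64 throughput loss comes from.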

Warp work distribution in Flash Attention

Flash Attention with 8×32×16 WMMA tiles and a d-axis warp work split enables more warps per block, raising occupancy by 100% (8→16 warps per scheduler) for Br=64. SRAM pressure, however, rules out padding, leading to bank conflicts. A detailed Nsight Compute analysis of the parallelism trade-offs.
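A quick way to see why the unpadded tile conflicts: compute the worst-case bank replay factor for one warp (a toy model assuming 4-byte words and 32 banks):

```python
from collections import Counter

def conflict_degree(stride, banks=32, warp=32):
    """Worst-case shared-memory replay factor when thread t of a
    warp reads 4-byte word t*stride (32 banks, one word per bank)."""
    hits = Counter((t * stride) % banks for t in range(warp))
    return max(hits.values())

# Column access of a 32-wide tile hits one bank 32 times; padding
# the row to 33 words would make it conflict-free -- but that
# padding is exactly what the SRAM budget here doesn't allow.
worst = conflict_degree(32)
padded = conflict_degree(33)
```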

Tensor Cores + Multi-Head Attention

I profiled the standard fused Flash Attention kernel against a Tensor Core–optimized version. Result: 30.7% runtime speedup (8.33ms → 5.77ms) using WMMA for the matmuls, where one warp owns a 16×d chunk of Q and processes a single 16×16×16 tile at a time.
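The per-warp tiling scheme has a direct CPU analogue. A rough pure-Python sketch of accumulating C in T×T fragments while sweeping T-wide slabs of K (T=16 mirrors the 16×16×16 mma shape; this is an illustration, not the kernel):

```python
def tiled_matmul(A, B, T=16):
    """Accumulate C = A @ B in T×T output fragments, sweeping T-wide
    slabs of K -- the way one warp iterates 16×16×16 WMMA tiles
    while its accumulator fragment stays resident."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):          # one output fragment
            for k0 in range(0, K, T):      # K-slabs feed the fragment
                for i in range(i0, min(i0 + T, M)):
                    for j in range(j0, min(j0 + T, N)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + T, K)))
    return C
```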

Long Scoreboard Warp Stalls in Radix

Optimized a custom radix-sort pipeline using Nsight Compute. Guided by Warp State Statistics, fixing uncoalesced global accesses cut long-scoreboard stalls by over 50%, illustrating GPU memory-hierarchy trade-offs.
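A toy model of why uncoalesced loads inflate long-scoreboard stalls: count the 32-byte sectors one warp touches per load (assumes aligned 4-byte elements; the numbers are illustrative):

```python
def sectors_per_warp(stride_bytes, warp=32, sector=32):
    """Count the 32-byte sectors one warp touches when thread t
    loads an aligned 4-byte element at byte offset t*stride_bytes.
    More sectors per load means more memory transactions to wait
    on, i.e. longer long-scoreboard stalls."""
    touched = {(t * stride_bytes) // sector for t in range(warp)}
    return len(touched)

coalesced = sectors_per_warp(4)      # unit-stride floats: 4 sectors
scattered = sectors_per_warp(128)    # 128-byte stride: 32 sectors
```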

Radix Sort & Top-P Sampling Kernels in LLMs

Whenever you interact with ChatGPT, Gemini, or any modern LLM, the final inference step on every generated token is top-p (nucleus) sampling, the algorithm that selects the next token from the model's vocabulary. I profiled the CUDA kernels behind this critical sampling operation, and the results reveal a clear performance bottleneck.
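For context, a reference CPU implementation of the filtering step (the GPU version amounts to a sort plus a prefix-sum scan; this sketch is mine, not the kernel code):

```python
import math

def top_p_filter(logits, p):
    """Top-p (nucleus) filtering: keep the smallest set of tokens,
    taken in descending probability order, whose cumulative softmax
    mass reaches p. Sampling then draws only from this set."""
    probs = [math.exp(l) for l in logits]
    z = sum(probs)
    order = sorted(range(len(logits)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i] / z
        if cum >= p:
            break
    return kept
```

The sort over the full vocabulary is the expensive part, which is why radix sort shows up in the same pipeline.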

Unfused Multi-Head Attention vs Flash Attention

Profiling analysis of unfused MHA (mma + softmax + mma) compared to a Flash Attention kernel where one warp owns four rows of Q. Nsight Compute highlights three primary bottlenecks: bank conflicts, low occupancy, and sub‑optimal grid utilization.
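The fusion hinges on the online-softmax rescaling trick, which removes the need to materialize the full score matrix between the two matmuls. A minimal single-pass sketch:

```python
import math

def online_softmax(xs):
    """Single-pass softmax: keep a running max m and a rescaled sum
    l, correcting l by exp(m_old - m_new) whenever the max grows.
    This rescaling is what lets Flash Attention fuse
    mma + softmax + mma into one kernel."""
    m, l = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        l = l * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / l for x in xs]
```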

Convolution Kernel Profiling

Optimized a convolution kernel using shared memory, constant memory, and load widening. Latency decreased from 4.4 ms to 3.0 ms via Nsight Compute profiling and targeted bottleneck fixes.
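The shared-memory staging pattern has a simple CPU mirror: each tile is copied together with its halo, and every output then reads only the local buffer. A 1D sketch (tile size and names are illustrative, not from the kernel):

```python
def conv1d_tiled(x, w, tile=8):
    """Valid-mode 1D convolution computed tile by tile. Each 'block'
    stages its tile plus a halo of len(w)-1 extra elements (the
    shared-memory copy), then outputs read only that local buffer."""
    r = len(w) - 1           # halo width
    n = len(x) - r           # number of valid outputs
    out = [0] * n
    for t0 in range(0, n, tile):
        width = min(tile, n - t0)
        local = x[t0:t0 + width + r]   # tile + halo
        for i in range(width):
            out[t0 + i] = sum(local[i + k] * w[k] for k in range(len(w)))
    return out
```

On the GPU this staging pays off because each input element is loaded from DRAM once per tile instead of once per output that touches it.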

GEMM Evolution: CUDA suite with 13 kernels

Progressive CUDA GEMM suite implementing 13 kernels, from naive and anti-patterns to shared/register tiling, float4/vectorized kernels, autotuned tile sizes, warp-level WMMA, and cp.async double-buffered pipelines. Includes cuBLAS references. The unified runner records GFLOPS, GB/s, arithmetic intensity (AI), and occupancy across matrix sizes (1K–8K), and generates roofline plots plus Markdown performance tables.
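The AI column in those tables follows the standard roofline formula; a sketch under the ideal-caching assumption that each matrix moves between DRAM and the SM exactly once:

```python
def gemm_ai(M, N, K, bytes_per_elem=4):
    """Arithmetic intensity of C = A @ B: 2*M*N*K FLOPs over the
    ideal-caching traffic of reading A and B and writing C once."""
    flops = 2 * M * N * K
    traffic = bytes_per_elem * (M * K + K * N + M * N)
    return flops / traffic

# For square fp32 GEMM this reduces to n/6, so AI grows linearly
# with n and the larger sizes in the 1K-8K sweep sit closer to the
# compute roof.
ai_4k = gemm_ai(4096, 4096, 4096)
```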

Math Olympiad Agent: Self-Verification for IMO Problems

This project implements an AI agent for solving International Mathematical Olympiad (IMO)-level problems, inspired by the self-verification pipeline described in the paper "Gemini 2.5 Pro Capable of Winning Gold at IMO 2025". The agent uses a multi-stage iterative process with rigorous verification to generate and refine solutions. Built with LangGraph for workflow orchestration, it supports the A2A protocol for multi-agent collaboration.