Matt J. Borowski

Capacity-aware Sparse MoE

Mar 08, 2026

Profiling SparseMoE run1 shows the fused kernels vastly outperforming the unfused path: the unfused workflow spends ~2034 ms, dominated by the WMMA up_proj/Swiglu/down_proj kernels, while the fused baseline variant runs in ~54 ms and the capacity-aware version in ~37 ms. Capacity-aware per-expert buffering substantially improves the compute/memory balance (≈46% speedup over baseline) by reducing redundant DRAM↔shared copies and improving locality.
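The per-expert buffering idea can be sketched as a capacity-limited routing step: each expert gets a fixed number of slots, so staging buffers have a known size and copies stay coalesced. Everything below (names, counts, the capacity value) is illustrative, not the actual capacity.cu kernel:

```cuda
#include <cuda_runtime.h>

constexpr int NUM_EXPERTS      = 8;
constexpr int TOKENS_PER_BLOCK = 128;
constexpr int CAPACITY         = 24;  // assumed: ceil(128/8 * capacity_factor 1.5)

// Assigns each token a slot in its expert's fixed-size buffer, or -1
// if the expert is already full (the overflow token is dropped).
__global__ void route_with_capacity(const int* __restrict__ expert_ids,
                                    int* __restrict__ slot_of_token) {
    // One running counter per expert, shared by the block.
    __shared__ int count[NUM_EXPERTS];
    int t = threadIdx.x;
    if (t < NUM_EXPERTS) count[t] = 0;
    __syncthreads();

    int tok = blockIdx.x * TOKENS_PER_BLOCK + t;  // grid assumed to cover all tokens
    int e = expert_ids[tok];
    int slot = atomicAdd(&count[e], 1);
    // Tokens past capacity get -1; the caller can route them through
    // the residual path instead of an expert.
    slot_of_token[tok] = (slot < CAPACITY) ? e * CAPACITY + slot : -1;
}
```

Because every expert's buffer is exactly CAPACITY slots, the subsequent GEMM over each expert's tokens reads a dense, contiguous region rather than a scattered one.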

Nsight Compute points to uncoalesced cp.async.ca.shared.global copies and WMMA load paths as the primary hotspots, and flags a redundant __syncthreads() after cp.async.wait_group that accounts for ~58% of warp stalls. Next steps are to remove unnecessary barriers, tighten per-expert buffer sizing, and optimize the WMMA kernels for further gains.
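The barrier issue is easiest to see in the async-copy staging pattern. A minimal sketch using CUDA's pipeline primitives, with an assumed kernel structure (the indexing and tile shape are illustrative): after waiting on the copy, exactly one __syncthreads() is needed to publish the tile to all warps, and any second back-to-back barrier is the redundancy Nsight flags.

```cuda
#include <cuda_pipeline.h>

__device__ void stage_tile(float* smem, const float* gmem) {
    // Each thread issues one 16-byte async copy global -> shared.
    int i = threadIdx.x * 4;  // 4 floats = 16 bytes per thread
    __pipeline_memcpy_async(&smem[i], &gmem[i], sizeof(float4));
    __pipeline_commit();
    __pipeline_wait_prior(0);  // this thread's copies have landed
    __syncthreads();           // required: make the tile visible to all warps
    // A second __syncthreads() here would be the redundant barrier:
    // it publishes nothing new and only serializes the warps again.
}
```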

See the full analysis on my GitHub, along with the kernels: unfused.cu, baseline.cu, and capacity.cu.

Highlights

  • Performance: Fused (capacity) dramatically outperforms the unfused workflow.
  • Capacity: Per‑expert buffering reduces redundant DRAM↔shared copies and evens load.
  • Hotspots: cp.async.ca.shared.global copies and WMMA load paths are the primary sources of stalls.
  • Synchronization: Redundant __syncthreads() after cp.async.wait_group drives ~58% of warp stalls.
  • Memory ↔ Compute: the capacity variant shifts the kernel from DRAM‑bound toward better compute/cache utilization.
  • Next Steps: Remove the redundant barrier, tighten per‑expert sizing, and optimize WMMA kernels/padding.
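For the WMMA padding step in particular, the usual fix for conflicted shared-memory loads is to skew the tile's leading dimension. A hedged sketch, with illustrative names and tile sizes rather than the post's actual kernels:

```cuda
#include <mma.h>
using namespace nvcuda;

constexpr int TILE = 16;
constexpr int SKEW = 8;  // extra half elements per row (16 bytes)

__global__ void wmma_padded_tile(const half* __restrict__ A) {
    // Padding the stride to TILE + SKEW = 24 halves (48 bytes, a multiple
    // of 16 bytes) makes consecutive rows start in different banks.
    __shared__ half a_tile[TILE][TILE + SKEW];
    // ... fill a_tile from A with coalesced copies (omitted) ...
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    // The leading dimension passed to load_matrix_sync must be the
    // padded stride, not the logical tile width.
    wmma::load_matrix_sync(a, &a_tile[0][0], TILE + SKEW);
}
```

The trade-off is a slightly larger shared-memory footprint per tile in exchange for conflict-free fragment loads.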

SparseMoE - uncoalesced access