Long Scoreboard Warp Stalls in Radix

Feb 25, 2026

Optimized a custom radix-sort pipeline using Nsight Compute. Profiling Warp State Statistics and uncoalesced global accesses reduced long-scoreboard stalls by over 50%, illustrating GPU memory-hierarchy trade-offs and optimization trade-offs.

Highlights

Root cause: Uncoalesced global loads/stores triggered 11+ cycle stalls per warp (Warp State Statistics).
Fixes attempted: Moving hot inputs to SRAM reduced long-scoreboard stalls by ~50% but surfaced new bottlenecks (barriers, short-scoreboards) that require synchronization and algorithmic attention.
Trade-offs: The prefix kernel incurred ~20% slowdown due to sparse writes — this needs an algorithmic redesign to recover performance while preserving overall correctness.
Outcome: SRAM utilization improved and the radix kernel remained stable; the work highlights the iterative, trade-off-driven nature of CUDA performance tuning.

Radix sort long-scoreboard profile

See the kernel code on GitHub: radix_v2.cu and the full analysis: run2 results.