Convolution Kernel Profiling
Jan 26, 2026
I optimized a convolution kernel by removing shared-memory bottlenecks, moving suitable constants to __constant__ memory, and applying load widening. These changes reduced warp stalls and increased ALU utilization (from roughly 5% to 50%), yielding the measured latency improvement.
Highlights
- Memory strategy: moved kernel constants to `__constant__` where beneficial (intra-warp broadcast).
- Load widening: `#pragma unroll` and autotuning further alleviated MIO throttle stalls.
- Profiling: Nsight Compute (NCU) for hotspots and SASS/PTX inspection for instruction mix.
- Outcome: Reduced stalls, higher ALU throughput, and small codegen shifts (more integer ops).
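The constant-memory and unrolling points above can be sketched in a minimal 1D convolution kernel. This is an illustrative example, not the repository's actual kernel: the names `TAPS`, `c_filter`, and `conv1d` are assumptions, and the real kernels in ConvProfNCU are 2D and more involved.

```cuda
#define TAPS 7

// Filter taps in __constant__ memory: when all threads in a warp read
// the same tap, the constant cache serves it as a single broadcast
// instead of per-thread global-memory loads.
__constant__ float c_filter[TAPS];

__global__ void conv1d(const float* __restrict__ in,
                       float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n - TAPS) {
        float acc = 0.0f;
        // Fully unrolling the tap loop lets the compiler widen and
        // reorder the input loads, easing MIO throttle stalls.
        #pragma unroll
        for (int k = 0; k < TAPS; ++k)
            acc += in[i + k] * c_filter[k];
        out[i] = acc;
    }
}
```

Host code would copy the taps in with `cudaMemcpyToSymbol(c_filter, h_filter, TAPS * sizeof(float))` before launching; the effect of the unroll on the generated load instructions can be confirmed in the SASS view in Nsight Compute.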
See the kernels: ConvProfNCU/kernels and the full analysis: profiling_summary.md.