Radix Sort & Top-P Sampling Kernels in LLMs
Feb 17, 2026
Whenever you interact with ChatGPT, Gemini, or any modern LLM, the final inference step determining response speed is top-p (nucleus) sampling - the algorithm that selects tokens from the model’s vocabulary.
I’ve conducted CUDA kernel profiling analysis of this critical sampling operation. The results reveal a clear performance bottleneck.
📊 Radix sort dominates execution time (71-83% of total processing) and becomes increasingly problematic with larger vocabularies, reaching up to 83% of total time at 131K vocabulary size.
This bottleneck presents a significant optimization opportunity for improving LLM inference latency.
You can see the full analysis and code: here.