PyTorch Approx Top-k

Approximate algorithms for computing top-k faster on machine learning accelerators, using bucketing to increase parallelism. Rather than computing a single top-k over the sequence (sketched in plain PyTorch below):

  1. split the sequence into $b$ interleaved buckets
  2. take the top $k_b$ elements from each bucket
  3. if $k_b \cdot b > k$: take a final top-k over the $k_b \cdot b$ candidates
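
A minimal sketch of these steps using plain PyTorch ops, for illustration only. The function name and shapes here are assumptions, not part of the library's API, and the fused CUDA kernel is much faster:

import torch

def bucketed_topk_sketch(x: torch.Tensor, k: int, b: int, k_b: int):
    # Approximate top-k along the last dim of x using b interleaved buckets.
    n = x.shape[-1]
    assert n % b == 0 and k_b * b >= k
    # 1. split into b interleaved buckets: bucket i holds x[..., i::b]
    buckets = x.view(*x.shape[:-1], n // b, b).transpose(-1, -2)  # (..., b, n // b)
    # 2. take the top k_b elements from each bucket
    vals, idx = buckets.topk(k_b, dim=-1)  # (..., b, k_b)
    # map bucket-local indices back to positions in the original sequence
    bucket_ids = torch.arange(b, device=x.device).view(*([1] * (x.dim() - 1)), b, 1)
    idx = (idx * b + bucket_ids).flatten(-2)  # (..., b * k_b)
    vals = vals.flatten(-2)
    # 3. if k_b * b > k: reduce the candidates with a final exact top-k
    if k_b * b > k:
        vals, sel = vals.topk(k, dim=-1)
        idx = idx.gather(-1, sel)
    return vals, idx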

You can get substantial speedups (several times faster) with little loss in recall! See our paper for detailed benchmarks and analysis of the cost/quality trade-off:

Approximate Top-k for Increased Parallelism; O Key, L Ribar, A Cattaneo, L Hudlass-Galley, D Orr

The implementation is quite fast, but we welcome contributions from CUDA experts. In Figure 1, we compare against torch.argmax(), which is a reasonable upper bound on how fast this kernel could be. There's still room for improvement!

Using the library

Requires: Python >3.10, PyTorch >=2.4, Ninja (ninja-build), CUDA toolkit matching your version of PyTorch

pip install git+https://github.com/graphcore-research/pytorch-approx-topk.git

Usage:

from approx_topk import topk as approx_topk
import torch

# A batch of 128 rows, each of length 2^20, on the GPU.
x = torch.randn(128, int(2**20), device="cuda")
# Approximate top-k over the last dim: j elements per bucket, k_mult * k candidates.
values, indices = approx_topk(x, k=int(2**16), dim=-1, j=2, k_mult=1)

(the kernel is compiled on first use, which might take a while)

Note that, when comparing to the paper, j is $k_b$ and k_mult is $k_b \cdot b / k$.
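
For example, under this mapping the call above ($k = 2^{16}$, j=2, k_mult=1) splits each $2^{20}$-element row into $b = 2^{15}$ buckets of 32 elements, and since $k_b \cdot b = k$ no final top-k is needed. Setting k_mult=2 instead would use $2^{16}$ buckets, yield $2^{17}$ candidates, and finish with an exact top-k down to $k$.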

Repository highlights

  • approx_topk.priority_queue: main CUDA kernel supporting $k_b \in \{1, 2, 4\}$, implemented using a priority queue algorithm
  • approx_topk.experimental.bucketed_argmax: implementations for $k_b = 1$ only, using torch.argmax() and custom Triton kernels (see the sketch after this list)
  • benchmarks.measure_speed: benchmarks speed of our implementation vs exact top-ks (Figure 1 in paper)
    • requires additional dependencies, see below
  • notebooks: experimental results notebooks (theoretical performance analysis, figure plotting)
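
For the $k_b = 1$ case each bucket contributes only its maximum, so the candidate set can be built from one argmax per bucket. A minimal sketch under that assumption, for illustration only (this is not the library's torch.argmax() or Triton implementation):

import torch

def topk_via_bucketed_argmax_sketch(x: torch.Tensor, k: int):
    # k_b = 1: one interleaved bucket per output element, one argmax per bucket.
    n = x.shape[-1]
    assert n % k == 0
    buckets = x.view(*x.shape[:-1], n // k, k).transpose(-1, -2)  # (..., k, n // k)
    local = buckets.argmax(dim=-1)  # (..., k)
    values = buckets.gather(-1, local.unsqueeze(-1)).squeeze(-1)
    # bucket c holds x[..., c::k], so local index r maps to original index r * k + c
    indices = local * k + torch.arange(k, device=x.device)
    return values, indices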

Reproducing benchmarks + development

To set up the environment, install the dependencies:

  • CUDA toolkit 12.4
  • Ninja (ninja-build)
  • Python 3.11
  • Python Poetry

Then run poetry install --with benchmarks

To make it easier to install the CUDA dependencies, we provide an Apptainer image recipe in environment.simg:

  • Build: apptainer build environment.sif environment.simg
  • Run:
    • apptainer exec --nv environment.sif python benchmarks/measure_speed.py
    • apptainer exec --nv environment.sif python benchmarks/plot_bandwidth.py

Code tools:

  • Type checking: mypy --ignore-missing-imports -p approx_topk
  • Formatting Python: ruff format **/*.py
  • Formatting CUDA: clang-format -i **/*.cu

License

Copyright (c) 2024 Graphcore Ltd and Oscar Key. Licensed under the MIT License.