Monday, June 15, 2026
HomeArtificial IntelligenceMeet Flash-KMeans: An IO-Conscious, Precise Okay-Means That Runs Over 200× Quicker Than...

Meet Flash-KMeans: An IO-Conscious, Precise Okay-Means That Runs Over 200× Quicker Than FAISS on GPUs

k-means has been an offline device for many years. You run it as soon as to preprocess information, then transfer on. A group of researchers from UC Berkeley and UT Austin launched Flash-KMeans, a brand new open-source library that targets a distinct setting. Trendy AI pipelines now name k-means inside coaching and inference loops. At that frequency, latency per name issues greater than theoretical FLOPs.

Flash-KMeans is an IO-aware implementation of normal Lloyd’s k-means. It doesn’t change the mathematics, and it doesn’t approximate. It solely restructures how the algorithm strikes information on a GPU. On an NVIDIA H200, the analysis group reported as much as 17.9× end-to-end speedup over the very best baseline. Towards NVIDIA cuML they report 33×. Towards FAISS they report over 200×.

What’s Flash-KMeans

Flash-KMeans is a batched k-means library written in Triton GPU kernels. It ships below Apache 2.0 and installs with pip set up flash-kmeans.

The output is mathematically similar to plain Lloyd’s k-means. The speedup comes from kernel-level dataflow, not from skipping work. That separates it from algorithmic strategies like triangle-inequality pruning or coreset sampling.

A normal Lloyd iteration has two levels. The task stage computes every level’s distance to each centroid, then picks the closest. The replace stage averages the factors in every cluster to kind new centroids. Each levels are easy arithmetic. On GPUs, each are bottlenecked by reminiscence, not compute.

The Two Bottlenecks It Assaults

The first bottleneck is the task stage. Commonplace code builds a full distance matrix D of form N×Okay in Excessive Bandwidth Reminiscence (HBM). It writes the matrix, then reads it again to run argmin. For N=65536, Okay=1024, d=128, B=32, the gap math takes 2.6ms. Writing and consuming D takes about 23ms. The matrix is the price, not the arithmetic.

Flash-KMeans replaces this with FlashAssign. The design borrows from FlashAttention. FlashAssign streams tiles of factors and centroids from HBM into on-chip SRAM. It fuses distance computation with a web based argmin. The total N×Okay matrix is rarely materialized. This cuts the dominant IO complexity from O(NK) to O(Nd + Kd). On the kernel stage, FlashAssign reaches as much as 21.2×. In a single case it reduce task from 122.5ms to five.8ms.

The second bottleneck is the centroid replace stage. Commonplace code makes use of scatter-style atomic provides. Every thread provides its level right into a shared sum buffer keyed by cluster id. Many threads hit the identical ‘sizzling’ cluster directly. That causes atomic competition and {hardware} serialization. The analysis group measured solely 50 GB/s efficient bandwidth right here on an H200.

Flash-KMeans replaces this with Kind-Inverse Replace. It types the 1D task vector by cluster id utilizing argsort. Similar cluster ids then kind contiguous segments. Every thread block reduces a phase on-chip, then points one atomic add per phase. The heavy level matrix is rarely bodily permuted. Atomic operations drop from (O((Okay+NBN)d))(O((Okay + frac{N}{B_N})d)) . The kernel reaches as much as 6.3×.

Benchmark

The analysis group take a look at it on an H200 with CUDA 12.8, FP16 information, and d=128. They sweep N, Okay, and batch measurement B. They examine in opposition to 4 optimized baselines: fast_pytorch_kmeans, fastkmeans, cuML, and FAISS.

Comparability Reported speedup Workload context
Finish-to-end vs greatest baseline as much as 17.9× N=8M, Okay=1024 (massive N, small Okay)
vs NVIDIA cuML 33× {industry} library
vs FAISS over 200× {industry} library
FlashAssign kernel as much as 21.2× N=1M, Okay=8192 (task)
Kind-Inverse Replace kernel as much as 6.3× N=33M, Okay=4096 (replace)
Out-of-core, massive scale as much as 10.5× N=400M, Okay=16384 vs fastkmeans

One failure mode issues for context. Commonplace PyTorch implementations run out of reminiscence in large-Okay regimes. They can not materialize the N×Okay matrix. FAISS is the industry-standard library below many manufacturing vector-search methods.

The library additionally runs out-of-core. On one billion factors (Okay=32768, d=128), it finishes an iteration in 41.4s, in opposition to 261.8s for the baseline. It makes use of chunked stream overlap to cover PCIe switch behind compute. A cache-aware compile heuristic additionally cuts tuning overhead by as much as 175×, inside 0.3% of tuned velocity.

MTP Interactive Explainer

Marktechpost · Interactive Explainer

Flash-KMeans: precise k-means, rebuilt round GPU reminiscence

Similar Lloyd’s math as commonplace k-means — sooner solely due to dataflow. Run clustering stay, watch the replace bottleneck, and measurement the IO it removes.

17.9×end-to-end vs greatest baseline

33×vs NVIDIA cuML

200×+vs FAISS

1Bfactors, out-of-core

1 · Reside clustering

2 · Replace competition

3 · IO calculator





Iteration0

Centroid shift

Standingidle

This runs actual Lloyd’s k-means in your browser on 2-D factors. The algorithm is similar to what Flash-KMeans accelerates — solely the GPU dataflow differs. Every step = one task + one centroid replace.

Press play. Commonplace scatter-update serializes when blocks write the identical “sizzling” centroid (crimson stalls). Kind-Inverse Replace types cluster IDs first, so every block merges contiguous segments with one atomic add — no battle.


Commonplace atomicsO(N·d)

Kind-Inverse atomicsO((Okay+N/B)·d)

Measured std bandwidth50 GB/s

Kernel speedup6.3×

Commonplace updates subject one atomic add per token. Many threads hit the identical centroid directly, inflicting competition. Sorting by cluster ID turns scatters into segment-level reductions in on-chip reminiscence.

Commonplace — materialize N×Okay matrix, O(NK)

FlashAssign — stream inputs, O(Nd+Kd)

much less HBM visitors for the task step (theoretical)

Use Circumstances

Quicker precise k-means modifications what you’ll be able to run on-line, not simply offline.

  • Vector search indexing: FAISS builds its search indices with k-means. Quicker k-means helps you to re-index as information shifts, as an alternative of rebuilding in a single day.
  • Sparse consideration routing: Routing Transformers and Tactic cluster tokens to route consideration. Millisecond k-means makes this viable contained in the inference loop.
  • KV-cache compression: ClusterKV clusters tokens in semantic house to compress the cache. Cheaper clustering makes per-layer, per-step compression sensible.
  • Low-bit KV quantization: Current strategies cluster KV entries into codebooks, repeatedly. Quicker clustering shrinks that preprocessing price.
  • Diffusion Transformers: Sparse VideoGen2 calls batched k-means throughout ahead passes. It permutes tokens by semantic similarity to use sparsity.

Utilizing It

The API mirrors faiss and sklearn. The decision beneath clusters a batched (B, N, d) tensor.

import torch
from flash_kmeans import batch_kmeans_Euclid

x = torch.randn(32, 75600, 128, system="cuda", dtype=torch.float16)
cluster_ids, facilities, _ = batch_kmeans_Euclid(
    x, n_clusters=1000, tol=1e-4, verbose=True
)

A scikit-learn-style interface can also be accessible.

from flash_kmeans import FlashKMeans

km = FlashKMeans(d=128, ok=8192, niter=100)
labels = km.fit_predict(large_cpu_tensor)  # system=None makes use of all seen GPUs

The kernel auto-dispatches by form and dtype. A small-D path handles d≤512. A split-D path handles bigger d with out materializing the gap matrix. Multi-GPU runs set off robotically for large-N information held in CPU reminiscence.

Key Takeaways

  • Flash-KMeans is precise, not approximate — similar Lloyd’s math, sped up purely by GPU dataflow.
  • FlashAssign fuses distance + on-line argmin, chopping task IO from O(NK) to O(Nd+Kd) — as much as 21.2×.
  • Kind-Inverse Replace types cluster IDs into segments, changing scatter atomics — as much as 6.3×.
  • Stories as much as 17.9× end-to-end, 33× over cuML, and over 200× over FAISS on an H200.
  • Scales out-of-core to at least one billion factors and cuts tuning overhead as much as 175×.

Try the Paper and RepoAdditionally, be happy to observe us on Twitter and don’t overlook to affix our 150k+ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you’ll be able to be part of us on telegram as properly.

Have to associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and many others.? Join with us


RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments