Multi-vector retrieval now on the infrastructure you already run

ColBERT-grade retrieval quality on the infrastructure teams already operate, at a fraction of the storage and serving cost.

June 16, 2026

TL;DR

Late interaction is the most accurate way to retrieve, but also the most expensive to serve. MUVERA and SMVE showed how to run ColBERT-style retrieval on existing infrastructure by generating candidates through standard HNSW or sparse indexes before reranking with full MaxSim. However, these methods degrade significantly on modern late-interaction models. We investigated why, and trained a regularized LateOn that restores compatibility with MUVERA and SMVE candidate generation. The result is ColBERT-grade retrieval quality on the infrastructure teams already operate, at a fraction of the storage and serving cost.

Why it matters

A ColBERT document is not represented by a single embedding but by a collection of token-level embeddings. That is what makes late interaction so accurate, and also what makes it difficult to serve at scale.

Retrieval is the foundation of any RAG or agentic system. Every mistake at retrieval propagates downstream, so retrieval quality matters, and late-interaction models such as ColBERT consistently remain among the strongest retrievers available.

The challenge is infrastructure. Multi-vector retrieval requires storing and searching token-level embeddings, making indexing and serving substantially more expensive than standard single-vector search.
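To give a feel for the gap, here is a back-of-the-envelope sketch of raw embedding storage. Every number in it is an assumption chosen for illustration (corpus size, tokens per document, dimensions, precision), not a measurement from this work, and it ignores index overhead and compression.

```python
def embedding_bytes(num_docs, vectors_per_doc, dim, bytes_per_value):
    """Raw embedding storage, ignoring index overhead and compression."""
    return num_docs * vectors_per_doc * dim * bytes_per_value

# Hypothetical corpus: 1M documents, 300 tokens per document on average.
NUM_DOCS = 1_000_000

# Assumed single-vector setup: one 768-d float32 vector per document.
single_vector = embedding_bytes(NUM_DOCS, 1, 768, 4)

# Assumed multi-vector setup: 300 token vectors of 128-d float16 per document.
multi_vector = embedding_bytes(NUM_DOCS, 300, 128, 2)

print(f"single-vector: {single_vector / 1e9:.1f} GB")
print(f"multi-vector:  {multi_vector / 1e9:.1f} GB")
print(f"ratio:         {multi_vector / single_vector:.0f}x")
```

Under these assumed figures the uncompressed multi-vector index is roughly 25 times larger than the single-vector one, which is why compression and candidate generation matter so much for late interaction.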

To address this, the community introduced methods such as MUVERA and SMVE.

MUVERA builds a fixed-dimensional dense representation from a document's token embeddings, enabling candidate generation through standard HNSW indexes. SMVE builds a sparse representation from the same token embeddings, enabling candidate generation through sparse retrieval infrastructure.
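As a rough illustration of the dense side, the sketch below builds a simplified fixed-dimensional encoding in the spirit of MUVERA: token embeddings are bucketed with random hyperplanes (SimHash), query tokens are summed per bucket, document tokens are averaged per bucket, and the buckets are concatenated. It is a simplification under stated assumptions: a single repetition, no inner projection, and no filling of empty buckets. The names `K_SIM`, `DIM`, `simhash_buckets` and `fde` are illustrative and are not taken from any library.

```python
import numpy as np

K_SIM = 3  # number of random hyperplanes, giving 2**K_SIM buckets (illustrative)
DIM = 8    # token embedding dimension (illustrative)

def simhash_buckets(vectors, hyperplanes):
    """Assign each vector a bucket id from the sign pattern of its projections."""
    bits = (vectors @ hyperplanes.T > 0).astype(np.int64)  # (n, K_SIM)
    weights = 1 << np.arange(hyperplanes.shape[0])          # 1, 2, 4, ...
    return bits @ weights                                   # (n,)

def fde(vectors, hyperplanes, is_query):
    """Simplified fixed-dimensional encoding of a set of token embeddings."""
    num_buckets = 2 ** hyperplanes.shape[0]
    buckets = simhash_buckets(vectors, hyperplanes)
    out = np.zeros((num_buckets, vectors.shape[1]))
    np.add.at(out, buckets, vectors)  # sum the vectors falling in each bucket
    if not is_query:
        # Documents use the per-bucket mean instead of the sum.
        counts = np.bincount(buckets, minlength=num_buckets)
        nonempty = counts > 0
        out[nonempty] /= counts[nonempty, None]
    return out.ravel()  # concatenate buckets into one flat vector

rng = np.random.default_rng(0)
hyperplanes = rng.standard_normal((K_SIM, DIM))
query = rng.standard_normal((4, DIM))   # 4 query token embeddings (random toy data)
doc = rng.standard_normal((20, DIM))    # 20 document token embeddings (random toy data)

q_fde = fde(query, hyperplanes, is_query=True)
d_fde = fde(doc, hyperplanes, is_query=False)
approx_score = float(q_fde @ d_fde)     # single dot product, usable with an HNSW index
print(q_fde.shape, approx_score)
```

The dot product of the two encodings equals the sum, over query tokens, of each token's similarity to the centroid of the document tokens sharing its bucket. That is the sense in which one dense vector per document can stand in for MaxSim during candidate generation.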

Both approaches use existing retrieval systems to generate a shortlist of candidates before reranking them with full ColBERT MaxSim scoring. In principle, this offers the best of both worlds: efficient retrieval infrastructure and late-interaction accuracy.
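The two-stage pattern can be sketched as follows. This is a minimal toy under stated assumptions: stage one uses a brute-force inner product over one vector per document as a stand-in for an HNSW or sparse index, and that vector is a plain mean of token embeddings rather than a MUVERA or SMVE representation. The function names and the random data are hypothetical.

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: best-matching doc token per query token, summed."""
    sim = query_tokens @ doc_tokens.T       # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

def retrieve(query_tokens, query_vec, doc_vecs, doc_tokens, shortlist_size, top_k):
    # Stage 1: cheap candidate generation over one vector per document.
    # Brute force here; a real deployment would query an existing index.
    coarse = doc_vecs @ query_vec
    shortlist = np.argsort(-coarse)[:shortlist_size]
    # Stage 2: rerank only the shortlist with full MaxSim.
    reranked = sorted(
        shortlist,
        key=lambda i: maxsim(query_tokens, doc_tokens[i]),
        reverse=True,
    )
    return [int(i) for i in reranked[:top_k]]

def unit(x):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Toy corpus: 50 documents with 5 to 11 random 16-d token embeddings each.
doc_tokens = [unit(rng.standard_normal((rng.integers(5, 12), 16))) for _ in range(50)]
# Placeholder single-vector representation (mean pooling, an assumption).
doc_vecs = np.stack([t.mean(axis=0) for t in doc_tokens])

# Toy query built from the first four token embeddings of document 3.
query_tokens = doc_tokens[3][:4]
query_vec = query_tokens.mean(axis=0)

print(retrieve(query_tokens, query_vec, doc_vecs, doc_tokens, shortlist_size=10, top_k=3))
```

The shortlist size is the main knob: MaxSim can only reorder what stage one returns, so a relevant document missed during candidate generation cannot be recovered by reranking.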

The catch is that these methods were originally developed and evaluated on earlier ColBERT models. When applied to modern late-interaction models, candidate generation quality drops dramatically.

In this work, we investigate why this happens and introduce a regularized LateOn training procedure that restores the effectiveness of MUVERA and SMVE. By directly optimizing the compressed representations used during candidate generation, we recover high-quality retrieval while continuing to rely on the dense and sparse infrastructure organizations already operate.

Technical Deep Dive

Late-interaction models (also known as ColBERT or multi-vector models) have shown very strong performance on a range of retrieval tasks, including out-of-domain, long-context, code, and agentic retrieval. Despite these results, they come with infrastructure challenges, because they require storing and searching the embeddings of every token in every document. Beyond the storage cost, they also rely on an index structure to generate a candidate set before gathering and decompressing the document token embeddings and applying the full MaxSim operation. This is typically handled with PLAID indexes, which we made easy and efficient to use for a broad audience through FastPlaid and NextPlaid.

Although those libraries enabled the use of ColBERT models at larger scale, PLAID can still be complicated to maintain, notably when it comes to updating the centroids and handling CRUD operations. We added some mechanisms to follow distribution shift in FastPlaid, but it still inherits the limitations of IVF-PQ: once the distribution shifts too much, the whole index has to be rebuilt. SPFresh offers a trade-off that makes updates easier, but it might face other scaling issues.
For these reasons, new methods have been proposed to generate candidates without relying on centroids.

Full breakdown 👉 here

Model card 👉 here
