Publications | LightOn

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala

Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting new state-of-the-art for model this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand allows to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as code used to train them.

How to Train Your Long-Context Visual Document Model

Austin Veselka

We present the first comprehensive, large-scale study of training long context vision language models up to 344K context, targeting long document visual question answering with measured transfer to long context text. While several such strong are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc for both parameter scales. In addition to this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc to reduce erroneous and low quality examples in the benchmark.

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin

We present \textbf{LightOnOCR-2-1B}, a 1B-parameter end-to-end multilingual vision--language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and \textbf{LightOnOCR-bbox-bench} evaluation under their respective licenses.

MCPyLate: MCP server using PyLate models for multi-vector search, PLAID.

Antoine Chaffin

Multi-vector search has shown very strong performance compared to single dense vector search in numerous domain, including out-of-domain, long-context and reasoning-intensive retrieval. They are thus particularly well suited for modern retrieval use cases, including agentic workflows. PyLate is library built on top of sentence-transformers that allows to easily train and use multi-vector models. This MCP server is a demonstration of the use of PyLate models alongside its index optimized for multi-vector search, PLAID.

FastPlaid: A High-Performance Engine for Multi-Vector Search

Raphaël Sourty

Traditional vector search relies on single, fixed-size embeddings (dense vectors) for documents and queries. While powerful, this approach can lose nuanced, token-level details.

Multi-vector search, used in models like ColBERT or ColPali, replaces a single document or image vector with a set of per-token vectors. This enables a "late interaction" mechanism, where fine-grained similarity is calculated term-by-term to boost retrieval accuracy.
Higher Accuracy: By matching at a granular, token-level, FastPlaid captures subtle relevance that single-vector models simply miss.
PLAID: stands for Per-Token Late Interaction Dense Search.
Blazing Performance: Engineered in Rust and optimized for GPUs.

‍

Reason-ModernColBERT: A late interaction model trained on the reasoning tasks

Antoine Chaffin

Reason-ModernColBERT is a late interaction model trained on the reasonir-hq dataset. It achieves extremely competitive performance on the BRIGHT benchmark aimed at evaluating reasoning-intensive retrieval performance, outperforming all existing models up to 7B (more than 45 times its size) and even surprisingly improving performance of ReasonIR-8B (a 8B model trained on the same data) by more than 2.5 NDCG@10

Pylate-rs: A Rust implementation of Pylate

Raphael Sourty

pylate-rs is a high-performance inference engine for PyLate models, meticulously crafted in Rust for optimal speed and efficiency.

While model training is handled by PyLate, which supports a variety of late interaction models, pylate-rs is engineered to execute these models at speeds.

Accelerated Performance: Experience significantly faster model loading and rapid cold starts, making it ideal for serverless environments and low-latency applications.
Lightweight Design: Built on the Candle ML framework, pylate-rs maintains a minimal footprint suitable for resource-constrained systems like serverless functions and edge computing.
Broad Hardware Support: Optimized for diverse hardware, with dedicated builds for standard CPUs, Intel (MKL), Apple Silicon (Accelerate & Metal), and NVIDIA GPUs (CUDA).
Cross-Platform Integration: Seamlessly integrate pylate-rs into your projects with bindings for Python, Rust, and JavaScript/WebAssembly.

For a complete, high-performance multi-vector search pipeline, pair pylate-rs with its companion library, FastPlaid, at inference time.

Explore our WebAssembly live demo.

Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation

Gautier Evennou, Antoine Chaffin, Vivien Chappelier, Ewa Kijak

The rise of the generative models quality during the past years enabled the generation of edited variations of images at an important scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is successfully handled for simple 3D rendered images, it struggles on real-world images. The reason is twofold: the training data-scarcity, and the difficulty to capture fine-grained differences between complex images. To address those issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show it outperforms two-streams approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high quality data, leading to a challenging new dataset well-suited for IDC named Syned1.

PyLate: Flexible Training and Retrieval for Late Interaction Models

Antoine Chaffin, Raphaël Sourty

Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments.

BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, Charlotta Lindvall

Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.

[Ettin] Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme

The large language model (LLM) community focuses almost exclusively ondecoder-only language models, since they are easier to use for text generation.However, a large subset of the community still uses encoder-only models for taskssuch as classification or retrieval. Previous work has attempted to compare thesearchitectures, but is forced to make comparisons with models that have differentnumbers of parameters, training techniques, and datasets. We introduce the SOTAopen-data ETTIN1suite of models: paired encoder-only and decoder-only modelsranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens.Using the same recipe for both encoder-only and decoder-only models producesSOTA recipes in both categories for their respective sizes, beating ModernBERT asan encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we findthat encoder-only models excel at classification and retrieval tasks while decodersexcel at generative tasks. However, we show that adapting a decoder model toencoder tasks (and vice versa) through continued training is subpar compared tousing only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder onMNLI, and vice versa for generative tasks). We open-source all artifacts of thisstudy including training data, training order segmented by checkpoint, and 200+checkpoints to allow future work to analyze or extend all aspects of training.

[ModernBERT] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

Publications de LightOn

ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

How to Train Your Long-Context Visual Document Model

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

MCPyLate: MCP server using PyLate models for multi-vector search, PLAID.

FastPlaid: A High-Performance Engine for Multi-Vector Search

Reason-ModernColBERT: A late interaction model trained on the reasoning tasks

Pylate-rs: A Rust implementation of Pylate

Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation

PyLate: Flexible Training and Retrieval for Late Interaction Models

BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

[Ettin] Seq vs Seq: An Open Suite of Paired Encoders and Decoders

[ModernBERT] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

No matching results found