
PyLate: Flexible Training and Retrieval for ColBERT Models

August 29, 2024

TL;DR

We are releasing PyLate, a new user-friendly library for training and experimenting with ColBERT models, a family of models that exhibit strong retrieval capabilities on out-of-domain data. Built on the Sentence Transformers framework, the library is designed to make training and experimenting with ColBERT models more accessible to researchers and practitioners, and to accelerate research in this domain.

ColBERT models are strong out-of-domain retrievers

An important part of retrieval-augmented generation tools such as Paradigm is searching the user's document collection to find relevant information to answer a query.
A popular architecture for searching large databases is the bi-encoder, which pre-computes a representation of every document in order to index them. At search time, the query is encoded and the most similar documents are found with a lightweight similarity computation between the query representation and the pre-computed document representations. This setup enables fast search even in large databases.
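To make this concrete, here is a minimal bi-encoder retrieval sketch using the Sentence Transformers library; the model name and the toy corpus are placeholders for illustration, not a recommendation:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model and corpus, for illustration only.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "PyLate is a library for training ColBERT models.",
    "Bi-encoders pool all token representations into one vector.",
]

# Indexing time: pre-compute one vector per document.
document_embeddings = model.encode(documents, convert_to_tensor=True)

# Search time: encode the query, then rank documents by cosine similarity.
query_embedding = model.encode("How do I train a ColBERT model?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, document_embeddings)  # shape: (1, num_documents)
print(documents[scores.argmax().item()])
```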
ColBERT models are another type of encoder also used for retrieval tasks. Unlike traditional bi-encoders, which pool all token representations into a single vector, ColBERT models retain every token representation and use late interaction (MaxSim) to compute query/document similarity.

Illustration of the late interaction mechanism used by ColBERT models. Credit: the ColBERT paper.
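Concretely, MaxSim compares every query token to every document token, keeps the best-matching document token for each query token, and sums these maxima. A minimal PyTorch sketch, assuming L2-normalized token embeddings so that dot products are cosine similarities:

```python
import torch

def maxsim(query_tokens: torch.Tensor, document_tokens: torch.Tensor) -> torch.Tensor:
    """Late interaction score between one query and one document.

    query_tokens: (num_query_tokens, dim), document_tokens: (num_doc_tokens, dim),
    both assumed L2-normalized so the dot product is cosine similarity.
    """
    # Token-level similarity matrix: one row per query token, one column per document token.
    similarities = query_tokens @ document_tokens.T
    # For each query token, keep its best-matching document token, then sum over query tokens.
    return similarities.max(dim=1).values.sum()
```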

This subtle yet crucial difference leads to strong results and impressive generalization to Out-Of-Domain (OOD) data, i.e., the models work well on data that differs from what they were trained on. This is a very important property: it is unrealistic to expect the training data to cover every domain the model will be used on, and poor OOD generalization is the root of many failure cases. Since customer data is most likely unique and usually not leveraged to train the models, strong OOD performance is crucial to build a pipeline that works on all users' data.

Introducing PyLate

Despite the growing interest in late interaction models, there has been little research and few released models in this area.

We are releasing PyLate to fill this gap: a modular library built on the widely-used Sentence Transformers framework, making it both accessible and familiar to many users.

PyLate comes with various features (a few of them are sketched in the code examples after this list):

  1. An interface that is accessible for newcomers and familiar to experienced users
  2. Support for any triplet dataset available on Hugging Face for contrastive learning as well as any encoder as the base model
  3. Training of the model using knowledge distillation to get the best performance
  4. Multi-GPU and FP16/BF16 training with Weights & Biases logging for efficient and trackable training
  5. An HNSW index to store and query documents, as well as reranking of first-stage retrieval pipeline results without building an index
  6. Support for document token pooling to reduce the memory footprint of the index without degrading performance
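As a taste of the workflow, here is a minimal contrastive training sketch following the Sentence Transformers trainer pattern that PyLate builds on. The dataset name, hyperparameters, and exact signatures are illustrative assumptions based on the PyLate documentation, not the definitive API:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models, utils

# Any encoder from the Hugging Face Hub can serve as the base model.
model = models.ColBERT(model_name_or_path="bert-base-uncased")

# Any (query, positive, negative) triplet dataset works for contrastive
# learning; this dataset name is a placeholder.
train_dataset = load_dataset("sentence-transformers/msmarco-bm25", "triplet", split="train")

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(output_dir="output", fp16=True),
    train_dataset=train_dataset,
    loss=losses.Contrastive(model=model),
    # Assumed collator; check the PyLate docs for the exact signature.
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```

And an indexing and retrieval sketch, again with placeholder names and under the assumption that the public API follows this pattern:

```python
from pylate import indexes, models, retrieve

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")

# HNSW index storing the multi-vector document representations.
index = indexes.Voyager(index_folder="pylate-index", index_name="index")

documents = ["PyLate trains ColBERT models.", "Late interaction retains token embeddings."]
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=["0", "1"], documents_embeddings=documents_embeddings)

# Encode the query and retrieve the top-k documents from the index.
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["how to train a colbert model?"], is_query=True)
scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
```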

Using PyLate, we plan to release various strong late interaction models (English and multilingual) in the near future and to integrate them into Paradigm to enhance the quality of our retrieval pipeline on customer data.

If you are interested in building and exploring this promising line of research, you will find documentation and examples in the GitHub repository: https://github.com/lightonai/pylate

We already have many features planned for the future, but we also welcome contributions, so feel free to check the contribution guide and help us build better retrieval models!
