TL;DR
The emergence of very capable visual language models has enabled the creation of OCR-free RAG pipelines. These new pipelines are becoming the norm and are better suited to customers' data. We introduce MonoQwen2-VL-v0.1, the first visual document reranker, which enhances the quality of the retrieved visual documents and takes these pipelines to the next level. Reranking a small number of candidates with MonoQwen2-VL-v0.1 achieves top results on the ViDoRe leaderboard.
Standard textual pipelines
Traditional Retrieval Augmented Generation (RAG) approaches rely on complex pipelines to process the documents the user can query. These pipelines extract as much information as possible from the original document format (most often PDFs) into an easily searchable form, usually text. Such processing includes Optical Character Recognition (OCR) to extract text from embedded paragraphs, as well as layout detection and segmentation to correctly represent the original document hierarchy. Additionally, captioning of pictures, graphs, and tables is necessary to capture at least a glimpse of the original visual elements.
All of these textual elements are then split into chunks; when the user queries the RAG pipeline, the most similar chunks are retrieved and fed to the language model to help it generate grounded answers. Documents that are not plain text are cast to text because text search has been around for a long time and is quite mature, making it the easiest way to get good and efficient retrieval results. Besides, Large Language Models (LLMs) take text as input, so this conversion is needed for the generation step anyway.
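To make this concrete, here is a minimal sketch of the final search step of such a textual pipeline, assuming the chunks have already been produced by the OCR and captioning stages; the embedding model and chunks are illustrative placeholders.

```python
# A minimal sketch of text-chunk retrieval, assuming the chunks were already
# extracted by the OCR/layout/captioning steps described above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model

chunks = [
    "Revenue grew 12% year over year.",
    "Figure 3: quarterly revenue by region (caption generated from the graph).",
    "The board approved the 2024 budget.",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)  # precomputed once

query = "How did revenue evolve?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine-similarity search: the top chunks are fed to the LLM as context.
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
top = scores.topk(k=2)
for score, idx in zip(top.values, top.indices):
    print(f"{score:.3f}  {chunks[idx]}")
```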
Shortcomings of textual pipelines
Yet, all of this processing adds up and takes a non-negligible amount of time, which can become an issue when scaling the number of documents added regularly, or when adding more sophisticated processing.
In addition to this overhead, converting a document to text is often lossy: information that could be very useful in the retrieval and/or generation step is lost in the process. Small OCR errors may occur on a large block of text, but fortunately both the retrieval pipeline and the LLM are quite robust to these. However, elements that are not textual by nature are much more challenging to deal with.
Take the example of a graph: you would need a caption that captures everything, including colors, shapes, values, and labels.
The rise of Visual Language Models
Although textual language models have been the focus of attention in recent years, increasingly capable Visual Language Models (VLMs) are emerging. These models are very similar to the usual LLMs and work by auto-regressively completing a prompt, that is, predicting the next words in a sequence one after the other. The main difference is that they can take images as input in addition to the usual text prompts. They can thus be used to process an input image, for example to extract information from it, or to reason about it and answer questions.
This means that we can now feed a visually rich input directly into the VLM and ask a question about it without going through a lengthy and lossy processing step first, thereby enhancing the quality of the answer for elements that are not pure text. Although these models still perform worse than their text-only counterparts on pure text tasks, there is little doubt they will catch up as they draw more attention.
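As an illustration, here is a minimal sketch of querying a document page image directly with a VLM through Hugging Face transformers; the image path and question are placeholders, and generation parameters are left at their defaults.

```python
# A minimal sketch of asking a VLM a question about a document page directly,
# with no OCR step in between.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("report_page.png")  # a visually rich page: graphs, tables...
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What trend does the revenue graph show?"},
    ],
}]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Auto-regressive generation: the answer is produced token by token.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```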
Visual retrieval
Yet, a piece was still missing to enable purely visual RAG pipelines. While VLMs solve the Generation part of RAG, the Retrieval part still required extracting text in order to find the elements to feed to the VLMs.
This issue has been solved by the introduction of ColPali and Document Screenshot Embeddings (DSE). Without going into much detail (refer to this blogpost for a more in-depth explanation), these approaches leverage VLMs to create embeddings, i.e., continuous representations (vectors), of the documents from their visually rendered version, as well as of the textual queries. The retrieval model is trained so that the vectors of matching document page/query pairs are close, while those of unrelated pairs are far apart.
This allows searching for similar pages using traditional nearest-neighbor search, as in text-only pipelines, and benefiting from all the perks of these approaches (fast search, generalization, …).
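To contrast the two scoring schemes, here is a toy sketch with random tensors standing in for real embeddings: DSE produces a single vector per query and per page and scores with a dot product, while ColPali produces one vector per query token and per page patch and scores with the late-interaction MaxSim operator.

```python
# A toy contrast of the two separate-encoding scoring schemes. Shapes are
# illustrative; in practice the embeddings come from the retrieval models.
import torch

dim = 128

# DSE-style: one vector per query and per page.
q = torch.randn(dim)
page = torch.randn(dim)
dse_score = q @ page  # a single dot product

# ColPali-style: one vector per query token and per page patch.
q_tokens = torch.randn(12, dim)        # 12 query-token embeddings
page_patches = torch.randn(1024, dim)  # 1024 page-patch embeddings

# Late interaction (MaxSim): for each query token, keep its best-matching
# patch, then sum these maxima over the query tokens.
sim = q_tokens @ page_patches.T  # (12, 1024) token-patch similarities
colpali_score = sim.max(dim=1).values.sum()

print(dse_score.item(), colpali_score.item())
```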
These models enabled fully visual pipelines that never cast the documents as text, while still allowing high-quality retrieval and answering complex queries on visual content, as illustrated in our demo, with an experience that feels identical to what users are familiar with.
The missing part: the reranker
However, compared to existing text-based pipelines, an essential part is still missing: reranking.
Indeed, whether through DSE's simple dot products or ColPali's late interactions, the queries and the documents are encoded separately, without leveraging each other's information. This allows the representations of all the documents in a given database to be precomputed, enabling very fast search across a large pool of documents. Encoding the pair together allows for a more accurate representation, as it is easier to encode the relevant information of a document for a given query when the query is known. However, this approach does not scale: the representation of every element in the database would have to be recomputed with each incoming query, which is intractable once the number of documents grows (especially given the latency users will accept).
Thus, a two-stage pipeline is often used to trade off performance and efficiency: separate encoding first generates a small pool of plausible candidates, which are then scored by a more expensive cross-encoding model to produce the final, more accurate ranking.
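Schematically, the two stages fit together as in the sketch below, where `embed_query`, `embed_page`, and `rerank_score` are hypothetical stand-ins for the first-stage retriever and the cross-encoding reranker.

```python
# A sketch of the two-stage trade-off; the helper functions are hypothetical
# placeholders for a real first-stage retriever and reranker.
import numpy as np

def retrieve_then_rerank(query, pages, embed_query, embed_page, rerank_score, k=10):
    # Stage 1 (cheap, scalable): separate encoding + nearest-neighbor search.
    # Page embeddings would be precomputed offline; done inline here for brevity.
    q = embed_query(query)
    page_vecs = np.stack([embed_page(p) for p in pages])
    candidates = np.argsort(page_vecs @ q)[::-1][:k]

    # Stage 2 (expensive, accurate): score each (query, page) pair jointly,
    # but only over the k candidates instead of the whole database.
    reranked = sorted(candidates, key=lambda i: rerank_score(query, pages[i]),
                      reverse=True)
    return reranked
```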
This reranking step can be used to feed the elements to the generator in the correct order, mitigating the lost-in-the-middle effect (although VLMs seem less sensitive to it), or to filter out irrelevant elements using a score threshold. It is thus critical to the quality of the retrieval step, and hence to the quality of the generation (garbage in, garbage out).
MonoQwen2-VL-v0.1
At LightOn, we believe OCR-free pipelines will be very impactful in the future, especially for customer data that is often visually rich (lots of graphs and tables). We decided to fill this gap by creating a visual document reranker.
Originally, reranking was done using encoders (such as BERT) that apply bidirectional attention to the query/document pair, so that the representation of the query is influenced by the document and vice versa. Although this bidirectionality is expected to yield the best possible results thanks to its large capacity, another reranking approach (originally called MonoT5) uses generative language models instead: given the query/document pair in the prompt, the model is trained to generate the word "True" if the document is relevant to the query and "False" otherwise.
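In code, the MonoT5-style score can be read directly off the model's next-token logits. The sketch below uses a generic text-only causal LM and a placeholder prompt to illustrate the idea; it is not the exact setup used for MonoQwen2-VL.

```python
# A minimal sketch of MonoT5-style scoring with a generative model: the
# relevance score is the probability mass on "True" vs "False" for the next
# token. Model name and prompt wording are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def relevance(query: str, document: str) -> float:
    prompt = (f"Query: {query}\nDocument: {document}\n"
              "Is the document relevant? Answer True or False: ")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits

    true_id = tokenizer.encode("True", add_special_tokens=False)[0]
    false_id = tokenizer.encode("False", add_special_tokens=False)[0]
    # Softmax restricted to the two candidate tokens.
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()  # P("True") is used as the relevance score
```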
Although the attention of generative models is unidirectional, which is expected to offer less capacity, using them as rerankers comes with different benefits.
First, generative models have been extensively optimized, that is, trained on very large amounts of data for a very long time, and are thus very strong out-of-the-box, topping the retrieval leaderboards (although the comparison is not entirely fair given the cost difference). Second, given the very large capacity of language models, we can get strong results by training only a Low-Rank Adaptation (LoRA) on top of the base model. This means we can load a single large model, the one used for the final generation task, and apply different LoRAs to repurpose it for different tasks. Since the retrieval models (ColPali and DSE) are already LoRAs of VLMs, we can add the reranking step with another LoRA at no extra memory cost, compared to using a dedicated separate model.
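For instance, with the peft library, several task-specific adapters can be attached to one base model and swapped at will; the adapter identifiers below are placeholders.

```python
# A sketch of repurposing one base VLM with task-specific LoRAs so retrieval
# and reranking share the same weights in memory. Adapter IDs are placeholders.
import torch
from peft import PeftModel
from transformers import Qwen2VLForConditionalGeneration

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach a first adapter, then load a second one onto the same base model.
model = PeftModel.from_pretrained(base, "my-org/retrieval-lora",
                                  adapter_name="retrieval")
model.load_adapter("my-org/reranking-lora", adapter_name="reranking")

model.set_adapter("retrieval")  # embed queries and pages for the first stage
# ... first-stage search ...
model.set_adapter("reranking")  # rerank candidates with the same base weights
```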
All of this led to the logical conclusion of creating the very first iteration of this model, MonoQwen2-VL-v0.1, a LoRA of the Qwen2-VL-2B-Instruct model. Qwen2-VL is the backbone used for generation and retrieval in most vision pipelines right now. We trained the LoRA to perform reranking using the MonoT5 objective on the ColPali training set.
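Putting the pieces together, here is a sketch of how such a reranker can score one (query, page image) pair. The prompt wording below is our illustrative assumption of a MonoT5-style template; refer to the MonoQwen2-VL-v0.1 model card for the exact prompt and usage.

```python
# A sketch of scoring one (query, page image) pair with the reranker.
# The prompt is an assumption of the MonoT5-style objective, not the
# verbatim template; see the model card for exact usage.
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "lightonai/MonoQwen2-VL-v0.1")

query = "How did revenue evolve in 2023?"
image = Image.open("candidate_page.png")  # one of the top-10 retrieved pages

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": f"Is this page relevant to the query: {query}? Answer True or False."},
    ],
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

tok = processor.tokenizer
true_id = tok.encode("True", add_special_tokens=False)[0]
false_id = tok.encode("False", add_special_tokens=False)[0]
score = torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()
print(f"P(relevant) = {score:.3f}")
```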
Results
We evaluate this model on ViDoRe by reranking the top-10 elements retrieved by the DSE model based on Qwen2-VL (reported values are NDCG@5):
These results show that the performance gain from the reranking is substantial and allows new state-of-the-art results to be achieved on the ViDoRe benchmark.
Please note that the results of a reranker are not directly comparable to those of first-stage retrievers, and that both are required to get the best possible performance.
Similarly, the reported results depend heavily on the first-stage candidate pool being reranked. Using the current best first-stage retrievers (such as ColQwen) and reranking more than the top-10 candidates would likely improve the results further.
However, we note that the ViDoRe benchmark is already starting to get saturated, which calls for a more challenging benchmark.
Future work
Although these are excellent results, we consider this version only a v0.1 of our MonoQwen2-VL reranker: it is a preliminary release, and the results can be pushed further. We are already training more capable models by leveraging more and better-quality data, training for longer, and using better-suited objectives.