Research And Development

R&D Overview

Advancing Generative AI through Innovation

The R&D team at LightOn advances the field of generative AI through continuous innovation and development. Its expertise covers creating and fine-tuning the large language models (LLMs) that form the backbone of the Paradigm platform, a comprehensive AI solution designed for enterprise use. The platform simplifies the integration of generative AI into business workflows and is available both on-premise and in the cloud, giving businesses flexibility and scalability.

Recent R&D Posts

LightOn Releases GTE-ModernColBERT, First State-of-the-Art Late-Interaction Model Trained on PyLate!

LightOn is proud to announce the release of GTE-ModernColBERT, our new state-of-the-art, open-source, multi-vector retrieval model. By leveraging the ModernBERT architecture and our PyLate library, we've created a solution that sets a new milestone in the field and addresses the complex challenges of modern enterprise information retrieval.

April 30, 2025

Finally, a Replacement for BERT

This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models that improve on older-generation encoders across the board.

December 19, 2024

MonoQwen-Vision, the first visual document reranker

We introduce MonoQwen2-VL-v0.1, the first visual document reranker, which improves the quality of retrieved visual documents and takes these pipelines to the next level. Reranking a small number of candidates with MonoQwen2-VL-v0.1 achieves top results on the ViDoRe leaderboard.

November 7, 2024

FC-AMF-OCR Dataset: LightOn releases a 9.3-million-image OCR dataset to improve real-world document parsing

With over 9.3 million annotated images, this dataset offers researchers and AI developers a valuable resource for creating models adapted to real-world documents.

September 20, 2024

PyLate: Flexible Training and Retrieval for ColBERT Models

We release PyLate, a new user-friendly library for training and experimenting with ColBERT models, a family of models that exhibit strong retrieval capabilities on out-of-domain data.

August 29, 2024

ArabicWeb24: Creating a high-quality Arabic Web-only pre-training dataset

August 7, 2024

Transforming LLMs into Agents for Enterprise Automation

Developing agentic capabilities for LLMs to automate business workflows and create smart assistants.

June 25, 2024

Passing the Torch: Training a Mamba Model for Smooth Handover

We present our explorations on training language models based on the new Mamba architecture, which deviates from the traditional Transformer architecture.

April 10, 2024

LightOn AI Meetup: Creating a Large Dataset for Pretraining LLMs

March 22, 2024

Explore Publications by LightOn
