LightOn Demonstrates the Flexibility of Its OCR Model by Adapting It to Arabic Through Targeted Training

LightOn demonstrates the flexibility of LightOnOCR-2, its document understanding model, by adapting it to Arabic through fine-tuning.

June 12, 2026

TL;DR

PARIS, [12 June 2026] - LightOn demonstrates the flexibility of LightOnOCR-2, its document understanding model, by adapting it to Arabic through fine-tuning. This extension is based on an internal synthetic data generation pipeline designed to cover languages that remain underrepresented in the OCR tools currently available on the market.

This demonstration is based on a dataset comprising 12,000 synthetic pages and their reference transcriptions, produced using a modified version of LightOn’s synthetic document generator.

The corpus covers a wide range of document scenarios, including scanning artifacts, font variations, resolution levels, and document types. The output format remains the one used to train the bbox variant of LightOnOCR-2, with bounding box detection that associates text with its spatial location.

Applying OCR to Arabic presents specific challenges. The script is written from right to left, characters are connected in cursive form, and open datasets, like specialized models, remain less widely available than for Latin-based languages. For organizations processing archives, administrative, legal, or heritage documents in Arabic, these limitations can slow the automation of document workflows.

This demonstration is part of a broader effort to extend the model to a variety of domains, as reflected by its more than 3 million downloads and the fine-tunings already carried out by the community. It notably addresses needs encountered in the Middle East, where LightOn is already working with public- and private-sector organizations. This development is consistent with LightOn’s positioning: providing enterprise-grade generative AI building blocks that are open, controllable, and tailored to sensitive environments.

LightOn is making the guides needed to reproduce this fine-tuning available on its Hugging Face space, with the aim of making this approach accessible to as many users as possible and adaptable to other document contexts.

LightOnOCR-2 is released as open source under the Apache 2.0 license. It plays a central role in the production document ingestion process within LightOn Console, LightOn’s self-service offering.
The open model and our production engine are therefore built on the same technological foundation. The base model achieves a score of 83.2% on OlmOCR-Bench.

Ready to Transform Your Enterprise?

TL;DR

Recent Blogs

Introducing LightOn-rerank: SOTA multimodal LLM reranker

Faceted search has entered the agentic era

Fewer Tokens, Better Answers: Give Your Agent Search, Not Raw Files

Ready to Transform Your Enterprise?