EDiTh: Sharing the Data Nobody Shares

Meet EDiTh, a synthetic enterprise corpus built to benchmark what public datasets can't touch: 1,004 PDFs in six languages and 36 ground-truth use cases.

April 27, 2026

TL;DR

EDiTh is an open benchmark for enterprise document retrieval, released today on Hugging Face. It is built around Véracier Industries, a fictional €1.8B French industrial group whose document estate reproduces the multi-entity, multi-language, multi-format complexity of a real multinational.

Dataset: https://huggingface.co/datasets/lightonai/veracier-industries
1,004 unique PDFs · 1.7 GB · 6 languages · 3 formats
36 use cases with full answer keys, organized by stakeholder role
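For readers who want a local copy, here is a minimal sketch of one way to fetch the corpus and count its PDFs. It assumes the `huggingface_hub` package is installed and that the repository ID matches the dataset URL above. The helper names `fetch_corpus` and `count_pdfs` are illustrative, and nothing is assumed about the folder layout inside the repository.

```python
from pathlib import Path


def fetch_corpus(repo_id: str = "lightonai/veracier-industries") -> Path:
    """Download the dataset snapshot and return its local path."""
    # Imported lazily so count_pdfs() works without the package installed.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" is needed because this is not a model repository.
    return Path(snapshot_download(repo_id=repo_id, repo_type="dataset"))


def count_pdfs(root: Path) -> int:
    """Count PDF files anywhere under root, ignoring extension case."""
    return sum(
        1 for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `count_pdfs(fetch_corpus())` gives a number that can be compared with the 1,004 unique PDFs stated above.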

Why this benchmark exists

Important Notice

This dataset does not contain real documents. EDiTh is a fully synthetic corpus created for research and demonstration of retrieval capabilities only. Véracier Industries S.A., its subsidiaries (including Précis-Tec S.A.), employees, contracts, certifications, and all associated entities are fictional. Any resemblance to real companies, individuals, or agreements is coincidental. The dataset is intended solely as a benchmark for evaluating enterprise search and RAG systems. It must not be used as a source of factual information, legal reference, or business intelligence.

The EDiTh dataset exists to show customers what a modern retrieval pipeline actually delivers: not search results, but decisions grounded in their own documents. Our goal is for executives and management to get answers to their hardest questions, the kind that usually require a team, a week, and a stack of PDFs to resolve.

There's a structural reason this dataset had to exist. The documents that matter most inside a company (contracts, technical specs, regulatory filings, internal memos) are precisely the ones an independent software vendor (ISV) can never see. The sensitivity isn't about the content of any single document: most are individually innocuous. It comes from aggregation and context. A procurement memo, a supplier list, and a maintenance schedule are each unremarkable on their own; together, shared with an outside vendor for a proof of concept, they describe the operational posture of a business. That is why, in practice, almost nothing leaves the firewall. Demonstrating that a retrieval system works on this kind of content has traditionally required being embedded inside the customer, running pilots under NDA, and doing substantial integration work before anyone can judge whether the system is any good. That asymmetry is why public benchmarks have drifted toward academic corpora and generic QA: they're what's available, not what's representative.

EDiTh takes a different route. Rather than wait for access that will never come, we used Claude to generate a corpus of realistic enterprise documents, the kind of material that typically sits behind a firewall, along with the difficult questions an executive or analyst would actually ask of it, and the answers grounded in those documents. The result is a benchmark that looks like work, not like a test set. It lets us demonstrate retrieval quality on content that reflects how companies actually operate, without needing an NDA to do it.

Retrieval systems are typically benchmarked against academic metrics that measure ranking quality in the abstract. Those numbers rarely capture what matters inside a company: whether someone in finance, operations, or the executive team can move from a question to a defensible answer without leaving the document corpus. EDiTh closes that gap, reframing retrieval as what it should be: infrastructure for document-to-decision reasoning, not another leaderboard entry.

The deeper point is that the same logic that forced EDiTh to be synthetic is the logic that shapes how LightOn's Paradigm platform is deployed. If a handful of mundane documents become sensitive the moment they aggregate, then the system reasoning over the full corpus can't sit outside the perimeter either. Document-to-decision infrastructure has to run where the documents already live, inside the customer's environment, under the customer's control. EDiTh lets us prove the capability in public. Paradigm is how that capability shows up in production, without anything ever having to leave.

What you get

Today's release consists of the following components.

The document corpus. 1,004 PDFs spanning seven subsidiaries across five countries, with a further 320 documents inherited through the in-progress Précis-Tec acquisition. The corpus mirrors the composition observed in enterprise document stores: a majority of searchable PDFs, a significant minority of scanned files including handwritten annotations and visual degradation, and a mixed subset combining both. Six languages are represented, with substantial French, English, and German content and a smaller volume of bilingual, Italian, and Spanish material.

The use cases. 36 scenarios covering strategic, legal, regulatory, quality, defense, supply chain, and HR workflows. Each use case specifies a persona, a question, the expected retrieval set, the ground-truth answer, and the failure modes under evaluation. Use cases are designed to exercise capabilities that generic benchmarks do not measure: cross-entity reasoning, temporal reasoning, terminology variance across jurisdictions, and resilience to scanned and degraded inputs.
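To make the shape of a use case concrete, here is a minimal sketch of how one record and a simple retrieval check could be represented. The field names, the document IDs, and the recall metric are illustrative assumptions; the actual file format and scoring used in the release are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """One benchmark scenario; field names are illustrative assumptions."""
    persona: str
    question: str
    expected_docs: frozenset[str]        # expected retrieval set (document IDs)
    ground_truth: str                    # reference answer from the answer key
    failure_modes: tuple[str, ...] = ()  # failure modes under evaluation


def retrieval_recall(case: UseCase, retrieved: set[str]) -> float:
    """Fraction of the expected documents that were actually retrieved."""
    if not case.expected_docs:
        return 1.0  # nothing was expected, so nothing can be missed
    return len(case.expected_docs & retrieved) / len(case.expected_docs)


def missed_docs(case: UseCase, retrieved: set[str]) -> list[str]:
    """Expected documents that the system failed to retrieve, sorted."""
    return sorted(case.expected_docs - retrieved)


# Hypothetical record: the document IDs below are placeholders, not real files.
case = UseCase(
    persona="CPO",
    question="Which open POs are we exposed on?",
    expected_docs=frozenset({"doc-a", "doc-b", "doc-c", "doc-d"}),
    ground_truth="(reference answer from the answer key)",
    failure_modes=("scanned input",),
)
print(retrieval_recall(case, {"doc-a", "doc-c", "doc-x"}))  # 0.5
print(missed_docs(case, {"doc-a", "doc-c", "doc-x"}))       # ['doc-b', 'doc-d']
```

Recall against the expected retrieval set is only one possible check; it says nothing about whether the synthesized answer matches the ground truth.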

Dataset at a glance

Unique PDFs	1,004
Total disk size	1.7 GB
Use cases	36
Languages	6
Noise documents	5.5%
Answer keys	Full, for all 36 use cases

The scanned subset includes handwritten annotations, stamp marks, multi-column layouts, and visual degradation consistent with documents that have been photocopied across multiple generations. Each scanned document was rendered and, where applicable, physically printed and rescanned to produce realistic artifacts.

The bilingual subset matters more than its share of the corpus suggests. These are genuinely bilingual documents: French supplier contracts with English governing-law clauses, German procedural documents with English technical appendices. It is the kind of material every European multinational has and that nothing in current public RAG benchmarks represents.

How it's organized

The dataset is structured by stakeholder role. Each use case maps to a defined persona inside the Véracier universe: the executive who would own the question, the document set they would typically consult, and the failure modes specific to their context.

Nine roles are represented: CEO, CFO, CTO, General Counsel, CISO, CHRO, CPO, Quality Director, and Compliance Officer. Each role carries a different retrieval profile. The CFO use cases require temporal scope reasoning and threshold filtering. The CISO use cases require security classification extraction across IT asset inventories. The General Counsel use cases require cross-language clause detection, including exclusions that look like coverage until read carefully.

The baseline run

The reference results were produced using a single configuration: LightOn API as the retrieval and orchestration layer, with Claude Opus 4.6 as the reasoning model. Each use case was processed with a single API call. No external orchestration, fine-tuning, or human-in-the-loop intervention was involved.

Headline

Use cases completed	36 / 36
Total runtime	31 minutes
Average per use case	63 seconds
Output files generated	72
Avg documents retrieved per use case	76

Breakdown by subsidiary

Subsidiary	Use cases	Avg docs / UC
Véracier Industries S.A. (parent)	20	85
Véracier Aéro S.A.S.	2	56
Véracier Défense & Sécurité S.A.S.	2	59
Véracier Energie S.A.S.	2	59
Véracier GmbH	2	71
Véracier UK Ltd	2	65
Véracier Inc.	2	66
Véracier Maroc S.A.R.L.	2	65
Compliance (cross-entity)	2	82
Total	36	76
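As a quick consistency check, the totals row can be recomputed from the rows above it. The sketch below copies the figures from the table; because the per-row averages are already rounded, the weighted average is approximate.

```python
# (entity, use cases, average documents retrieved per use case), from the table
BREAKDOWN = [
    ("Véracier Industries S.A. (parent)", 20, 85),
    ("Véracier Aéro S.A.S.", 2, 56),
    ("Véracier Défense & Sécurité S.A.S.", 2, 59),
    ("Véracier Energie S.A.S.", 2, 59),
    ("Véracier GmbH", 2, 71),
    ("Véracier UK Ltd", 2, 65),
    ("Véracier Inc.", 2, 66),
    ("Véracier Maroc S.A.R.L.", 2, 65),
    ("Compliance (cross-entity)", 2, 82),
]

total_cases = sum(n for _, n, _ in BREAKDOWN)
# Weight each row's average by its number of use cases.
total_docs = sum(n * avg for _, n, avg in BREAKDOWN)
weighted_avg = total_docs / total_cases

print(total_cases, round(weighted_avg))  # 36 76
```

The parent entity accounts for 20 of the 36 use cases, so its higher per-use-case average dominates the overall figure.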

The 36 Use Cases

Thirty-six use cases. Each one is a question an executive actually brings to a meeting. The kind that usually requires a team, a week, and a stack of PDFs to resolve.

The corpus doesn't cooperate: documents in six languages, scanned files with handwritten amendments, contracts with exclusion clauses that read like coverage until you parse them carefully. Each use case specifies the persona, the question, the expected retrieval set, and the failure modes under evaluation.

A sample:

Persona	Question
CEO	Which inherited Précis-Tec contracts have change-of-control clauses, reference Russian entities, or contain uncommercial terms?
CFO	Which customer contracts have unflagged IFRS 15 revenue recognition implications?
CTO	An AeroValve AV-3000 leaked at Airbus. Which spec revision was used to build that unit in 2020?
General Counsel	Titanium supply crisis. Which supplier contracts have force majeure clauses actually covering supply-chain disruption?
CISO	Which IT systems process Confidentiel Défense or NATO classified information?
CHRO	Which employees have non-compete clauses? What jurisdictional applicability?
CPO	Forges de Bologne entered insolvency. Which open POs are we exposed on? Total exposure?
Quality Director	Is the production file for AeroValve lot 2024-0312 complete for the EASA audit?
Compliance Officer	New EU sanctions on Russian/Belarusian entities. Check all our contracts, payments, partnerships. Report by Monday.

How LightOn API solved the use cases

Each of the 36 use cases was resolved end-to-end by a single LightOn API call. The API handles the full pipeline natively: multi-language retrieval across the corpus, reranking, reading of the relevant document chunks, and synthesis of a structured answer with citations, page references, and confidence scores. No external orchestration, chaining framework, or retrieval code surrounds the call. The integration is a single HTTP request in and a structured response out.

This design is what makes the roughly one-minute average per use case achievable. The retrieval, reasoning, and citation logic are co-located inside the API, which eliminates the round-trip overhead of separating them across services, and allows the agent to iterate searches and re-read chunks without network latency accumulating between stages. The same API is the interface through which LightOn customers run their own document-to-decision workflows in production.
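The "single HTTP request in, structured response out" pattern can be sketched as follows. Everything specific here is a placeholder: the endpoint URL, the request body, and the response fields (`answer`, `confidence`, `citations`) are hypothetical and are not LightOn's actual API. The sketch only shows the shape of the integration, and it builds the request without sending it.

```python
import json
import urllib.request
from dataclasses import dataclass

# Placeholder endpoint: NOT LightOn's real URL or schema.
ENDPOINT = "https://api.example.invalid/v1/answer"


@dataclass
class Citation:
    document: str
    page: int


@dataclass
class Answer:
    text: str
    confidence: float
    citations: list[Citation]


def build_request(question: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the single POST request for one use case."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def parse_answer(payload: str) -> Answer:
    """Parse a JSON response in the hypothetical shape assumed above."""
    raw = json.loads(payload)
    return Answer(
        text=raw["answer"],
        confidence=float(raw["confidence"]),
        citations=[Citation(c["document"], int(c["page"])) for c in raw["citations"]],
    )
```

In a real integration the request would be sent with `urllib.request.urlopen` or an HTTP client of choice, and the response body passed to a parser matching the API's documented schema.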
