DuckSearch: search through Hugging Face datasets

DuckSearch is a lightweight Python library built on DuckDB, designed for efficient document search and filtering with Hugging Face datasets and standard documents.

October 3, 2024

TL;DR

DuckSearch

DuckSearch is a lightweight Python library designed for document search. Built on top of DuckDB, DuckSearch is a great fit if you're working with large datasets and need to search and filter results with a lexical index.

Key Features

Indexing and searching in Hugging Face datasets directly from the dataset URL.
Advanced filtering features with SQL filters.
Streaming-friendly addition, deletion, and updates of documents.
Minimal set of requirements.
Efficient batching and multi-processing.

Install DuckSearch


pip install ducksearch

Hugging Face dataset

One of the key features is the ability to seamlessly index datasets from Hugging Face. The following code uploads 1,000 samples from the FineWeb dataset and creates a dedicated BM25 index stored in the fineweb.duckdb file. We specify the fields we want to use for the search.


from ducksearch import upload

upload.documents(
    database="fineweb.duckdb",
    key="id",
    fields=["text", "url", "date", "language", "token_count", "language_score"],
    documents="https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/sample/10BT/000_00000.parquet",
    dtypes={
        "date": "DATE",
        "token_count": "INT",
        "language_score": "FLOAT",
    },
    limit=1000,  # demonstrate with a small dataset
)
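
For readers unfamiliar with BM25, the ranking behind such an index can be sketched in a few lines of plain Python. This is the textbook Okapi BM25 formula with common default parameters (k1=1.2, b=0.75) and naive whitespace tokenization. It is not DuckSearch's or DuckDB's actual implementation, and the function name bm25_scores and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the textbook Okapi BM25 formula."""
    if not documents:
        return []

    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Illustrative documents, not taken from FineWeb.
docs = [
    "volcanic island formed by an eruption",
    "earth science studies the planet",
    "a recipe for chocolate cake",
]
scores = bm25_scores("volcanic island", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Documents that share no term with the query score zero, which is why a lexical index like this one pairs well with SQL filters on the other fields.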

Once we have uploaded our dataset, we can start searching:


from ducksearch import search

search.documents(
    database="fineweb.duckdb",
    queries=["earth science", "volcanic island"],
    top_k=10,
)
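
A small helper can make the search output easier to read. The sketch below assumes that search.documents returns one list of hits per query, in the same order as the queries, and that each hit is a dictionary holding the document fields plus a "score" entry. That shape is an assumption to check against the documentation. The helper name top_hits and the sample data are hypothetical stand-ins, so the snippet runs without a database.

```python
def top_hits(results, queries, n=3):
    """Pair each query with its n best hits as (id, score) tuples."""
    summary = {}
    for query, hits in zip(queries, results):
        # Sort defensively in case the hits are not already ordered by score.
        ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
        summary[query] = [(hit["id"], hit["score"]) for hit in ranked[:n]]
    return summary


# Hand-written stand-in for the value returned by search.documents
# (illustrative ids and scores, not real FineWeb results).
queries = ["earth science", "volcanic island"]
results = [
    [{"id": "a", "score": 4.2}, {"id": "b", "score": 6.1}],
    [{"id": "c", "score": 3.3}],
]
summary = top_hits(results, queries, n=1)
print(summary)
```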

Let's search for samples in FineWeb that mention "island", published after 2010, and that contain "tourism" in their URL.


from ducksearch import search

search.documents(
    database="fineweb.duckdb",
    queries=["island"],
    top_k=10,
    filters="url like '%tourism%' AND YEAR(date) > 2010"
)

Upload documents

We can also upload our own documents, provided as a list of Python dictionaries.


movies = [
    {
        "id": 1,
        "title": "The Matrix",
        "genre": "sci-fi",
        "release_date": "1999-03-31",
        "rating": 8.7,
    },
    {
        "id": 2,
        "title": "Inception",
        "genre": "sci-fi, thriller",
        "release_date": "2010-07-16",
        "rating": 8.8,
    },
    {
        "id": 3,
        "title": "The Godfather",
        "genre": "crime, drama",
        "release_date": "1972-03-24",
        "rating": 9.2,
    },
]


from ducksearch import upload

upload.documents(
    database="movies.duckdb",
    key="id",
    fields=["title", "genre", "release_date", "rating"],
    documents=movies,
    dtypes={
        "release_date": "DATE",
        "rating": "FLOAT",
    }
)

For example, let's search for science fiction movies released after 2000 with a rating higher than 8.5.


from ducksearch import search

results = search.documents(
    database="movies.duckdb",
    queries=["sci-fi"],
    top_k=5,
    filters="YEAR(release_date) > 2000 AND rating > 8.5"
)

DuckSearch supports advanced filtering via SQL expressions. Since it uses DuckDB, we can leverage its full range of SQL functions to filter search results.
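
To make the filter semantics concrete, here is a plain-Python equivalent of the SQL expression used above, applied to the same three movies. It only mirrors the filter step and leaves out the BM25 match on "sci-fi". The helper name passes_filter is hypothetical and is not part of DuckSearch.

```python
from datetime import date

# Same movies as in the upload example, reduced to the fields the filter reads.
movies = [
    {"id": 1, "title": "The Matrix", "release_date": "1999-03-31", "rating": 8.7},
    {"id": 2, "title": "Inception", "release_date": "2010-07-16", "rating": 8.8},
    {"id": 3, "title": "The Godfather", "release_date": "1972-03-24", "rating": 9.2},
]


def passes_filter(movie):
    """Python equivalent of: YEAR(release_date) > 2000 AND rating > 8.5"""
    year = date.fromisoformat(movie["release_date"]).year
    return year > 2000 and movie["rating"] > 8.5


kept = [movie["title"] for movie in movies if passes_filter(movie)]
print(kept)
```

Only "Inception" satisfies both conditions: "The Matrix" and "The Godfather" were released before 2001.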

Delete documents

We can delete documents from the index by passing their ids.


from ducksearch import delete

delete.documents(
    database="movies.duckdb",
    ids=[1, 2]
)

After removing documents, we can upload new or updated documents. The BM25 index will automatically update to reflect these changes.

End note

We designed DuckSearch at LightOn to improve the training of our information retrieval language models via negative mining. DuckSearch is released under the MIT license.

For more information and detailed usage examples, you can visit the documentation or the GitHub repository.

DuckDB is an in-process SQL OLAP database management system built for high-performance analytical workloads. It supports parallel query execution, advanced indexing, and memory-efficient processing, making it great for scalable document search and filtering on large datasets.


@misc{DuckSearch,
  title={DuckSearch: Efficient Search with DuckDB},
  author={Sourty, Raphael},
  url={https://github.com/lightonai/ducksearch},
  year={2024}
}
