TL;DR
DuckSearch
DuckSearch is a lightweight Python library designed for document search. Built on top of DuckDB, DuckSearch is a great fit if you're working with large datasets and need to search and filter results with a lexical index.
Key Features
- Indexing and searching in Hugging Face datasets directly from the dataset URL.
- Advanced filtering features with SQL filters.
- Streaming-friendly addition, deletion, and updates of documents.
- Minimal set of requirements.
- Efficient batching and multi-processing.
Install DuckSearch
Hugging Face dataset
One of the key features is the ability to seamlessly index datasets from Hugging Face. The following code uploads 3,000 samples from the FineWeb dataset and creates a dedicated BM25 index stored in the fineweb.duckdb file. We specify the fields we want to use for the search.
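To give a feel for what a BM25 index computes, here is a minimal pure-Python sketch of BM25 scoring. It is an illustration only and not DuckSearch's implementation, which runs inside DuckDB; the `tokenize` and `bm25_scores` helpers, the `k1` and `b` defaults, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real indexes also stem and drop stopwords).
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(document) for document in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, made up for this sketch.
documents = [
    "a remote island known for tourism",
    "notes on database indexing",
    "island hopping guide",
]
scores = bm25_scores("island tourism", documents)
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The document that matches both query terms ranks first, and the document that matches neither gets a score of zero.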
Once we have uploaded our dataset, we can start searching:
Let's search for samples in FineWeb that mention "island", published after 2010, and that contain "tourism" in their URL.
Upload documents
We can upload documents into the index where documents are a list of Python dictionaries.
DuckSearch supports advanced filtering via SQL expressions. Since it uses DuckDB, we can leverage its full range of SQL functions to filter search results.
For example, let's search for science fiction movies released after 2000 with a rating higher than 8.5.
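The movie filter described above can be written as a single SQL boolean expression. The sketch below evaluates such an expression with Python's standard-library `sqlite3` module, used here purely as a dependency-free stand-in for DuckDB; it does not call DuckSearch. The table name, column names, and movie records are hypothetical and invented for illustration.

```python
import sqlite3

# Documents as a list of Python dictionaries (made-up records).
documents = [
    {"id": 0, "title": "Example Film A", "category": "science-fiction", "year": 2010, "rating": 8.8},
    {"id": 1, "title": "Example Film B", "category": "science-fiction", "year": 2014, "rating": 8.6},
    {"id": 2, "title": "Example Film C", "category": "science-fiction", "year": 1999, "rating": 8.7},
    {"id": 3, "title": "Example Film D", "category": "science-fiction", "year": 2016, "rating": 7.9},
    {"id": 4, "title": "Example Film E", "category": "action", "year": 2008, "rating": 9.0},
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE movies ("
    "id INTEGER PRIMARY KEY, title TEXT, category TEXT, year INTEGER, rating REAL)"
)
# Named placeholders map each dictionary key to a column.
connection.executemany(
    "INSERT INTO movies VALUES (:id, :title, :category, :year, :rating)",
    documents,
)

# The filter is an ordinary SQL boolean expression over the document fields.
filters = "category = 'science-fiction' AND year > 2000 AND rating > 8.5"

# String interpolation is acceptable here only because the filter is a trusted constant.
rows = connection.execute(
    f"SELECT id, title FROM movies WHERE {filters} ORDER BY rating DESC"
).fetchall()
print(rows)
connection.close()
```

Only the two science fiction records released after 2000 with a rating above 8.5 are returned. In a search engine, a filter like this would be combined with the relevance ranking instead of replacing it.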
Delete documents
After removing documents, we can upload new or updated documents. The BM25 index will automatically update to reflect these changes.
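To illustrate why index statistics have to follow deletions and updates, here is a small in-memory sketch. `TinyIndex` is a hypothetical class written for this example and is not part of DuckSearch; it only tracks the document frequencies that a BM25 scorer would rely on.

```python
from collections import Counter


class TinyIndex:
    """Toy index that keeps document frequencies in sync with its contents."""

    def __init__(self):
        self.documents = {}        # document id -> list of tokens
        self.doc_freq = Counter()  # term -> number of documents containing it

    def upload(self, documents):
        for document in documents:
            # Uploading an existing id replaces the old version (an update).
            self.delete([document["id"]])
            tokens = document["text"].lower().split()
            self.documents[document["id"]] = tokens
            self.doc_freq.update(set(tokens))

    def delete(self, ids):
        for doc_id in ids:
            tokens = self.documents.pop(doc_id, None)
            if tokens is None:
                continue  # unknown id, nothing to remove
            self.doc_freq.subtract(set(tokens))
        # Unary plus drops terms whose count fell to zero.
        self.doc_freq = +self.doc_freq


index = TinyIndex()
index.upload([
    {"id": 0, "text": "island tourism"},
    {"id": 1, "text": "island weather"},
])
index.delete([0])                                      # remove a document
index.upload([{"id": 1, "text": "mountain weather"}])  # update a document
print(dict(index.doc_freq))
```

After the deletion and the update, the term "island" no longer appears in the statistics, so later scoring reflects only the documents that remain.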
End note
We designed DuckSearch at LightOn to improve the training of our information retrieval language models via negative mining. DuckSearch is released under the MIT license.
For more information and detailed usage examples, you can visit the documentation or the GitHub repository.
DuckDB is an in-process SQL OLAP database management system built for high-performance analytical workloads. It supports parallel query execution, advanced indexing, and memory-efficient processing, making it great for scalable document search and filtering on large datasets.