Sparse Search Benchmark¶

Performance comparison between zerodep sparse_search, rank-bm25, and bm25s.

Test Environment

Implementations¶

Implementation	File/Package	Description
zerodep	`sparse_search.py`	stdlib-only BM25/TF-IDF engine with inverted index
rank-bm25	(reference)	Popular BM25 library backed by numpy
bm25s	(reference)	Fast BM25 library with eager sparse scoring (numpy + scipy)

Queries matching few documents — the inverted index's O(matched_docs) advantage is strongest here.

Corpus Size	zerodep	rank-bm25	bm25s
200 docs	3.1 μs	101.9 μs	28.8 μs
1000 docs	3.0 μs	379.3 μs	26.7 μs

Queries using high-frequency words that match most documents — numpy vectorization dominates.

Corpus Size	zerodep	rank-bm25	bm25s
1K docs	4.44 ms	325.8 μs	44.7 μs
10K docs	55.8 ms	3.87 ms	104.5 μs

Queries using rare words on a realistic vocabulary distribution.

Corpus Size	zerodep	rank-bm25	bm25s
1K docs	17.5 μs	309.4 μs	52.8 μs
10K docs	162.0 μs	3.25 ms	293.3 μs

Corpus Size	zerodep	rank-bm25	bm25s
1000 docs	137.6 ms	13.3 ms	47.7 ms

Search performance depends on query selectivity. For selective queries (few matching documents), the inverted index provides O(matched_docs) traversal — significantly faster than the O(N) full corpus scan used by rank-bm25. For broad queries matching most documents, numpy-backed libraries (bm25s, rank-bm25) are faster due to vectorized array operations.
At small corpus sizes (< 1K docs), all implementations are sub-millisecond — performance differences are negligible for typical use cases like search result reranking or RAG retrieval.
bm25s scales best for large corpora with broad queries thanks to sparse matrix operations. Its search time grows slowly with corpus size.
Indexing speed is fastest for rank-bm25 (simpler data structures); bm25s is moderate; zerodep is slowest due to its 3-level inverted index overhead.
Ranking correctness is validated against rank-bm25 across BM25Okapi, BM25Plus, and BM25L variants with 8 queries — results match in ranking order.
zerodep has zero pip dependencies and provides features not available in rank-bm25 or bm25s: BM25F multi-field search, dynamic add/remove/update, metadata filtering, RRF/MMR, and Bayesian calibration.

pip install pytest pytest-benchmark rank-bm25 bm25s
pytest sparse_search/test_sparse_search_benchmark.py --benchmark-only -v

Updated automatically on each release via Benchmark CI.