Deep Research · Lexical Retrieval · Search Agents
Will a lexical retriever suffice as large language models become stronger in an agentic loop?
University of Waterloo · Stencilzeit
Abstract
This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have strong reasoning and tool-use abilities.
To support researchers asking the same question, we introduce PI-SERINI, a search agent equipped with three tools for retrieving, browsing, and reading documents.
Our results show that, on BrowseComp-Plus and with strong LLMs, a lexical retriever can be sufficient for effective deep research when it is well-configured and used with sufficient retrieval depth. Specifically, PI-SERINI with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers.
Ablations further show that a well-configured BM25 improves answer accuracy by 18.0% and surfaced evidence recall by 11.1%, while increasing retrieval depth further improves surfaced evidence recall by 25.3%. Source code is available on GitHub.
Method
PI-SERINI isolates the interaction between the LLM agent and the retriever: the retriever surfaces a deep BM25 ranking, while the retrieval controller manages caching, pagination, and how evidence is revealed to the agent.
Figure: PI-SERINI architecture. The agent's three tools (search, browse, read) reach a BM25 backend tuned with k1 = 25, b = 1 through the retrieval controller. Blue components belong to PI-SERINI; the retrieval controller is the main isolation point between the agent and the search engine.
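For concreteness, here is a minimal sketch of the tuned retrieval stage, assuming a Pyserini-style `LuceneSearcher` API; the index path is a placeholder, and only the k1, b, and depth values come from the paper.

```python
# Minimal sketch of the deep lexical retrieval step, assuming a
# Pyserini-style LuceneSearcher API; the index path is illustrative.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/browsecomp-plus")  # hypothetical index path
searcher.set_bm25(k1=25.0, b=1.0)  # tuned parameters from the figure

# Retrieve deep (up to 1000 hits) so the controller can page through
# the cached ranking later without issuing new backend queries.
hits = searcher.search("example query", k=1000)
ranking = [(h.docid, h.score) for h in hits]
```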
- `search`: Issues a raw lexical query, retrieves up to 1000 ranked documents, caches the ranking, and shows only the first few excerpts.
- `read_search_results`: Browses a cached ranking by rank offset without launching a new backend query, allowing broader inspection at low context cost.
- `read_document`: Reads selected documents through line-based pagination so the agent can inspect relevant text pieces rather than full documents.
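One way to realize these tools is a thin controller that owns the ranking cache and reveals evidence incrementally. The sketch below is illustrative: the class name, page sizes, and method signatures are assumptions rather than the paper's implementation; only the tool semantics follow the descriptions above.

```python
# Illustrative controller sketch for the three tools; names and page
# sizes are assumptions, not the paper's actual implementation.
from dataclasses import dataclass, field

@dataclass
class RetrievalController:
    searcher: object                      # e.g., a Pyserini-style searcher
    cache: dict = field(default_factory=dict)

    def search(self, query: str, depth: int = 1000, preview: int = 5):
        """Run one backend query, cache the full ranking, show a few excerpts."""
        hits = self.searcher.search(query, k=depth)
        self.cache[query] = [(h.docid, h.score) for h in hits]
        return self.cache[query][:preview]

    def read_search_results(self, query: str, offset: int, count: int = 10):
        """Page through the cached ranking by rank offset; no new backend query."""
        return self.cache[query][offset : offset + count]

    def read_document(self, doc_text: str, start_line: int, num_lines: int = 40):
        """Line-based pagination so the agent reads slices, not whole documents."""
        lines = doc_text.splitlines()
        return "\n".join(lines[start_line : start_line + num_lines])
```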
Results
PI-SERINI improves answer quality, evidence recall, and cost efficiency across multiple LLM families. The main comparison below covers the full BrowseComp-Plus benchmark; accuracy, calibration, and recall are percentages, and cost is reported in US dollars.
| Source | LLM | Retriever | Acc. | Calib. | Cost | Surfaced Recall (Evi.) | Surfaced Recall (Gold) | Previewed Recall (Evi.) | Previewed Recall (Gold) | Behavior Recall (Evi.) | Behavior Recall (Gold) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chen et al. (2025) | o3 | bm25 | 50.84 | 39.09 | $836.35 | 56.64 | 61.69 | - | - | - | - |
| Chen et al. (2025) | o3 | qwen3-embed-8b | 66.27 | 32.73 | $740.79 | 73.24 | 76.30 | - | - | - | - |
| Chen et al. (2025) | gpt-5 | bm25 | 58.31 | 13.53 | $400.36 | 61.70 | 66.49 | - | - | - | - |
| Chen et al. (2025) | gpt-5 | qwen3-embed-8b | 73.01 | 9.72 | $360.71 | 78.98 | 81.34 | - | - | - | - |
| Meng et al. (2026) | gpt-5.2 | qwen3-embed-8b | 45.10 | - | $1k-$2k | - | 74.70 | - | - | - | - |
| Chen et al. (2026) | Tongyi-DR | AgentIR-4B | 68.07 | - | - | 79.21 | - | - | - | - | - |
| PI-SERINI | claude-haiku-4.5 | bm25 | 54.82 | 17.33 | $193.50 | 94.06 | 95.35 | 58.19 | 60.08 | 40.85 | 46.31 |
| PI-SERINI | claude-opus-4.7 | bm25 | 69.76 | 10.17 | $246.57 | 81.19 | 86.78 | 43.27 | 52.46 | 30.44 | 43.04 |
| PI-SERINI | gpt-5 | bm25 | 74.58 | 7.19 | $94.92 | 90.35 | 93.82 | 62.66 | 70.02 | 45.74 | 56.16 |
| PI-SERINI | gpt-5.2 | bm25 | 70.48 | 6.24 | $122.22 | 89.89 | 92.83 | 60.54 | 67.26 | 44.84 | 54.48 |
| PI-SERINI | gpt-5.4 | bm25 | 73.25 | 9.20 | $175.46 | 93.78 | 95.32 | 70.33 | 66.86 | 51.79 | 58.07 |
| PI-SERINI | gpt-5.5 | bm25 | 83.13 | 15.73 | $291.55 | 94.70 | 94.43 | 73.55 | 72.87 | 58.87 | 56.12 |
| PI-SERINI | deepseek-v4-flash | bm25 | 68.07 | 15.52 | $28.92 | 94.48 | 95.74 | 67.87 | 69.92 | 55.17 | 60.56 |
| PI-SERINI | deepseek-v4-pro | bm25 | 71.43 | 6.95 | $55.08 | 91.32 | 92.46 | 59.85 | 62.97 | 45.41 | 50.56 |
Key Findings
- With tuned BM25 parameters and deeper retrieval, PI-SERINI surfaces high-recall candidate evidence before the agent begins deeper inspection.
- Using `search`, `read_search_results`, and `read_document`, PI-SERINI caches a retrieved ranking, explores deeper retrieval results, and selectively decides which evidence enters its context window (see the usage sketch after this list).
- Surfaced recall is much higher than previewed and behavior recall, suggesting that future gains may come from better evidence navigation rather than only better retrievers.
- Time-budget steering and cache-friendly interaction make repeated full-benchmark evaluations more practical for deep research systems.
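To make this interaction pattern concrete, here is a hedged usage sketch that reuses the illustrative `RetrievalController` and `searcher` from the Method section; the query, offset, and document text are placeholders.

```python
# Usage sketch of the cache-then-page pattern; reuses the illustrative
# RetrievalController and searcher defined in the Method section.
controller = RetrievalController(searcher=searcher)

query = "placeholder query"
preview = controller.search(query, depth=1000)             # one backend call, ranking cached
deeper = controller.read_search_results(query, offset=50)  # page the cache, no new query
# The agent then chooses which candidate to inspect, slice by slice.
snippet = controller.read_document("...full document text...", start_line=0, num_lines=40)
```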
Citation
Use the arXiv citation below.
```bibtex
@misc{hsu2026rethinkingagenticsearchpiserini,
  title         = {Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?},
  author        = {Tz-Huan Hsu and Jheng-Hong Yang and Jimmy Lin},
  year          = {2026},
  eprint        = {2605.10848},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2605.10848}
}
}