Deep Research · Lexical Retrieval · Search Agents

Pi-serini: Rethinking the Retrieval
Interface for Agentic Search

Will a lexical retriever suffice as large language models become stronger in an agentic loop?

Tz-Huan Hsu · Jheng-Hong Yang · Jimmy Lin

University of Waterloo · Stencilzeit

Deeper Ranking Navigation: Three tools help the agent cache rankings, browse deeper results, and read selected documents.
BM25 Revisited: Long-document tuning and deeper cached rankings expose strong lexical retrieval capacity.
Cost-Aware Evaluation: Prefix-cache-friendly interaction reduces expensive evaluation runs.

Abstract

Will a lexical retriever suffice as LLMs become stronger in an agentic loop?

This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have strong reasoning and tool-use abilities.

To support researchers asking the same question, we introduce PI-SERINI, a search agent equipped with three tools for retrieving, browsing, and reading documents.

Our results show that, on BrowseComp-Plus and with strong LLMs, a lexical retriever can be sufficient for effective deep research when it is well-configured and used with sufficient retrieval depth. Specifically, PI-SERINI with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers.

Ablations further show that a well-configured BM25 improves answer accuracy by 18.0% and surfaced evidence recall by 11.1%, while increasing retrieval depth further improves surfaced evidence recall by 25.3%. Source code is available on GitHub.

Method

A deliberately small interface for evidence acquisition.

PI-SERINI isolates the interaction between the LLM agent and the retriever. The retriever surfaces a deep BM25 ranking, while the retrieval controller manages caching, pagination, and how evidence is revealed.

System overview (figure components):
Question: BrowseComp-Plus query
System Prompt: retrieval workflow (Appendix A.1)
Time Budget: steer to answer or block tools
LLM Agent: ReAct loop
Final Answer: fixed response format
Retrieval Controller: main isolation point
Tool API: search · browse · read
State Management: cache · pagination · spill files
Search Engine: Anserini BM25 (k1 = 25, b = 1)

Blue components belong to PI-SERINI; the retrieval controller is the main isolation point between the agent and search engine.
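
For concreteness, here is a minimal sketch of the search-engine side, assuming Pyserini (the Python front end to Anserini); the index path is a placeholder, and the k1 = 25, b = 1 setting and 1000-document depth follow the figure and tool descriptions.

# Minimal sketch: Anserini BM25 via Pyserini with the long-document
# parameters from the figure (k1 = 25, b = 1) and a deep candidate pool.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/browsecomp-plus")  # placeholder index path
searcher.set_bm25(k1=25.0, b=1.0)

hits = searcher.search("example lexical query", k=1000)
for hit in hits[:5]:  # only the first few excerpts are shown to the agent
    print(hit.docid, round(hit.score, 2))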

01

search

Issues a raw lexical query, retrieves up to 1000 ranked documents, caches the ranking, and shows only the first few excerpts.

02

read_search_results

Browses a cached ranking by rank offset without launching a new backend query, allowing broader inspection at low context cost.

03

read_document

Reads selected documents through line-based pagination so the agent can inspect relevant text pieces rather than full documents.
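
Taken together, the three tools form a small controller around one deep query. The sketch below is hypothetical: the class shape, method signatures, and defaults are illustrative assumptions rather than Pi-serini's actual API, and the searcher is assumed to expose a Pyserini-style interface.

class RetrievalController:
    # Hypothetical controller shape; names and defaults are assumptions.
    def __init__(self, searcher, excerpt_chars=300):
        self.searcher = searcher      # e.g. a Pyserini LuceneSearcher
        self.cache = {}               # query -> cached deep ranking
        self.excerpt_chars = excerpt_chars

    def search(self, query, k=1000, show=5):
        # One backend query: cache the deep ranking, reveal only top excerpts.
        self.cache[query] = self.searcher.search(query, k=k)
        return self.read_search_results(query, offset=0, n=show)

    def read_search_results(self, query, offset=0, n=10):
        # Browse the cached ranking by rank offset; no new backend query.
        hits = self.cache[query][offset:offset + n]
        return [(h.docid,
                 self.searcher.doc(h.docid).contents()[:self.excerpt_chars])
                for h in hits]

    def read_document(self, docid, start_line=0, n_lines=40):
        # Line-based pagination: the agent reads pieces, not whole documents.
        lines = self.searcher.doc(docid).contents().splitlines()
        return "\n".join(lines[start_line:start_line + n_lines])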

Results

BM25 is not the bottleneck when the agent can use it well.

PI-SERINI improves answer quality, evidence recall, and cost efficiency across multiple LLM families. The main comparison below focuses on the BrowseComp-Plus full benchmark.

83.13% Best accuracy with GPT-5.5 + BM25
94.70% Evidence surfaced recall
$291.55 Full run with GPT-5.5
$28.92 DeepSeek flash cost-performance run

Accuracy–Cost Trade-off

Figure: PI-SERINI with gpt-5.5 reaches 83.13% accuracy at $291.55 using BM25, the best accuracy in this comparison.

Full Results

Source | LLM | Retriever | Acc. | Calib. | Cost | Surfaced Evi. | Surfaced Gold | Previewed Evi. | Previewed Gold | Behavior Evi. | Behavior Gold
Chen et al. (2025) | o3 | bm25 | 50.84 | 39.09 | $836.35 | 56.64 | 61.69 | - | - | - | -
Chen et al. (2025) | o3 | qwen3-embed-8b | 66.27 | 32.73 | $740.79 | 73.24 | 76.30 | - | - | - | -
Chen et al. (2025) | gpt-5 | bm25 | 58.31 | 13.53 | $400.36 | 61.70 | 66.49 | - | - | - | -
Chen et al. (2025) | gpt-5 | qwen3-embed-8b | 73.01 | 9.72 | $360.71 | 78.98 | 81.34 | - | - | - | -
Meng et al. (2026) | gpt-5.2 | qwen3-embed-8b | 45.10 | - | $1k-$2k | - | 74.70 | - | - | - | -
Chen et al. (2026) | Tongyi-DR | AgentIR-4B | 68.07 | - | - | 79.21 | - | - | - | - | -
PI-SERINI | claude-haiku-4.5 | bm25 | 54.82 | 17.33 | $193.50 | 94.06 | 95.35 | 58.19 | 60.08 | 40.85 | 46.31
PI-SERINI | claude-opus-4.7 | bm25 | 69.76 | 10.17 | $246.57 | 81.19 | 86.78 | 43.27 | 52.46 | 30.44 | 43.04
PI-SERINI | gpt-5 | bm25 | 74.58 | 7.19 | $94.92 | 90.35 | 93.82 | 62.66 | 70.02 | 45.74 | 56.16
PI-SERINI | gpt-5.2 | bm25 | 70.48 | 6.24 | $122.22 | 89.89 | 92.83 | 60.54 | 67.26 | 44.84 | 54.48
PI-SERINI | gpt-5.4 | bm25 | 73.25 | 9.20 | $175.46 | 93.78 | 95.32 | 70.33 | 66.86 | 51.79 | 58.07
PI-SERINI | gpt-5.5 | bm25 | 83.13 | 15.73 | $291.55 | 94.70 | 94.43 | 73.55 | 72.87 | 58.87 | 56.12
PI-SERINI | deepseek-v4-flash | bm25 | 68.07 | 15.52 | $28.92 | 94.48 | 95.74 | 67.87 | 69.92 | 55.17 | 60.56
PI-SERINI | deepseek-v4-pro | bm25 | 71.43 | 6.95 | $55.08 | 91.32 | 92.46 | 59.85 | 62.97 | 45.41 | 50.56

All metric values are percentages; Cost is in USD per full run; "-" marks entries not reported by the source.

Key Findings

The bottleneck shifts from retrieval to evidence navigation.

Lexical retrieval is stronger than prior baselines suggest.

With tuned BM25 parameters and deeper retrieval, PI-SERINI surfaces high-recall candidate evidence before the agent begins deeper inspection.

Tool granularity enables deeper retrieval.

With search, read_search_results, and read_document, PI-SERINI caches a retrieved ranking, explores deeper retrieval results, and selectively decides which evidence enters its context window.

Navigation remains difficult.

Surfaced recall is much higher than previewed and behavior recall, suggesting that future gains may come from better evidence navigation rather than only better retrievers.

Evaluation cost can be reduced.

Time-budget steering and cache-friendly interaction make repeated full-benchmark evaluations more practical for Deep Research systems.
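
As one illustration of time-budget steering (an assumed mechanism, based on the Time Budget component in the system figure rather than Pi-serini's published code), a controller can nudge the agent to answer after a soft deadline and block further tool calls after a hard one:

import time

class TimeBudget:
    # Hypothetical sketch; thresholds and messages are assumptions.
    def __init__(self, soft_s=600.0, hard_s=900.0):
        self.start = time.monotonic()
        self.soft_s, self.hard_s = soft_s, hard_s

    def gate(self):
        # Called before each tool call; returns (allowed, steering message).
        elapsed = time.monotonic() - self.start
        if elapsed > self.hard_s:
            return False, "Time budget exhausted: produce the final answer now."
        if elapsed > self.soft_s:
            return True, "Time is running low: prefer answering over more searches."
        return True, ""

Blocking tools rather than aborting the run helps keep the interaction prefix-cache-friendly: the conversation so far is unchanged, so a steering message can be appended without invalidating cached prefixes.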

Citation

Cite PI-SERINI

Use the arXiv citation below.

@misc{hsu2026rethinkingagenticsearchpiserini,
  title         = {Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?},
  author        = {Tz-Huan Hsu and Jheng-Hong Yang and Jimmy Lin},
  year          = {2026},
  eprint        = {2605.10848},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2605.10848}
}