Deep Research · Lexical Retrieval · Search Agents
Will a lexical retriever suffice as large language models become stronger in an agentic loop?
University of Waterloo · Stencilzeit
Abstract
This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have strong reasoning and tool-use abilities.
To support researchers asking the same question, we introduce PI-SERINI, a search agent equipped with three tools for retrieving, browsing, and reading documents.
Our results show that, on BrowseComp-Plus and with strong LLMs, a lexical retriever can be sufficient for effective deep research when it is well-configured and used with sufficient retrieval depth. Specifically, PI-SERINI with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers.
Ablations further show that a well-configured BM25 improves answer accuracy by 18.0% and surfaced evidence recall by 11.1%, while increasing retrieval depth further improves surfaced evidence recall by 25.3%. Source code is available on GitHub.
Method
PI-SERINI isolates the interaction between the LLM agent and the retriever: the retriever surfaces a deep BM25 ranking, while the retrieval controller manages caching, pagination, and how evidence is revealed to the agent.
Figure: PI-SERINI architecture. The agent's three tools (search, browse, read) reach a BM25 backend tuned with k1 = 25, b = 1 through the retrieval controller. Blue components belong to PI-SERINI; the retrieval controller is the main isolation point between the agent and the search engine.
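For concreteness, here is a minimal sketch of the tuned retrieval stage, assuming a Pyserini-style `LuceneSearcher` API; the index path is a placeholder, and only the k1, b, and depth values come from the paper.

```python
# Minimal sketch of the deep lexical retrieval step, assuming a
# Pyserini-style LuceneSearcher API; the index path is illustrative.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/browsecomp-plus")  # hypothetical index path
searcher.set_bm25(k1=25.0, b=1.0)  # tuned parameters from the figure

# Retrieve deep (up to 1000 hits) so the controller can page through
# the cached ranking later without issuing new backend queries.
hits = searcher.search("example query", k=1000)
ranking = [(h.docid, h.score) for h in hits]
```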
- `search`: Issues a raw lexical query, retrieves up to 1000 ranked documents, caches the ranking, and shows only the first few excerpts.
- `read_search_results`: Browses a cached ranking by rank offset without launching a new backend query, allowing broader inspection at low context cost.
- `read_document`: Reads selected documents through line-based pagination so the agent can inspect relevant text pieces rather than full documents.
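One way to realize these tools is a thin controller that owns the ranking cache and reveals evidence incrementally. The sketch below is illustrative: the class name, page sizes, and method signatures are assumptions rather than the paper's implementation; only the tool semantics follow the descriptions above.

```python
# Illustrative controller sketch for the three tools; names and page
# sizes are assumptions, not the paper's actual implementation.
from dataclasses import dataclass, field

@dataclass
class RetrievalController:
    searcher: object                      # e.g., a Pyserini-style searcher
    cache: dict = field(default_factory=dict)

    def search(self, query: str, depth: int = 1000, preview: int = 5):
        """Run one backend query, cache the full ranking, show a few excerpts."""
        hits = self.searcher.search(query, k=depth)
        self.cache[query] = [(h.docid, h.score) for h in hits]
        return self.cache[query][:preview]

    def read_search_results(self, query: str, offset: int, count: int = 10):
        """Page through the cached ranking by rank offset; no new backend query."""
        return self.cache[query][offset : offset + count]

    def read_document(self, doc_text: str, start_line: int, num_lines: int = 40):
        """Line-based pagination so the agent reads slices, not whole documents."""
        lines = doc_text.splitlines()
        return "\n".join(lines[start_line : start_line + num_lines])
```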
Results
PI-SERINI improves answer quality, evidence recall, and cost efficiency across multiple LLM families. The main comparison below covers the full BrowseComp-Plus benchmark; accuracy, calibration, and recall are percentages, and cost is reported in US dollars.
| Source | LLM | Retriever | Acc. | Calib. | Cost | Surfaced Recall (Evi.) | Surfaced Recall (Gold) | Previewed Recall (Evi.) | Previewed Recall (Gold) | Behavior Recall (Evi.) | Behavior Recall (Gold) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chen et al. (2025) | o3 | bm25 | 50.84 | 39.09 | $836.35 | 56.64 | 61.69 | - | - | - | - |
| Chen et al. (2025) | o3 | qwen3-embed-8b | 66.27 | 32.73 | $740.79 | 73.24 | 76.30 | - | - | - | - |
| Chen et al. (2025) | gpt-5 | bm25 | 58.31 | 13.53 | $400.36 | 61.70 | 66.49 | - | - | - | - |
| Chen et al. (2025) | gpt-5 | qwen3-embed-8b | 73.01 | 9.72 | $360.71 | 78.98 | 81.34 | - | - | - | - |
| Meng et al. (2026) | gpt-5.2 | qwen3-embed-8b | 45.10 | - | $1k-$2k | - | 74.70 | - | - | - | - |
| Chen et al. (2026) | Tongyi-DR | AgentIR-4B | 68.07 | - | - | 79.21 | - | - | - | - | - |
| PI-SERINI | claude-haiku-4.5 | bm25 | 54.82 | 17.33 | $193.50 | 94.06 | 95.35 | 58.19 | 60.08 | 40.85 | 46.31 |
| PI-SERINI | claude-opus-4.7 | bm25 | 69.76 | 10.17 | $246.57 | 81.19 | 86.78 | 43.27 | 52.46 | 30.44 | 43.04 |
| PI-SERINI | gpt-5 | bm25 | 74.58 | 7.19 | $94.92 | 90.35 | 93.82 | 62.66 | 70.02 | 45.74 | 56.16 |
| PI-SERINI | gpt-5.2 | bm25 | 70.48 | 6.24 | $122.22 | 89.89 | 92.83 | 60.54 | 67.26 | 44.84 | 54.48 |
| PI-SERINI | gpt-5.4 | bm25 | 73.25 | 9.20 | $175.46 | 93.78 | 95.32 | 70.33 | 66.86 | 51.79 | 58.07 |
| PI-SERINI | gpt-5.5 | bm25 | 83.13 | 15.73 | $291.55 | 94.70 | 94.43 | 73.55 | 72.87 | 58.87 | 56.12 |
| PI-SERINI | deepseek-v4-flash | bm25 | 68.07 | 15.52 | $28.92 | 94.48 | 95.74 | 67.87 | 69.92 | 55.17 | 60.56 |
| PI-SERINI | deepseek-v4-pro | bm25 | 71.43 | 6.95 | $55.08 | 91.32 | 92.46 | 59.85 | 62.97 | 45.41 | 50.56 |
Key Findings
- With tuned BM25 parameters and deeper retrieval, PI-SERINI surfaces high-recall candidate evidence before the agent begins deeper inspection.
- Using `search`, `read_search_results`, and `read_document`, PI-SERINI caches a retrieved ranking, explores deeper retrieval results, and selectively decides which evidence enters its context window (see the usage sketch after this list).
- Surfaced recall is much higher than previewed and behavior recall, suggesting that future gains may come from better evidence navigation rather than only better retrievers.
- Time-budget steering and cache-friendly interaction make repeated full-benchmark evaluations more practical for deep research systems.
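To make this interaction pattern concrete, here is a hedged usage sketch that reuses the illustrative `RetrievalController` and `searcher` from the Method section; the query, offset, and document text are placeholders.

```python
# Usage sketch of the cache-then-page pattern; reuses the illustrative
# RetrievalController and searcher defined in the Method section.
controller = RetrievalController(searcher=searcher)

query = "placeholder query"
preview = controller.search(query, depth=1000)             # one backend call, ranking cached
deeper = controller.read_search_results(query, offset=50)  # page the cache, no new query
# The agent then chooses which candidate to inspect, slice by slice.
snippet = controller.read_document("...full document text...", start_line=0, num_lines=40)
```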
Citation
Use the arXiv citation below.
```bibtex
@misc{hsu2026rethinkingagenticsearchpiserini,
  title         = {Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?},
  author        = {Tz-Huan Hsu and Jheng-Hong Yang and Jimmy Lin},
  year          = {2026},
  eprint        = {2605.10848},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2605.10848}
}
}