Retrieval is fast but noisy; rerankers are slow but precise. The pattern: retrieve top 50-100 with cheap methods, then rerank to top 5-10 with an expensive cross-encoder that jointly scores query + doc. Cross-encoders catch relevance nuances embeddings miss (entailment, temporal fit, exact entity match).
2026 options: open-weight cross-encoders (BAAI/bge-reranker, Voyage rerank-2, Cohere Rerank 3), or LLM-as-reranker (send top-50 to a smart model and ask it to order by relevance). LLM rerankers cost more but handle complex query intent better. RAG quality almost always improves with a reranker; it's the single highest-leverage retrieval upgrade.
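The LLM-as-reranker route mentioned above can be sketched as follows. This is a minimal illustration, assuming a `call_llm(prompt) -> str` wrapper around whatever model API you use; the function names and prompt wording here are hypothetical, not any provider's SDK:

```python
# Sketch: LLM-as-reranker. `call_llm` is a hypothetical wrapper you supply
# around your model API (OpenAI, Anthropic, a local model, etc.).
import re

def build_rerank_prompt(query, docs):
    """Number the candidates and ask the model for a relevance ordering."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    return (
        f"Query: {query}\n\nCandidates:\n{numbered}\n\n"
        "Order the candidate indices from most to least relevant to the query. "
        "Reply with a comma-separated list of indices only, e.g. 2,0,1."
    )

def parse_ranking(reply, n):
    """Extract indices; drop duplicates and out-of-range values;
    append any candidates the model omitted so none are lost."""
    seen, order = set(), []
    for tok in re.findall(r"\d+", reply):
        i = int(tok)
        if i < n and i not in seen:
            seen.add(i)
            order.append(i)
    order += [i for i in range(n) if i not in seen]
    return order

def llm_rerank(query, docs, call_llm, top_k=10):
    reply = call_llm(build_rerank_prompt(query, docs))
    order = parse_ranking(reply, len(docs))
    return [docs[i] for i in order[:top_k]]
```

Injecting `call_llm` as a parameter keeps the reranker provider-agnostic and makes the prompt-building and parsing logic testable without a live model; the defensive parsing matters because LLMs sometimes return malformed or partial index lists.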
Example Code
```python
# Two-stage retrieve + rerank
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

cross_encoder = CrossEncoder("BAAI/bge-reranker-v2-m3")  # open-weight reranker
candidates = hybrid_search(query, k=50)  # your first-stage retriever: cheap + fast, noisy
# Cross-encoder jointly scores each (query, doc) pair
scores = cross_encoder.predict([(query, doc.text) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
top_10 = [doc for doc, _ in ranked[:10]]
```
When to use it
- RAG quality plateau despite tuning retrieval
- Queries with nuanced intent that embeddings miss
- You can afford 100-300ms of additional latency
When NOT to use it
- Very cheap retrieval budgets (reranker is 5-50x the cost of embed-only)
- Your retrieval is already near-perfect
- Hard-real-time use cases
