Hybrid Search: BM25 and Embeddings Are Better Together

A user typed ERR-4021 into the support search and got back a warm essay about general error handling. Not the runbook that mentions ERR-4021 by name, which exists, which we had indexed, which the old keyword search would have found in one hop. The new shiny vector search sailed right past it.

That was the morning I stopped believing the demos.

Dense retrieval is genuinely good at meaning. Ask it a fuzzy question in your own words and it pulls back chunks that are about the same thing even when they share not one word with your query. That is the trick that makes the demo land. The query says “how do I get my money back” and the document says “refund policy” and the embedding model knows those are the same idea. Real magic, no notes.

Here is the part nobody tells you. That same model is mediocre at the thing your users do constantly: typing an exact token and expecting the exact token back. A product SKU. An error code. A function name. A surname. A legal clause number. Embeddings smear all of that into a dense vector where ERR-4021 and ERR-4022 sit basically on top of each other, because to the model they look like the same shape of thing. Cosine similarity can barely tell them apart. Keyword search never confused them for a second.

Where dense-only quietly fails

I keep a short list of the cases that broke retrieval in production, because they are unintuitive until you have been bitten.

Acronyms and IDs. SOX, PII, ERR-4021, v2.3.1. Rare tokens tend to carry little signal in an embedding, because the model saw few examples of them in training and its tokenizer usually breaks them into generic fragments. They carry enormous signal in BM25, which rewards rarity directly.

Negations. “refund not allowed” and “refund allowed” embed to nearly the same point. The model captures the topic, not the polarity. Dense retrieval will happily hand you the document that says the opposite of what the user needs and rank it first.

Exact phrases. When someone quotes a clause or a config key verbatim, they want the literal match, not the semantic neighborhood. Vector search treats their precise wording as a vague hint.

None of this means embeddings are bad. It means they have a blind spot, and the blind spot is precisely the high-stakes, low-tolerance queries where a wrong answer costs you trust. So you stop choosing. You run both.

Two retrievers, one query

BM25 is the lexical workhorse from the keyword era, and it never stopped being good at what it does. It scores documents by how often your query terms appear, weighted by inverse document frequency so common words count for little and rare words count for a lot, with repeated terms giving diminishing returns and long documents adjusted so they cannot win on bulk alone. Dense retrieval scores by cosine similarity in embedding space. They fail in opposite directions, which is the whole reason to keep both.

def hybrid_retrieve(query, k=50):
    # Two independent searches over the SAME corpus.
    # BM25 catches the exact tokens embeddings smear away;
    # dense catches the paraphrases BM25 is blind to.
    lexical = bm25_index.search(query, top_k=k)   # [(doc_id, bm25_score), ...]
    qvec    = embed(query)
    dense   = vec_index.search(qvec, top_k=k)      # [(doc_id, cos_sim), ...]
    return lexical, dense
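
For readers who have not looked inside BM25 in a while, here is a minimal sketch of the kind of scoring that sits behind a call like bm25_index.search. It is one common variant (Lucene-style IDF, with the frequently used values k1=1.5 and b=0.75), not necessarily what any particular index does. The whitespace tokenizer and the three toy documents are assumptions made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a large IDF, common terms a small one.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is length-normalized (b).
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace only.
docs = [
    "runbook for err-4021 restart the billing worker".split(),
    "runbook for err-4022 rotate the api key".split(),
    "general guide to error handling and the retry policy".split(),
]
scores = bm25_scores("err-4021 the".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

In this toy corpus the rare token err-4021 supplies most of the winning score, while the word "the", which appears in every document, barely moves anything. A production analyzer may split on the hyphen, so it is worth checking how your own ID formats get tokenized.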

Now you have two ranked lists that disagree. The obvious move is to normalize both score sets onto the same scale and add them. Do not do that. BM25 scores are unbounded and corpus-dependent, cosine similarity is confined to the range -1 to 1, and any normalization you pick is a fudge factor you will be hand-tuning forever. I tried it. I spent a week on min-max scaling and weighting coefficients before I admitted I was polishing a guess.
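
To make the fragility concrete, here is a small sketch of min-max normalization plus a weighted sum. The scores, document ids, and the alpha weight are all invented for illustration. The point is only that adding one document with an outlier BM25 score changes how every other document is scaled, and can flip two documents whose raw scores never moved.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict onto [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fusion(bm25, cosine, alpha=0.5):
    """Normalize both score sets, blend them, return doc ids best first."""
    b, c = min_max(bm25), min_max(cosine)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * c.get(d, 0.0)
             for d in set(b) | set(c)}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores for three documents.
bm25   = {"a": 3.0, "b": 6.0, "c": 2.0}
cosine = {"a": 0.9, "b": 0.7, "c": 0.5}
before = weighted_fusion(bm25, cosine)

# Add one document with an outlier BM25 score. Nothing about a or b changed.
after = weighted_fusion({**bm25, "x": 40.0}, {**cosine, "x": 0.5})
```

Before the outlier arrives, b ranks above a. Afterwards the BM25 scale is stretched so far that b's lexical advantage nearly vanishes and a ranks above b. A rank-based method would not have moved them relative to each other in that list.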

Fuse on rank, not on score

Reciprocal rank fusion throws the scores away and keeps only the position. A document that lands near the top of either list gets a high fused score, and the magnitudes never have to agree because they are never compared. It is almost insultingly simple and it beat every weighted-score scheme I built.

def reciprocal_rank_fusion(ranked_lists, k=60):
    # k is a smoothing constant, not the top-k from retrieval.
    # 60 is the value from the original RRF paper and I have
    # never found a reason to move it. Bigger k flattens the
    # contribution of rank; smaller k makes the #1 spot dominate.
    scores = {}
    for ranked in ranked_lists:
        # ranks are 1-based, as in the RRF paper, so the top hit scores 1/(k+1)
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            # note we ignore _score entirely. that is the point.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def search(query, final_k=10):
    lexical, dense = hybrid_retrieve(query, k=50)
    fused = reciprocal_rank_fusion([lexical, dense])
    return fused[:final_k]

The first time I ran this against my eval set, the ERR-4021 query came back correct. So did the refund-policy paraphrase that BM25 alone had missed. Neither retriever was carrying the other. They were covering each other’s holes. The documents both lists agreed on floated to the top, and the ones only one list found still got a fair shot because a strong rank in a single list is enough to survive fusion.
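
A worked example makes the arithmetic easy to check by hand. This is a self-contained sketch using 1-based ranks, bare document ids instead of (doc_id, score) pairs, and two made-up ranked lists; the ids are hypothetical and not taken from any real eval set.

```python
def rrf(ranked_lists, k=60):
    """Reciprocal rank fusion over lists of doc ids, best first."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Hypothetical results for one query from each retriever.
lexical = ["runbook-4021", "faq-errors", "refund-policy"]
dense   = ["essay-errors", "runbook-4021", "refund-policy"]

fused = rrf([lexical, dense])
order = [doc_id for doc_id, _ in fused]
# runbook-4021:  1/61 + 1/62   (rank 1 lexical, rank 2 dense)
# refund-policy: 1/63 + 1/63   (rank 3 in both)
# essay-errors:  1/61          (rank 1 dense only)
# faq-errors:    1/62          (rank 2 lexical only)
```

The two documents that appear in both lists end up on top, even though one of them was only third in each. The single-list documents still make the candidate set, just lower down, which is what you want from a step whose job is to build a pool rather than to pick a winner.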

Then rerank, because fusion is coarse

RRF gives you a good candidate set, not a final order. It knows where a document ranked, not how relevant it actually is to this specific query. So I add one more pass: rather than cutting the fused list at ten, take the top forty or fifty fused candidates and run them through a cross-encoder reranker, which reads the query and each document together and scores the pair directly.

def rerank(query, candidates, keep=8):
    # cross-encoder reads (query, doc) as one sequence, so it sees
    # word-level interaction the bi-encoder never could. slow, but
    # we only ever feed it ~50 docs, not the whole corpus.
    pairs  = [(query, store[doc_id].text) for doc_id, _ in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    # return the original (doc_id, fused_score) pairs, best rerank score first
    return [candidate for candidate, _ in ranked[:keep]]

This is the step that turned “the right answer is somewhere in the top ten” into “the right answer is at position one or two.” The reranker is too expensive to run over a whole corpus, which is exactly why the retrieve-then-fuse-then-rerank shape exists. Cheap retrievers cast a wide net, fusion narrows it honestly, and the expensive model only ever looks at a few dozen finalists.
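
Wired together, the whole shape might look like the sketch below. Everything here is a stand-in: the two retrievers return canned lists, and score_pair is a crude token-overlap function playing the role of a cross-encoder so the example runs without a model. The function names, the pool and keep parameters, and the toy texts are assumptions for illustration.

```python
def fuse(ranked_lists, k=60):
    """Reciprocal rank fusion; returns doc ids, best first."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def search_with_rerank(query, retrievers, score_pair, texts, pool=50, keep=8):
    """Retrieve with every retriever, fuse on rank, rerank a small pool."""
    ranked_lists = [retrieve(query) for retrieve in retrievers]
    candidates = fuse(ranked_lists)[:pool]          # wide, cheap net
    scored = [(doc_id, score_pair(query, texts[doc_id])) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # expensive pass, few docs
    return [doc_id for doc_id, _ in scored[:keep]]

# --- toy stand-ins, not real retrievers or a real cross-encoder ---
texts = {
    "runbook": "ERR-4021 means the billing worker crashed; restart it",
    "essay": "a general essay about error handling",
    "faq": "frequently asked questions about errors",
}

def lexical_stub(query):
    return ["faq", "runbook"]

def dense_stub(query):
    return ["essay", "faq", "runbook"]

def overlap(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

results = search_with_rerank(
    "ERR-4021 billing worker", [lexical_stub, dense_stub], overlap, texts, keep=2
)
```

Fusion alone would have put the FAQ first here, because both stub retrievers ranked it above the runbook. The pairwise scorer is what moves the runbook to position one, which is the division of labor the section describes.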

The latency bill, which is real

I am not going to pretend this is free. You went from one index lookup to a lexical search, an embedding call, a vector search, a fusion step, and a cross-encoder pass over fifty documents. The embedding and the rerank are the costs that matter. The rest is noise.

A few things that kept it sane. Run BM25 and the dense search concurrently, since they share nothing until fusion, so the two retrievals cost you the max of the two and not the sum. Cache the query embedding, because the same questions recur far more than you would guess. And keep the rerank candidate pool small. The jump from twenty to fifty candidates bought me real recall. The jump from fifty to two hundred bought me latency and a rounding error of quality.
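
The first two of those are only a few lines each. The sketch below assumes both searches are I/O-bound calls to some backend, which is why plain threads are enough. The search functions and the embedding call are stand-ins that sleep instead of doing real work, and the cache size is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_embed(query):
    # Stand-in for a real embedding call. Returns a tuple so the
    # cached value is immutable and safe to share between callers.
    time.sleep(0.05)
    return (float(len(query)),)

def lexical_search(query):
    time.sleep(0.05)  # pretend network round trip to the keyword index
    return ["doc-a", "doc-b"]

def dense_search(query):
    _qvec = cached_embed(query)  # repeated queries skip the embedding call
    return ["doc-b", "doc-c"]

_pool = ThreadPoolExecutor(max_workers=2)

def retrieve_concurrently(query):
    # Submit both searches first, then wait: total time is roughly
    # the slower of the two rather than their sum.
    lexical_future = _pool.submit(lexical_search, query)
    dense_future = _pool.submit(dense_search, query)
    return lexical_future.result(), dense_future.result()

lexical, dense = retrieve_concurrently("ERR-4021")
lexical, dense = retrieve_concurrently("ERR-4021")  # second call hits the cache
```

The cache is keyed on the exact query string, so whether to lowercase or strip whitespace before the lookup is a separate decision. For exact-token queries like error codes, normalizing too aggressively can merge queries that should stay distinct.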

The honest tradeoff is this. Hybrid plus reranking roughly tripled my retrieval latency against dense-only. It also stopped the search from confidently failing on every error code, SKU, and clause number a user could type, which in a support and internal-docs setting is most of the queries that actually matter. I will pay a hundred milliseconds to not return the opposite of the truth.

The teams that ship pure vector search and call it done are usually the teams whose eval set is all soft, paraphrase-style questions, because those are the questions that demo well. Put a single error code in your eval set. Put a negation in it. Put one verbatim clause. Then watch which retriever survives, and whether you can still afford to choose.
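
A starting point for that eval set can be very small. The sketch below reports a hit rate per query category, so a retriever that is strong on paraphrases and weak on exact tokens cannot hide behind an average. The cases, document ids, and the config key are hypothetical, and stub_search returns canned answers purely so the example runs; swap in your own search function.

```python
from collections import defaultdict

def hit_rate_by_category(eval_cases, search_fn, k=10):
    """Fraction of cases per category whose expected doc is in the top k."""
    totals, hits = defaultdict(int), defaultdict(int)
    for case in eval_cases:
        totals[case["category"]] += 1
        if case["expected_doc"] in search_fn(case["query"])[:k]:
            hits[case["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical cases: one per failure mode, plus one easy paraphrase.
EVAL_CASES = [
    {"category": "exact_id", "query": "ERR-4021",
     "expected_doc": "runbook-err-4021"},
    {"category": "negation", "query": "refund not allowed",
     "expected_doc": "policy-no-refund"},
    {"category": "verbatim", "query": '"max_retry_interval"',
     "expected_doc": "config-reference"},
    {"category": "paraphrase", "query": "how do I get my money back",
     "expected_doc": "refund-policy"},
]

def stub_search(query):
    # Canned stand-in, not a measurement of any real retriever.
    canned = {"how do I get my money back": ["refund-policy", "faq"]}
    return canned.get(query, ["essay-error-handling"])

report = hit_rate_by_category(EVAL_CASES, stub_search)
```

Run the same cases through each retriever on its own and through the fused pipeline, and compare the per-category numbers side by side.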