Retrieval sits at the heart of modern AI applications. Whether you're building enterprise search, retrieval-augmented generation (RAG), knowledge assistants, recommendation systems, or document discovery platforms, the quality of retrieval often determines the quality of the final user experience. Even the most capable large language model can only work with the information it receives, making retrieval one of the most important components in an AI system.
Over the years, retrieval techniques have evolved from traditional keyword search to semantic vector search and, more recently, hybrid and multi-vector approaches. Terms such as BM25, dense embeddings, sparse embeddings, SPLADE, ColBERT, and hybrid retrieval are now common in AI discussions, but they often appear disconnected from one another.
In reality, most of these approaches are variations of a few core ideas.
This article explains those ideas from first principles and shows how the major retrieval techniques fit together.
The Goal of Retrieval
Before diving into dense vectors, sparse vectors, and modern retrieval techniques, it's worth asking a simple question: what problem are these systems trying to solve?
At its core, retrieval is about finding the most relevant documents for a given query. The challenge is that "relevant" can mean different things. Sometimes users want exact matches, such as product codes, error messages, or specific names. Other times, they want documents that express the same idea even when the wording is completely different.
Modern retrieval systems solve this by converting text into numerical representations and comparing those representations mathematically. Most approaches fall into two broad categories: sparse retrieval, which focuses on matching words, and dense retrieval, which focuses on matching meaning. As we'll see later, many production search and RAG systems combine both approaches to get the best of each.

The dream behind every embedding: turn text into a position, so that texts with similar meanings end up near each other. “cat” sits next to “kitten,” far from “invoice.” Search then becomes finding the nearest neighbours.
That map is the whole motivation. The disagreement between dense and sparse is just how you assign those positions. Hold that picture and the rest falls into place.
1. Dense embeddings: meaning spread thin
A dense vector is a fairly short list of numbers, usually a few hundred to a couple thousand, where almost every slot has a value and none of them mean anything you can name. There's no slot that stands for “cat.” Instead, the idea of a cat is smeared across all the numbers at once, the way a face is spread across the pixels of a photo rather than living in any single pixel.
These come out of trained models, such as sentence-transformer and OpenAI-style embedding models. You feed in text, you get back something like [0.21, -0.88, 0.04, 0.55, ...], and the only promise the model makes is that texts with similar meaning produce vectors that point in similar directions. That's what lets a search for “how do I reset my password” surface a document titled “forgot login credentials,” even though the two share no words. Dense vectors are built for that kind of fuzzy, human, meaning-level match.
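To make "point in similar directions" concrete, here is a minimal sketch using cosine similarity, one common way to compare dense vectors. The four-number vectors are invented for illustration (real models return hundreds of dimensions), and the document labels in the comments are assumptions, not output from any actual embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real model output.
query = [0.21, -0.88, 0.04, 0.55]          # "how do I reset my password"
doc_login = [0.25, -0.80, 0.10, 0.50]      # "forgot login credentials"
doc_invoice = [-0.70, 0.10, 0.65, -0.20]   # an unrelated invoice document

sim_login = cosine_similarity(query, doc_login)
sim_invoice = cosine_similarity(query, doc_invoice)
print(sim_login > sim_invoice)  # True
```

Nothing in the comparison looks at words. The login document wins only because its numbers line up with the query's numbers.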
The cost is that you can't read them. Open one up and there's nothing to inspect, just a wall of decimals. And because meaning is approximate, they can blur things you wanted kept apart, treating a specific product code or a person's surname as “close enough” to something it isn't.
2. Sparse embeddings: one slot per word
The sparse vector takes the opposite bet, and this is the one that trips people up, so we'll go slowly. Picture a dictionary that lists every word that could ever appear, say fifty thousand of them, and give each word its own fixed seat in a very long row. To turn a piece of text into a vector, you walk down that entire row of seats and, for each word, write down a number: how much is this word present in my text?
A short sentence only uses a tiny handful of those fifty thousand words. So almost every seat gets a zero. Only the seats for words you actually used get a real number. Mostly zeros is exactly what “sparse” means. Each slot is readable, because each slot is literally a word.
Here's that row for one sentence. Think of it as taking attendance on the whole dictionary:

A sparse vector is an attendance sheet for words. The word “the” scores 2 because it appears twice; everything not in the sentence scores zero. You store only the non-zero entries.
Notice there's no training and no mystery here. The number in the “cat” seat is just how present the word “cat” is. Anyone can read the vector and understand it, which is the exact thing dense vectors can't do.
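The attendance sheet can be sketched in a few lines. The example sentence below is an assumption chosen to match the description (it contains "cat" and "mat", and "the" appears twice). The sketch keys entries by the word itself rather than by seat number, which is one simple way to store only the non-zero seats.

```python
from collections import Counter

def sparse_vector(text):
    """Count each word; absent words are not stored and read back as zero."""
    return Counter(text.lower().split())

vec = sparse_vector("the cat sat on the mat")
print(dict(vec))        # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
print(vec["invoice"])   # 0, because the word never appeared
```

Five stored entries stand in for a row of fifty thousand seats. Every other seat is an implicit zero.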
How two sparse vectors get compared
This is the part that decides whether sparse search works at all, and it rests on one assumption people skip past: both texts sit on the same row of seats. Seat number 9 means “the” for every document, seat 4 means “mat” for every document. Because the seating is shared, you can line two documents up and walk them seat by seat.
In each seat, you multiply the two numbers together. Then you add up all those products. That single total is the similarity score. (The fancy name for “multiply matching slots and add them up” is a dot product, but the arithmetic is exactly what it sounds like.) Take two sentences:

Every seat where one document has a zero contributes nothing, because anything times zero is zero. Only words present in both texts can add to the score. Here the documents share just “the,” so the total is 2.
Look only at the bottom row. It's almost all zeros, for a simple reason: a seat can only produce a non-zero product if both documents put a number in it. The two sentences here share exactly one word, “the,” so that's the only seat that survives.
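The multiply-and-add can be sketched directly on the stored non-zero entries. The two sentences here are assumed examples that share only the word "the"; any pair with a single shared word would behave the same way.

```python
from collections import Counter

def sparse_dot(a, b):
    """Multiply matching seats and add the products; a seat missing on either side adds 0."""
    return sum(weight * b[word] for word, weight in a.items() if word in b)

doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("a dog chased the ball".split())

score = sparse_dot(doc_a, doc_b)
shared = sorted(set(doc_a) & set(doc_b))
print(score, shared)  # 2 ['the']
```

Only seats filled in both documents are ever visited, which is why sparse search stays cheap even when the dictionary is huge.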
That exposes a real problem. These sentences scored 2 entirely because of the word “the,” and “the” tells you nothing about what either sentence is about. Raw counts let a meaningless filler word drive the whole match.
That's the job of TF-IDF and BM25, the weighting schemes that real sparse search uses instead of plain counts. The instinct is two-sided: a word counts for more when it shows up a lot inside one document, and counts for less the more documents it appears in across the whole collection. (BM25 adds two refinements: repeated occurrences give diminishing returns, and long documents are discounted so they don't win on sheer length.) “the” is in everything, so its weight gets crushed toward nothing. A rare word like “invoice” keeps a heavy weight. Same attendance-sheet structure, smarter numbers in the seats, so the words that survive the multiplication are the ones that actually carry meaning.
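A rough sketch of BM25 weighting on a three-document toy collection shows the effect. The documents, the helper names, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration (those two values are common defaults), and the IDF formula is one commonly used smoothed form rather than the only one.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the invoice is overdue",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avg_len = sum(len(t) for t in tokenized) / N
# Number of documents each word appears in.
doc_freq = Counter(word for t in tokenized for word in set(t))

def idf(word):
    """Rarer words get a larger weight."""
    n = doc_freq[word]
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_weight(word, tokens, k1=1.2, b=0.75):
    """Weight of one word in one document: saturating term frequency times IDF."""
    tf = tokens.count(word)
    length_norm = 1 - b + b * len(tokens) / avg_len
    return idf(word) * tf * (k1 + 1) / (tf + k1 * length_norm)

def bm25_score(query, tokens):
    """Sum the document's weights over the query's words (a sparse dot product)."""
    return sum(bm25_weight(w, tokens) for w in query.split())

scores = [bm25_score("the invoice", t) for t in tokenized]
best = scores.index(max(scores))
print(docs[best])  # the invoice is overdue
```

In this toy collection "the" appears in every document, so its weight is a small fraction of the weight "invoice" receives, and the invoice document wins the query even though the other two contain "the" twice.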
“Wait, isn't sparse just keyword search?”
If you're feeling that the last two sections describe ordinary keyword search, you're right, and it's worth being honest about why. Classic sparse vectors basically are keyword matching. The two phrases describe different things, though, which is where the confusion lives.
“Keyword matching” is a goal: find documents by the literal words they contain. “Sparse vector” is a representation: the row-of-word-seats structure we just drew. They're not rivals. One is the what, the other is the how. And BM25, the algorithm sitting under almost every keyword search engine you've used (Elasticsearch and friends), is at its core a sparse-vector dot product: the query marks its word seats, each document carries BM25 weights in its seats, and the score is the sum of the products. When a search bar ranks results, that ranking is the multiply-and-add we just did by hand.
So why give it a separate name at all? Two reasons that genuinely matter. First, the vector framing turns plain yes/no keyword hits into a ranked relevance score, so you get “this one's a 4, that one's a 2” instead of an unsorted pile of matches. Second, once keyword matching is shaped like a vector, it has the same form as a dense embedding, which means you can keep both in one system and blend their scores. Plain keyword search as a concept doesn't carry that.
There's one place where sparse stops being keyword matching: learned sparse models, of which SPLADE is the best-known example. A normal sparse vector can only light up seats for words that are literally in the text. SPLADE uses a trained model that also lights up seats for closely related words that never appeared, so a document about “car” gets a small weight in the “automobile” and “vehicle” seats too. That's no longer pure keyword matching. It's keyword-shaped, still readable, still a dot product, but with a dose of learned meaning folded in.
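The effect of that expansion can be imitated with a toy. The lookup table below is hand-written and purely hypothetical; a real learned sparse model predicts these weights from context with a trained network instead of reading them from a table. The sketch only shows what changes in the vector and in the score.

```python
# Hypothetical, hand-written stand-in for what a trained model would learn.
RELATED = {
    "car": {"automobile": 0.4, "vehicle": 0.3},
}

def expand(sparse_vec, related=RELATED):
    """Copy the vector, then add small weights in seats for related words."""
    expanded = dict(sparse_vec)
    for word, weight in sparse_vec.items():
        for neighbour, strength in related.get(word, {}).items():
            expanded[neighbour] = max(expanded.get(neighbour, 0.0), weight * strength)
    return expanded

def sparse_dot(a, b):
    """Same multiply-and-add as before, over the stored non-zero seats."""
    return sum(w * b[t] for t, w in a.items() if t in b)

doc = {"red": 1.0, "car": 1.0}
query = {"automobile": 1.0}

print(sparse_dot(query, doc))          # 0: no shared word, no match
print(sparse_dot(query, expand(doc)))  # 0.4: the "automobile" seat is now lit
```

The expanded vector is still a readable list of word seats, so it can live in the same index and be scored with the same dot product.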

Keyword matching and sparse vectors overlap squarely on BM25, the same method seen from two angles. Sparse only pulls away from pure keyword search once you reach learned models like SPLADE. Dense lives in a different world entirely.
Single-Vector vs Multi-Vector Retrieval
Everything we've discussed so far shares one assumption: an entire document is ultimately represented by a single vector.
This works well for short pieces of text, but larger documents present a challenge. A technical document, product manual, or knowledge-base article may contain many different ideas. Compressing all of that information into one vector can cause important details to be diluted or lost.
Multi-vector retrieval takes a different approach. Instead of representing a document with a single vector, it maintains multiple representations for different parts of the document. This allows the retrieval system to compare a query against specific portions of a document rather than only its overall meaning.
The result is often better retrieval quality, particularly for longer documents and more detailed queries where small pieces of information can have a significant impact on relevance.
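A small sketch of the dilution effect follows. It assumes a document split into three chunks with made-up three-dimensional vectors, uses averaging as a stand-in for single-vector compression, and lets the best-matching chunk decide the multi-vector score. Real systems differ in how they pool and score, so treat this as one possible arrangement.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def single_vector_score(query_vec, chunk_vecs):
    """Average the chunks into one document vector, then compare once."""
    dims = len(query_vec)
    mean = [sum(c[i] for c in chunk_vecs) / len(chunk_vecs) for i in range(dims)]
    return dot(query_vec, mean)

def multi_vector_score(query_vec, chunk_vecs):
    """Keep every chunk vector and let the best-matching chunk decide the score."""
    return max(dot(query_vec, c) for c in chunk_vecs)

# One chunk matches the query closely; the other two cover unrelated topics.
query = [1.0, 0.0, 0.0]
manual_chunks = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

single = single_vector_score(query, manual_chunks)
multi = multi_vector_score(query, manual_chunks)
print(round(single, 2), round(multi, 2))  # 0.3 0.9
```

The relevant chunk is a strong match on its own, but averaging it with two unrelated chunks cuts the document's score to a third.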
ColBERT and Late Interaction
ColBERT is one of the most widely used multi-vector retrieval models.
Rather than generating a single embedding for an entire document, ColBERT creates an embedding for each individual token. During retrieval, each query token is matched against its most similar document token, and those best-match similarities are summed into a final relevance score.
This approach, known as late interaction, preserves fine-grained information that would otherwise be compressed away in a single-vector representation. As a result, ColBERT often performs better on longer documents and complex queries, though it requires more storage and computational resources than traditional dense retrieval.
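The scoring step described above is often called MaxSim, and it can be sketched in a few lines. The two-dimensional, unit-length token vectors are invented for illustration; a real ColBERT model produces much larger vectors from a trained encoder.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token vector, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Made-up token vectors: two query tokens, two candidate documents.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # has an exact partner for each query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]               # only approximate partners

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a > score_b)  # True
```

The nested loop is where the extra cost comes from: every query token is compared with every document token, and every document token vector has to be stored.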
Hybrid Retrieval: Where Most Production Systems Land
Dense and sparse retrieval are often presented as competing approaches, but most production systems use both.
Sparse retrieval excels at exact matching. It is particularly effective for product codes, entity names, error messages, and domain-specific terminology where precise words matter.
Dense retrieval excels at semantic matching. It can connect queries and documents that express the same idea using different wording.
Because these strengths are complementary, modern retrieval systems frequently run dense and sparse retrieval in parallel and combine the results during ranking, either by blending normalized scores or by fusing the two ranked lists. This hybrid approach provides both semantic understanding and exact-match precision, making it one of the most common retrieval strategies used in enterprise search and RAG applications today.
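One widely used way to combine the two result lists is reciprocal rank fusion (RRF). It works on rank positions rather than raw scores, which sidesteps the fact that BM25 scores and cosine similarities live on different scales. The sketch below assumes each retriever has already returned an ordered list of document ids; the ids are hypothetical, and k=60 is the conventional constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list adds 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical results from the two retrievers, best first.
sparse_ranking = ["doc_code", "doc_faq", "doc_blog"]
dense_ranking = ["doc_faq", "doc_guide", "doc_code"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)  # ['doc_faq', 'doc_code', 'doc_guide', 'doc_blog']
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still makes the list.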
Modern Retrieval Pipelines
Modern retrieval systems rarely rely on a single retrieval technique. Instead, they are typically built as multi-stage pipelines.
A common architecture may use sparse retrieval, dense retrieval, or both to generate an initial set of candidate documents. Those candidates can then be refined using multi-vector models such as ColBERT and finally re-ranked using cross-encoders or LLM-based rankers.
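The shape of such a pipeline can be sketched with two stages. The function names and both scorers are toy stand-ins invented for this example: word overlap plays the cheap first-stage retriever, and an exact-phrase bonus plays the expensive reranker that a cross-encoder would normally fill.

```python
def retrieve_then_rerank(query, corpus, cheap_score, expensive_score,
                         n_candidates=3, top_k=2):
    """Stage 1 scores everything cheaply; stage 2 re-scores only the shortlist."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]

def word_overlap(query, doc):
    """Toy first stage: number of distinct words shared with the query."""
    return len(set(query.split()) & set(doc.split()))

def phrase_bonus(query, doc):
    """Toy reranker: word overlap plus a bonus when the exact phrase appears."""
    return word_overlap(query, doc) + (5 if query in doc else 0)

corpus = [
    "password reset steps for admins",
    "how to reset your password",
    "reset the router to factory settings",
    "invoice for march",
]
results = retrieve_then_rerank("reset your password", corpus, word_overlap, phrase_bonus)
print(results)  # ['how to reset your password', 'password reset steps for admins']
```

The expensive scorer only ever sees the shortlist, which is what makes it affordable to run slow, accurate models as the last stage.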
Rather than competing with one another, these techniques work together. Each stage contributes a different signal, allowing the overall system to achieve higher retrieval quality than any single approach alone.
Ultimately, every retrieval method involves the same trade-off: deciding what information to preserve and what information to compress. Understanding that trade-off provides a useful framework for understanding almost every retrieval technique used in modern search and RAG systems.