Vector Search: Beyond Keywords
The Keyword Ceiling
For decades, search has been built around one fundamental assumption: users know the right words. Type a keyword, the engine finds exact or fuzzy text matches, done. This model — powered by inverted indexes and scoring models like BM25 — works remarkably well for precise, well-defined queries.
But it breaks down when users don't know the right words. And in practice, they often don't.
Consider these real-world gaps:
- "car" should match documents about "automobile" — but keyword search treats them as unrelated.
- "how to reduce customer churn" should match an article titled "Retention strategies for SaaS businesses" — but there's zero keyword overlap.
- A user searching for "comfortable shoes for standing all day" should find products tagged as "ergonomic footwear with arch support" — but the terms are different.
This is the keyword ceiling. And vector search is what breaks through it.
Semantic Vector Retrieval
Unlike keyword search, vector search maps meaning to geometry: documents are positioned in embedding space according to their conceptual relationships.
What Vector Search Actually Is
Vector search doesn't match words. It matches meaning.
Instead of comparing text strings, vector search works with embeddings — dense numerical representations of text in a high-dimensional space (typically 384 to 3,072 dimensions). In this space, semantically similar texts are positioned close together, regardless of the specific words they use.
The process:
- At index time: Each document (or document chunk) is passed through an embedding model, producing a dense vector. This vector is stored alongside the traditional inverted index.
- At query time: The user's query is passed through the same embedding model, producing a query vector.
- Retrieval: The search engine finds documents whose vectors are closest to the query vector using distance metrics (cosine similarity, dot product, or L2 distance).
The result: a query for "affordable family vacation" can match a document about "budget-friendly trips with kids" — because the meaning is similar, even though the words are different.
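The retrieval step can be made concrete with a brute-force sketch (production systems use ANN indexes instead — more on that below). The toy 4-dimensional vectors here stand in for real embedding-model output; cosine similarity ranks documents against the query:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents whose embeddings are
    closest to the query, by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                   # cosine similarity per document
    return np.argsort(-sims)[:k]   # highest similarity first

# Toy 4-dimensional "embeddings" standing in for real model output
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # "budget-friendly trips with kids"
    [0.1, 0.9, 0.0, 0.0],   # "enterprise cloud migration"
    [0.8, 0.2, 0.1, 0.0],   # "cheap family holiday ideas"
])
query = np.array([0.85, 0.15, 0.05, 0.0])  # "affordable family vacation"

print(cosine_top_k(query, docs, k=2))  # → [0 2]
```

Note that document 1 never surfaces despite sharing the same vector space — its embedding simply points in a different direction than the query's.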
How Embeddings Work
Embedding models are trained (usually via transformer architectures) to map text into vector spaces where semantic similarity corresponds to geometric proximity.
The Embedding Pipeline
"comfortable office chair for long hours"
│
▼
┌─────────────────────┐
│ Embedding Model │
│ (e.g., E5, BGE, │
│ all-MiniLM) │
└─────────────────────┘
│
▼
[0.23, -0.11, 0.87, 0.45, ..., -0.33] ← 384-1536 dimensions
Choosing an Embedding Model
The embedding model you choose directly impacts relevance quality. Key considerations:
| Model | Dimensions | Strengths | Trade-offs |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast, small, good for general use | Lower accuracy on domain-specific content |
| E5-large-v2 | 1024 | Strong retrieval quality | Larger, slower; needs "query:"/"passage:" prefixes |
| BGE-large-en | 1024 | Top-tier English retrieval benchmarks | Requires a query instruction prefix |
| OpenAI text-embedding-3-large | 3072 | Very high quality | API dependency, cost at scale |
| Cohere embed-v3 | 1024 | Strong multilingual, search-optimized | API dependency |
Critical insight from production: Small changes in embedding models produce large changes in relevance. I've seen teams switch from all-MiniLM to BGE-large and see a 15-20% improvement in retrieval recall without changing anything else. The model matters more than most tuning parameters.
Domain-Specific Fine-tuning
General-purpose embedding models are trained on broad internet text. They work well for common language but struggle with domain-specific vocabulary — medical terminology, legal jargon, automotive part numbers, or real estate descriptions.
Fine-tuning an embedding model on your domain data can dramatically improve relevance. The typical approach:
- Collect query-document pairs from your search logs (queries + the documents users actually clicked).
- Generate hard negatives — documents that are superficially similar but not relevant.
- Fine-tune using contrastive learning — train the model to push relevant pairs closer and irrelevant pairs apart.
Even 5,000-10,000 training pairs can produce meaningful improvements for domain-specific retrieval.
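The contrastive objective in step 3 can be illustrated with a toy InfoNCE loss. This is a sketch, not a training loop — cosine similarity and a temperature of 0.05 are common but model-specific choices:

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE contrastive loss: small when the query sits close to
    the positive document and far from the hard negatives."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    q = unit(query)
    candidates = unit(np.vstack([positive[None, :], negatives]))
    logits = candidates @ q / temperature   # positive is row 0
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # cross-entropy on row 0
```

Minimizing this loss over thousands of (query, positive, hard-negative) triples is what pulls relevant pairs together and pushes irrelevant ones apart.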
Approximate Nearest Neighbor (ANN) Search
Exact nearest-neighbor search in high-dimensional space is computationally prohibitive at scale. Searching through millions of vectors by computing distance to every single one is too slow.
ANN algorithms trade a small amount of accuracy for dramatic speed improvements:
HNSW (Hierarchical Navigable Small World)
The dominant algorithm in production search engines. HNSW builds a multi-layer graph where:
- The top layer is sparse — large jumps between distant nodes.
- Lower layers are progressively denser — fine-grained navigation.
- Search starts at the top and "descends" through layers, narrowing toward the nearest neighbors.
Used by: Elasticsearch, OpenSearch, Lucene (the engine underneath both).
Tuning parameters:
- m (max connections per node): Higher values improve recall but use more memory. Typical: 16-64.
- ef_construction (build-time beam width): Higher values improve index quality at the cost of slower indexing. Typical: 100-200.
- ef_search (query-time beam width): Higher values improve recall at the cost of higher latency. Typical: 100-400.
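In Elasticsearch, for example, m and ef_construction are set in the dense_vector field mapping (the field name and dimension count below are illustrative); ef_search has no mapping setting — its role is played by num_candidates at query time:

```json
{
  "mappings": {
    "properties": {
      "title_embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}
```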
IVF (Inverted File Index)
Partitions the vector space into clusters (Voronoi cells). At query time, only the nearest clusters are searched.
Used by: FAISS (Meta), some Solr extensions.
Product Quantization (PQ)
Compresses vectors to reduce memory usage, enabling billion-scale vector search on limited hardware. Trades some accuracy for massive storage savings.
In practice, HNSW is the default choice for most production deployments under 100M vectors. Beyond that, you start combining HNSW with quantization.
Why Keyword Search Still Matters
Here's the contrarian take that most "vector search is the future" articles skip: keyword search is still better at many things.
| Capability | Vector Search | Keyword Search |
|---|---|---|
| Semantic similarity | ✅ Excels at meaning-based matching | ❌ Requires explicit synonyms |
| Vocabulary mismatch | ✅ Handles naturally | ❌ Misses without synonyms |
| Cross-lingual retrieval | ✅ Works with multilingual models | ❌ Requires per-language setup |
| Short, ambiguous queries | ✅ Infers intent from context | ❌ Limited signal to work with |
| Exact matches (codes, IDs) | ❌ Over-generalizes | ✅ Precise and fast |
| Boolean precision | ❌ No native support | ✅ Must/must-not logic |
| Fielded queries | ❌ Flat vector space | ✅ Field-level targeting |
| Interpretability | ❌ Black-box similarity | ✅ Can explain why a result matched |
| Unseen edge cases | ❌ Limited by training data | ✅ No model dependency |
A user searching for the exact product code ABC-123-XYZ doesn't need semantic understanding — they need an exact match. And BM25 handles that perfectly.
Hybrid Search: The Best of Both Worlds
The real answer isn't keyword OR vector. It's both.
Hybrid search combines lexical retrieval (BM25) and semantic retrieval (vector ANN) into a unified ranking pipeline. The most common approach is Reciprocal Rank Fusion (RRF):
- Run BM25 and vector search independently.
- Each produces its own ranked list.
- Merge the lists using RRF, which scores each document based on its rank in both lists:
RRF_score(d) = Σ_i 1 / (k + rank_i(d))
Where k is a constant (typically 60) and rank_i(d) is the document's rank in result list i.
RRF is elegant because it's score-agnostic — you don't need to normalize BM25 scores and vector similarity scores onto the same scale. You just combine ranks.
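The whole fusion fits in a few lines. A sketch, assuming each input list contains document ids ordered best-first with 1-based ranks:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked result lists with Reciprocal Rank Fusion.
    Each list holds doc ids ordered best-first; ranks start at 1."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3", "d4"]   # lexical ranking
knn  = ["d3", "d1", "d5", "d2"]   # vector ranking
print(rrf_fuse([bm25, knn]))      # → ['d1', 'd3', 'd2', 'd5', 'd4']
```

Notice that d1 edges out d3: appearing near the top of both lists beats winning one list outright — exactly the consensus behavior you want from fusion.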
Implementation in Elasticsearch
Elasticsearch supports hybrid search natively through the knn clause combined with traditional query clauses:
{
"query": {
"bool": {
"should": [
{ "match": { "title": "comfortable office chair" } }
]
}
},
"knn": {
"field": "title_embedding",
"query_vector": [0.23, -0.11, ...],
"k": 10,
"num_candidates": 100
},
"rank": {
"rrf": { "rank_constant": 60 }
}
}
When to Weight Lexical vs. Semantic
The optimal balance between BM25 and vector scores depends on your query distribution:
- Mostly exact/product queries → Weight BM25 higher (70/30).
- Mostly natural language/conversational → Weight vectors higher (30/70).
- Mixed traffic → Start at 50/50 and tune from there using A/B testing.
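One way to apply such weights is a normalized linear combination: min-max scale each channel's raw scores onto [0, 1], then blend. A sketch — the 70/30-style split is the w_lexical knob:

```python
def weighted_hybrid_score(bm25_scores, vec_scores, w_lexical=0.5):
    """Min-max normalize each channel's raw scores to [0, 1], then
    blend with a lexical/semantic weight (e.g. 0.7 for product-heavy
    traffic, 0.3 for conversational queries)."""
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0          # guard against identical scores
        return {d: (s - lo) / span for d, s in scores.items()}
    b, v = minmax(bm25_scores), minmax(vec_scores)
    return {d: w_lexical * b.get(d, 0.0) + (1 - w_lexical) * v.get(d, 0.0)
            for d in set(b) | set(v)}
```

Unlike RRF, this approach requires the normalization step — which is exactly the calibration burden RRF lets you avoid, and why RRF is the safer default.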
Production Challenges
Vector search in production introduces challenges that don't exist with traditional keyword search:
Memory and Cost
Vectors are memory-intensive. A 1,024-dimensional float32 vector consumes ~4KB. At 10 million documents, that's 40GB of vector data alone — plus the HNSW graph overhead. Plan your hardware accordingly.
Mitigation strategies:
- Use scalar quantization (float32 → float16 or int8) to halve or quarter memory usage.
- Use product quantization for extreme compression.
- Offload to GPU-accelerated vector stores for billion-scale deployments.
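Scalar quantization from the first bullet is simple to demonstrate — engines like Lucene implement refinements of this internally, but a bare symmetric int8 version looks like:

```python
import numpy as np

def quantize_int8(vectors):
    """Symmetric scalar quantization: float32 -> int8, cutting vector
    memory by 4x at a small cost in similarity accuracy."""
    scale = np.abs(vectors).max() / 127.0
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

vecs = np.random.randn(10_000, 1024).astype(np.float32)
q, scale = quantize_int8(vecs)
print(vecs.nbytes // q.nbytes)   # → 4 (4x memory reduction)
```

The per-element error is bounded by half the scale step — usually negligible for ranking, but worth validating with recall@k on your own judgment set before shipping.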
Embedding Drift
Language evolves. Products change. User vocabulary shifts. The embedding model that worked well at launch may degrade over time because the distribution of queries and documents drifts away from what the model was trained on.
Mitigation: Periodically re-evaluate retrieval quality. Track recall@k and nDCG@k against a judgment set. Fine-tune or swap models when metrics degrade.
Latency
ANN search adds latency compared to inverted-index lookups. On a well-tuned HNSW index, expect 10-50ms per vector query. Combined with BM25 execution, hybrid search latency typically falls in the 50-150ms range — acceptable for most applications, but something to monitor.
Chunking Strategy
For long documents, you typically can't embed the entire document as one vector (embedding models have token limits). You need a chunking strategy:
- Fixed-size chunks (e.g., 512 tokens): Simple but may split concepts across chunks.
- Semantic chunking: Use paragraph breaks, section headers, or sentence boundaries.
- Sliding window: Overlapping chunks to avoid missing concepts at boundaries.
The chunking strategy directly impacts retrieval quality. Too large, and the embedding becomes diluted. Too small, and you lose context.
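A sliding-window chunker is only a few lines. A sketch using plain lists as stand-ins for tokenizer output — 512/64 are illustrative defaults, not recommendations:

```python
def sliding_window_chunks(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size chunks with overlap, so
    concepts spanning a boundary appear intact in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                 # last chunk reached the tail
    return chunks
```

In practice you would run this over tokenizer ids (or sentences), embed each chunk separately, and store a pointer back to the parent document for display.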
Vector Search + RAG
Vector search is the retrieval backbone of Retrieval-Augmented Generation (RAG) — the architecture pattern that grounds LLM responses in factual, indexed content.
In a RAG pipeline:
- User asks a question.
- The question is embedded into a vector.
- Vector search retrieves the top-k most relevant document chunks.
- The chunks are injected into the LLM's prompt as context.
- The LLM generates an answer grounded in the retrieved content.
The quality of the RAG output is directly constrained by the quality of the vector retrieval. Bad retrieval → irrelevant context → hallucinated or wrong answers. This is why I often say: RAG is a search problem, not an LLM problem.
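The pipeline above can be sketched end to end — search and llm here are stand-ins for your actual embed-and-retrieve function and LLM client, not a real API:

```python
def rag_answer(question, search, llm, k=3):
    """Minimal RAG loop: retrieve top-k chunks for the question,
    then ground the LLM's answer in that context."""
    chunks = search(question, k=k)          # embed + vector retrieval
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```

Everything before the final llm call is a search problem — which is why retrieval metrics, not prompt tweaks, are usually the highest-leverage fix for a misbehaving RAG system.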
Where to Start
If you're considering adding vector search to an existing search system:
- Don't replace — augment. Keep your BM25 pipeline. Add vector search as a parallel retrieval channel.
- Start with a general-purpose embedding model (BGE or E5). Evaluate against your real queries before investing in fine-tuning.
- Implement hybrid search with RRF. It's the safest approach and delivers reliable improvements without complex score calibration.
- Monitor retrieval quality. Track recall@k and nDCG@k with a judgment set. Vector search isn't magic — it requires the same measurement discipline as keyword search.
- Budget for memory. Vector search is significantly more resource-intensive than keyword search. Plan capacity before indexing.
The Bottom Line
Vector search isn't replacing keyword search — it's augmenting it. The future is hybrid.
The teams that will win at search relevance in the next decade are the ones that combine the precision of lexical matching with the understanding of semantic search. Neither alone is sufficient. Together, they cover the full spectrum of user intent.