Vector Search: Beyond Keywords
The Keyword Ceiling
For decades, search has been built around one fundamental assumption: users know the right words. Type a keyword, the engine finds exact or fuzzy text matches, done. This model — powered by inverted indexes and scoring models like BM25 — works remarkably well for precise, well-defined queries.
But it breaks down when users don't know the right words. And in practice, they often don't.
Consider these real-world gaps:
- "car" should match documents about "automobile" — but keyword search treats them as unrelated.
- "how to reduce customer churn" should match an article titled "Retention strategies for SaaS businesses" — but there's zero keyword overlap.
- A user searching for "comfortable shoes for standing all day" should find products tagged as "ergonomic footwear with arch support" — but the terms are different.
This is the keyword ceiling. And vector search is what breaks through it.
Semantic Vector Retrieval
Unlike keyword search, vector search maps meaning to geometry: documents are positioned in embedding space according to their conceptual relationships.
What Vector Search Actually Is
Vector search doesn't match words. It matches meaning.
Instead of comparing text strings, vector search works with embeddings — dense numerical representations of text in a high-dimensional space (typically 384 to 3,072 dimensions). In this space, semantically similar texts are positioned close together, regardless of the specific words they use.
The process:
- At index time: Each document (or document chunk) is passed through an embedding model, producing a dense vector. This vector is stored alongside the traditional inverted index.
- At query time: The user's query is passed through the same embedding model, producing a query vector.
- Retrieval: The search engine finds documents whose vectors are closest to the query vector using distance metrics (cosine similarity, dot product, or L2 distance).
The result: a query for "affordable family vacation" can match a document about "budget-friendly trips with kids" — because the meaning is similar, even though the words are different.
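The retrieval step can be made concrete with a brute-force sketch (production systems use ANN indexes instead — more on that below). The toy 4-dimensional vectors here stand in for real embedding-model output; cosine similarity ranks documents against the query:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents whose embeddings are
    closest to the query, by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                   # cosine similarity per document
    return np.argsort(-sims)[:k]   # highest similarity first

# Toy 4-dimensional "embeddings" standing in for real model output
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # "budget-friendly trips with kids"
    [0.1, 0.9, 0.0, 0.0],   # "enterprise cloud migration"
    [0.8, 0.2, 0.1, 0.0],   # "cheap family holiday ideas"
])
query = np.array([0.85, 0.15, 0.05, 0.0])  # "affordable family vacation"

print(cosine_top_k(query, docs, k=2))  # → [0 2]
```

Note that document 1 never surfaces despite sharing the same vector space — its embedding simply points in a different direction than the query's.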
How Embeddings Work
Embedding models are trained (usually via transformer architectures) to map text into vector spaces where semantic similarity corresponds to geometric proximity.
The Embedding Pipeline
"comfortable office chair for long hours"
│
▼
┌─────────────────────┐
│ Embedding Model │
│ (e.g., E5, BGE, │
│ all-MiniLM) │
└─────────────────────┘
│
▼
[0.23, -0.11, 0.87, 0.45, ..., -0.33] ← 384-1536 dimensions
Choosing an Embedding Model
The embedding model you choose directly impacts relevance quality. Key considerations:
| Model | Dimensions | Strengths | Trade-offs |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast, small, good for general use | Lower accuracy on domain-specific content |
| E5-large-v2 | 1024 | Strong retrieval quality | Larger, slower; needs "query:"/"passage:" prefixes |
| BGE-large-en | 1024 | Top-tier English retrieval benchmarks | Requires a query instruction prefix |
| OpenAI text-embedding-3-large | 3072 | Very high quality | API dependency, cost at scale |
| Cohere embed-v3 | 1024 | Strong multilingual, search-optimized | API dependency |
Critical insight from production: Small changes in embedding models produce large changes in relevance. I've seen teams switch from all-MiniLM to BGE-large and see a 15-20% improvement in retrieval recall without changing anything else. The model matters more than most tuning parameters.
Domain-Specific Fine-tuning
General-purpose embedding models are trained on broad internet text. They work well for common language but struggle with domain-specific vocabulary — medical terminology, legal jargon, automotive part numbers, or real estate descriptions.
Fine-tuning an embedding model on your domain data can dramatically improve relevance. The typical approach:
- Collect query-document pairs from your search logs (queries + the documents users actually clicked).
- Generate hard negatives — documents that are superficially similar but not relevant.
- Fine-tune using contrastive learning — train the model to push relevant pairs closer and irrelevant pairs apart.
Even 5,000-10,000 training pairs can produce meaningful improvements for domain-specific retrieval.
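The contrastive objective in step 3 can be illustrated with a toy InfoNCE loss. This is a sketch, not a training loop — cosine similarity and a temperature of 0.05 are common but model-specific choices:

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE contrastive loss: small when the query sits close to
    the positive document and far from the hard negatives."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    q = unit(query)
    candidates = unit(np.vstack([positive[None, :], negatives]))
    logits = candidates @ q / temperature   # positive is row 0
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # cross-entropy on row 0
```

Minimizing this loss over thousands of (query, positive, hard-negative) triples is what pulls relevant pairs together and pushes irrelevant ones apart.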
Approximate Nearest Neighbor (ANN) Search
Exact nearest-neighbor search in high-dimensional space is computationally prohibitive at scale. Searching through millions of vectors by computing distance to every single one is too slow.
ANN algorithms trade a small amount of accuracy for dramatic speed improvements:
HNSW (Hierarchical Navigable Small World)
The dominant algorithm in production search engines. HNSW builds a multi-layer graph where:
- The top layer is sparse — large jumps between distant nodes.
- Lower layers are progressively denser — fine-grained navigation.
- Search starts at the top and "descends" through layers, narrowing toward the nearest neighbors.
Used by: Elasticsearch, OpenSearch, Lucene (the engine underneath both).
Tuning parameters:
- m (max connections per node): Higher values improve recall but use more memory. Typical: 16-64.
- ef_construction (build-time beam width): Higher values improve index quality at the cost of slower indexing. Typical: 100-200.
- ef_search (query-time beam width): Higher values improve recall at the cost of higher latency. Typical: 100-400.
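In Elasticsearch, for example, m and ef_construction are set in the dense_vector field mapping (the field name and dimension count below are illustrative); ef_search has no mapping setting — its role is played by num_candidates at query time:

```json
{
  "mappings": {
    "properties": {
      "title_embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}
```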
IVF (Inverted File Index)
Partitions the vector space into clusters (Voronoi cells). At query time, only the nearest clusters are searched.
Used by: FAISS (Meta), some Solr extensions.
Product Quantization (PQ)
Compresses vectors to reduce memory usage, enabling billion-scale vector search on limited hardware. Trades some accuracy for massive storage savings.
In practice, HNSW is the default choice for most production deployments under 100M vectors. Beyond that, you start combining HNSW with quantization.
Why Keyword Search Still Matters
Here's the contrarian take that most "vector search is the future" articles skip: keyword search is still better at many things.
| Capability | Vector Search | Keyword Search |
|---|---|---|
| Semantic similarity | ✅ Excels at meaning-based matching | ❌ Requires explicit synonyms |
| Vocabulary mismatch | ✅ Handles naturally | ❌ Misses without synonyms |
| Cross-lingual retrieval | ✅ Works with multilingual models | ❌ Requires per-language setup |
| Short, ambiguous queries | ✅ Infers intent from context | ❌ Limited signal to work with |
| Exact matches (codes, IDs) | ❌ Over-generalizes | ✅ Precise and fast |
| Boolean precision | ❌ No native support | ✅ Must/must-not logic |
| Fielded queries | ❌ Flat vector space | ✅ Field-level targeting |
| Interpretability | ❌ Black-box similarity | ✅ Can explain why a result matched |
| Unseen edge cases | ❌ Limited by training data | ✅ No model dependency |
A user searching for the exact product code ABC-123-XYZ doesn't need semantic understanding — they need an exact match. And BM25 handles that perfectly.
Hybrid Search: The Best of Both Worlds
The real answer isn't keyword OR vector. It's both.
Hybrid search combines lexical retrieval (BM25) and semantic retrieval (vector ANN) into a unified ranking pipeline. The most common approach is Reciprocal Rank Fusion (RRF):
- Run BM25 and vector search independently.
- Each produces its own ranked list.
- Merge the lists using RRF, which scores each document based on its rank in both lists:
RRF_score(d) = Σ_i 1 / (k + rank_i(d))
Where k is a constant (typically 60) and rank_i(d) is the document's rank in result list i.
RRF is elegant because it's score-agnostic — you don't need to normalize BM25 scores and vector similarity scores onto the same scale. You just combine ranks.
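The whole fusion fits in a few lines. A sketch, assuming each input list contains document ids ordered best-first with 1-based ranks:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked result lists with Reciprocal Rank Fusion.
    Each list holds doc ids ordered best-first; ranks start at 1."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3", "d4"]   # lexical ranking
knn  = ["d3", "d1", "d5", "d2"]   # vector ranking
print(rrf_fuse([bm25, knn]))      # → ['d1', 'd3', 'd2', 'd5', 'd4']
```

Notice that d1 edges out d3: appearing near the top of both lists beats winning one list outright — exactly the consensus behavior you want from fusion.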
Implementation in Elasticsearch
Elasticsearch supports hybrid search natively through the knn clause combined with traditional query clauses:
{
"query": {
"bool": {
"should": [
{ "match": { "title": "comfortable office chair" } }
]
}
},
"knn": {
"field": "title_embedding",
"query_vector": [0.23, -0.11, ...],
"k": 10,
"num_candidates": 100
},
"rank": {
"rrf": { "rank_constant": 60 }
}
}
When to Weight Lexical vs. Semantic
The optimal balance between BM25 and vector scores depends on your query distribution:
- Mostly exact/product queries → Weight BM25 higher (70/30).
- Mostly natural language/conversational → Weight vectors higher (30/70).
- Mixed traffic → Start at 50/50 and tune from there using A/B testing.
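One way to apply such weights is a normalized linear combination: min-max scale each channel's raw scores onto [0, 1], then blend. A sketch — the 70/30-style split is the w_lexical knob:

```python
def weighted_hybrid_score(bm25_scores, vec_scores, w_lexical=0.5):
    """Min-max normalize each channel's raw scores to [0, 1], then
    blend with a lexical/semantic weight (e.g. 0.7 for product-heavy
    traffic, 0.3 for conversational queries)."""
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0          # guard against identical scores
        return {d: (s - lo) / span for d, s in scores.items()}
    b, v = minmax(bm25_scores), minmax(vec_scores)
    return {d: w_lexical * b.get(d, 0.0) + (1 - w_lexical) * v.get(d, 0.0)
            for d in set(b) | set(v)}
```

Unlike RRF, this approach requires the normalization step — which is exactly the calibration burden RRF lets you avoid, and why RRF is the safer default.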
Production Challenges
Vector search in production introduces challenges that don't exist with traditional keyword search:
Memory and Cost
Vectors are memory-intensive. A 1,024-dimensional float32 vector consumes ~4KB. At 10 million documents, that's 40GB of vector data alone — plus the HNSW graph overhead. Plan your hardware accordingly.
Mitigation strategies:
- Use scalar quantization (float32 → float16 or int8) to halve or quarter memory usage.
- Use product quantization for extreme compression.
- Offload to GPU-accelerated vector stores for billion-scale deployments.
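Scalar quantization from the first bullet is simple to demonstrate — engines like Lucene implement refinements of this internally, but a bare symmetric int8 version looks like:

```python
import numpy as np

def quantize_int8(vectors):
    """Symmetric scalar quantization: float32 -> int8, cutting vector
    memory by 4x at a small cost in similarity accuracy."""
    scale = np.abs(vectors).max() / 127.0
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

vecs = np.random.randn(10_000, 1024).astype(np.float32)
q, scale = quantize_int8(vecs)
print(vecs.nbytes // q.nbytes)   # → 4 (4x memory reduction)
```

The per-element error is bounded by half the scale step — usually negligible for ranking, but worth validating with recall@k on your own judgment set before shipping.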
Embedding Drift
Language evolves. Products change. User vocabulary shifts. The embedding model that worked well at launch may degrade over time because the distribution of queries and documents drifts away from what the model was trained on.
Mitigation: Periodically re-evaluate retrieval quality. Track recall@k and nDCG@k against a judgment set. Fine-tune or swap models when metrics degrade.
Latency
ANN search adds latency compared to inverted-index lookups. On a well-tuned HNSW index, expect 10-50ms per vector query. Combined with BM25 execution, hybrid search latency typically falls in the 50-150ms range — acceptable for most applications, but something to monitor.
Chunking Strategy
For long documents, you typically can't embed the entire document as one vector (embedding models have token limits). You need a chunking strategy:
- Fixed-size chunks (e.g., 512 tokens): Simple but may split concepts across chunks.
- Semantic chunking: Use paragraph breaks, section headers, or sentence boundaries.
- Sliding window: Overlapping chunks to avoid missing concepts at boundaries.
The chunking strategy directly impacts retrieval quality. Too large, and the embedding becomes diluted. Too small, and you lose context.
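A sliding-window chunker is only a few lines. A sketch using plain lists as stand-ins for tokenizer output — 512/64 are illustrative defaults, not recommendations:

```python
def sliding_window_chunks(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size chunks with overlap, so
    concepts spanning a boundary appear intact in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                 # last chunk reached the tail
    return chunks
```

In practice you would run this over tokenizer ids (or sentences), embed each chunk separately, and store a pointer back to the parent document for display.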
Vector Search + RAG
Vector search is the retrieval backbone of Retrieval-Augmented Generation (RAG) — the architecture pattern that grounds LLM responses in factual, indexed content.
In a RAG pipeline:
- User asks a question.
- The question is embedded into a vector.
- Vector search retrieves the top-k most relevant document chunks.
- The chunks are injected into the LLM's prompt as context.
- The LLM generates an answer grounded in the retrieved content.
The quality of the RAG output is directly constrained by the quality of the vector retrieval. Bad retrieval → irrelevant context → hallucinated or wrong answers. This is why I often say: RAG is a search problem, not an LLM problem.
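The pipeline above can be sketched end to end — search and llm here are stand-ins for your actual embed-and-retrieve function and LLM client, not a real API:

```python
def rag_answer(question, search, llm, k=3):
    """Minimal RAG loop: retrieve top-k chunks for the question,
    then ground the LLM's answer in that context."""
    chunks = search(question, k=k)          # embed + vector retrieval
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```

Everything before the final llm call is a search problem — which is why retrieval metrics, not prompt tweaks, are usually the highest-leverage fix for a misbehaving RAG system.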
Where to Start
If you're considering adding vector search to an existing search system:
- Don't replace — augment. Keep your BM25 pipeline. Add vector search as a parallel retrieval channel.
- Start with a general-purpose embedding model (BGE or E5). Evaluate against your real queries before investing in fine-tuning.
- Implement hybrid search with RRF. It's the safest approach and delivers reliable improvements without complex score calibration.
- Monitor retrieval quality. Track recall@k and nDCG@k with a judgment set. Vector search isn't magic — it requires the same measurement discipline as keyword search.
- Budget for memory. Vector search is significantly more resource-intensive than keyword search. Plan capacity before indexing.
The Bottom Line
Vector search isn't replacing keyword search — it's augmenting it. The future is hybrid.
The teams that will win at search relevance in the next decade are the ones that combine the precision of lexical matching with the understanding of semantic search. Neither alone is sufficient. Together, they cover the full spectrum of user intent.