Vector Search: Beyond Keywords

Published Mar 14, 2026

The Keyword Ceiling

For decades, search has been built around one fundamental assumption: users know the right words. Type a keyword, the engine finds exact or fuzzy text matches, done. This model — powered by inverted indexes and scoring models like BM25 — works remarkably well for precise, well-defined queries.

But it breaks down when users don't know the right words. And in practice, they often don't.

Consider these real-world gaps:

  • "car" should match documents about "automobile" — but keyword search treats them as unrelated.
  • "how to reduce customer churn" should match an article titled "Retention strategies for SaaS businesses" — but there's zero keyword overlap.
  • A user searching for "comfortable shoes for standing all day" should find products tagged as "ergonomic footwear with arch support" — but the terms are different.

This is the keyword ceiling. And vector search is what breaks through it.

Semantic Vector Retrieval

An example of what semantic retrieval looks like in practice. For the query "affordable family vacation" (detected concepts: frugality + family), the top nearest neighbors in a 1,536-dimension embedding space, using cosine similarity over an HNSW index, might be:

  1. "budget-friendly trips with kids" — similarity 0.98
  2. "cheap summer getaways" — similarity 0.96
  3. "family travel deals" — similarity 0.94

A semantically distant document, such as "luxury business travel", sits far away in the same space.

What Vector Search Actually Is

Vector search doesn't match words. It matches meaning.

Instead of comparing text strings, vector search works with embeddings — dense numerical representations of text in a high-dimensional space (typically 384 to 1,536 dimensions). In this space, semantically similar texts are positioned close together, regardless of the specific words they use.

The process:

  1. At index time: Each document (or document chunk) is passed through an embedding model, producing a dense vector. This vector is stored alongside the traditional inverted index.
  2. At query time: The user's query is passed through the same embedding model, producing a query vector.
  3. Retrieval: The search engine finds documents whose vectors are closest to the query vector using distance metrics (cosine similarity, dot product, or L2 distance).

The result: a query for "affordable family vacation" can match a document about "budget-friendly trips with kids" — because the meaning is similar, even though the words are different.
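The three steps above can be sketched in a few lines. This is a toy illustration, not a production implementation: the 4-dimensional vectors are invented stand-ins for what a real embedding model would produce, chosen so the "similar" pair points in roughly the same direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings
# (a production model emits 384 to 1,536 dimensions).
query_vec = [0.9, 0.8, 0.1, 0.0]   # "affordable family vacation"
doc_close = [0.8, 0.9, 0.2, 0.1]   # "budget-friendly trips with kids"
doc_far   = [0.1, 0.0, 0.9, 0.8]   # "luxury business travel"

print(cosine_similarity(query_vec, doc_close))  # high, ~0.99
print(cosine_similarity(query_vec, doc_far))    # low,  ~0.12
```

The only thing that changes at scale is the retrieval step: instead of comparing against every stored vector, an ANN index narrows the search.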

How Embeddings Work

Embedding models are trained (usually via transformer architectures) to map text into vector spaces where semantic similarity corresponds to geometric proximity.

The Embedding Pipeline

"comfortable office chair for long hours"
        │
        ▼
┌─────────────────────┐
│  Embedding Model    │
│  (e.g., E5, BGE,    │
│   all-MiniLM)       │
└─────────────────────┘
        │
        ▼
[0.23, -0.11, 0.87, 0.45, ..., -0.33]  ← 384-1536 dimensions

Choosing an Embedding Model

The embedding model you choose directly impacts relevance quality. Key considerations:

Model                         | Dimensions | Strengths                              | Trade-offs
all-MiniLM-L6-v2              | 384        | Fast, small, good for general use      | Lower accuracy on domain-specific content
E5-large-v2                   | 1024       | Strong general-purpose retrieval       | Larger, slower inference
BGE-large-en                  | 1024       | Excellent English retrieval quality    | Requires careful prompt formatting
OpenAI text-embedding-3-large | 3072       | Very high quality                      | API dependency, cost at scale
Cohere embed-v3               | 1024       | Strong multilingual, search-optimized  | API dependency

Critical insight from production: Small changes in embedding models produce large changes in relevance. I've seen teams switch from all-MiniLM to BGE-large and see a 15-20% improvement in retrieval recall without changing anything else. The model matters more than most tuning parameters.

Domain-Specific Fine-tuning

General-purpose embedding models are trained on broad internet text. They work well for common language but struggle with domain-specific vocabulary — medical terminology, legal jargon, automotive part numbers, or real estate descriptions.

Fine-tuning an embedding model on your domain data can dramatically improve relevance. The typical approach:

  1. Collect query-document pairs from your search logs (queries + the documents users actually clicked).
  2. Generate hard negatives — documents that are superficially similar but not relevant.
  3. Fine-tune using contrastive learning — train the model to push relevant pairs closer and irrelevant pairs apart.

Even 5,000-10,000 training pairs can produce meaningful improvements for domain-specific retrieval.
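The contrastive objective in step 3 can be made concrete with the InfoNCE loss, the formulation most embedding fine-tuning recipes build on. This is a minimal sketch of the loss for a single training example, assuming the query, positive, and hard-negative embeddings are already L2-normalized; the 2-dimensional vectors are invented for illustration.

```python
import math

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE contrastive loss for one training example:
    pull the positive document toward the query, push negatives away.
    Vectors are assumed L2-normalized, so dot product = cosine similarity."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity of the query to the positive and to each hard negative.
    logits = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [s / temperature for s in logits]
    # Softmax cross-entropy with the positive at index 0.
    max_s = max(scaled)
    exps = [math.exp(s - max_s) for s in scaled]
    return -math.log(exps[0] / sum(exps))

q   = [1.0, 0.0]
pos = [0.9, 0.44]   # relevant document, close to the query
neg = [0.0, 1.0]    # hard negative: superficially plausible, not relevant
print(info_nce_loss(q, pos, [neg]))  # near zero: positive already ranks first
```

Minimizing this loss over many (query, positive, hard negatives) triples is what reshapes the embedding space around your domain's vocabulary.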

Approximate Nearest Neighbor (ANN) Search

Exact nearest-neighbor search in high-dimensional space is computationally prohibitive at scale. Searching through millions of vectors by computing distance to every single one is too slow.

ANN algorithms trade a small amount of accuracy for dramatic speed improvements:

HNSW (Hierarchical Navigable Small World)

The dominant algorithm in production search engines. HNSW builds a multi-layer graph where:

  • The top layer is sparse — large jumps between distant nodes.
  • Lower layers are progressively denser — fine-grained navigation.
  • Search starts at the top and "descends" through layers, narrowing toward the nearest neighbors.

Used by: Elasticsearch, OpenSearch, Lucene (the engine underneath both).

Tuning parameters:

  • m (max connections per node): Higher values improve recall but use more memory. Typical: 16-64.
  • ef_construction (build-time beam width): Higher values improve index quality at the cost of slower indexing. Typical: 100-200.
  • ef_search (query-time beam width): Higher values improve recall at the cost of higher latency. Typical: 100-400.

IVF (Inverted File Index)

Partitions the vector space into clusters (Voronoi cells). At query time, only the nearest clusters are searched.

Used by: FAISS (Meta), some Solr extensions.
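The IVF idea fits in a few lines. A minimal sketch with hand-made clusters and centroids (a real implementation learns the centroids with k-means and probes many cells):

```python
def ivf_search(query, centroids, clusters, n_probe=1):
    """Toy IVF: search only the n_probe clusters whose centroids are
    nearest to the query, instead of scanning every vector."""
    def dist(a, b):  # squared L2 distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank cells by centroid distance and probe only the closest ones.
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = []
    for i in order[:n_probe]:
        candidates.extend(clusters[i])
    # Exact search within the probed cells only.
    return min(candidates, key=lambda item: dist(query, item[1]))

# Two hand-made Voronoi cells of (id, vector) pairs.
centroids = [[0.0, 0.0], [10.0, 10.0]]
clusters = [
    [("a", [0.1, 0.2]), ("b", [0.3, 0.1])],    # cell around (0, 0)
    [("c", [9.8, 10.1]), ("d", [10.2, 9.9])],  # cell around (10, 10)
]
print(ivf_search([0.2, 0.2], centroids, clusters))  # → ("a", [0.1, 0.2])
```

The accuracy/speed trade-off lives in `n_probe`: probe more cells and recall rises, but so does the number of distance computations.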

Product Quantization (PQ)

Compresses vectors to reduce memory usage, enabling billion-scale vector search on limited hardware. Trades some accuracy for massive storage savings.

In practice, HNSW is the default choice for most production deployments under 100M vectors. Beyond that, you start combining HNSW with quantization.

Why Keyword Search Still Matters

Here's the contrarian take that most "vector search is the future" articles skip: keyword search is still better at many things.

Capability                 | Vector Search                        | Keyword Search
Semantic similarity        | ✅ Excels at meaning-based matching  | ❌ Requires explicit synonyms
Vocabulary mismatch        | ✅ Handles naturally                 | ❌ Misses without synonyms
Cross-lingual retrieval    | ✅ Works with multilingual models    | ❌ Requires per-language setup
Short, ambiguous queries   | ✅ Infers intent from context        | ❌ Limited signal to work with
Exact matches (codes, IDs) | ❌ Over-generalizes                  | ✅ Precise and fast
Boolean precision          | ❌ No native support                 | ✅ Must/must-not logic
Fielded queries            | ❌ Flat vector space                 | ✅ Field-level targeting
Interpretability           | ❌ Black-box similarity              | ✅ Can explain why a result matched
Unseen edge cases          | ❌ Limited by training data          | ✅ No model dependency

A user searching for the exact product code ABC-123-XYZ doesn't need semantic understanding — they need an exact match. And BM25 handles that perfectly.

Hybrid Search: The Best of Both Worlds

The real answer isn't keyword OR vector. It's both.

Hybrid search combines lexical retrieval (BM25) and semantic retrieval (vector ANN) into a unified ranking pipeline. The most common approach is Reciprocal Rank Fusion (RRF):

  1. Run BM25 and vector search independently.
  2. Each produces its own ranked list.
  3. Merge the lists using RRF, which scores each document based on its rank in both lists:
RRF_score(d) = Σ 1 / (k + rank_i(d))

Where k is a constant (typically 60) and rank_i(d) is the document's rank in result list i.

RRF is elegant because it's score-agnostic — you don't need to normalize BM25 scores and vector similarity scores onto the same scale. You just combine ranks.
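The whole algorithm is small enough to show in full. A minimal sketch, with invented document IDs for illustration:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge ranked result lists using only ranks.
    Each input list is an ordered sequence of document IDs (best first)."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results   = ["doc_exact", "doc_a", "doc_b"]
vector_results = ["doc_a", "doc_c", "doc_exact"]
# doc_a wins: it ranks highly in BOTH lists, which RRF rewards.
print(rrf_fuse([bm25_results, vector_results]))
```

Note that raw BM25 scores and cosine similarities never appear — only positions in each list — which is exactly why no score normalization is needed.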

Implementation in Elasticsearch

Elasticsearch supports hybrid search natively through the knn clause combined with traditional query clauses:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "comfortable office chair" } }
      ]
    }
  },
  "knn": {
    "field": "title_embedding",
    "query_vector": [0.23, -0.11, ...],
    "k": 10,
    "num_candidates": 100
  },
  "rank": {
    "rrf": { "rank_constant": 60 }
  }
}

When to Weight Lexical vs. Semantic

The optimal balance between BM25 and vector scores depends on your query distribution:

  • Mostly exact/product queries → Weight BM25 higher (70/30).
  • Mostly natural language/conversational → Weight vectors higher (30/70).
  • Mixed traffic → Start at 50/50 and tune from there using A/B testing.
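One way to implement an explicit weighting (which plain RRF doesn't expose) is a convex combination of min-max-normalized scores. A sketch, with invented document IDs and scores for illustration:

```python
def blend_scores(bm25_scores, vector_scores, lexical_weight=0.5):
    """Rank documents by a weighted blend of min-max-normalized
    BM25 and vector scores. lexical_weight=0.7 means '70/30 BM25'."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    b, v = normalize(bm25_scores), normalize(vector_scores)
    w = lexical_weight
    return sorted(
        set(b) | set(v),
        key=lambda d: w * b.get(d, 0.0) + (1 - w) * v.get(d, 0.0),
        reverse=True,
    )

bm25   = {"exact_sku": 12.4, "chair_guide": 3.1}
vector = {"chair_guide": 0.91, "ergonomic_post": 0.85, "exact_sku": 0.40}
print(blend_scores(bm25, vector, lexical_weight=0.7))  # BM25-heavy ranking
```

The normalization step is the fragile part — BM25 scores are unbounded and query-dependent — which is why RRF's rank-only fusion is the safer default.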

Production Challenges

Vector search in production introduces challenges that don't exist with traditional keyword search:

Memory and Cost

Vectors are memory-intensive. A 1,024-dimensional float32 vector consumes ~4KB. At 10 million documents, that's 40GB of vector data alone — plus the HNSW graph overhead. Plan your hardware accordingly.

Mitigation strategies:

  • Use scalar quantization (float32 → float16 or int8) to halve or quarter memory usage.
  • Use product quantization for extreme compression.
  • Offload to GPU-accelerated vector stores for billion-scale deployments.
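To make the first mitigation concrete, here is a sketch of symmetric scalar quantization from float to int8, storing one scale factor per vector for dequantization. Production engines do this internally; this toy version just shows the mechanism and the accuracy loss.

```python
def quantize_int8(vector):
    """Symmetric scalar quantization: float32 -> int8 (4x memory reduction).
    Keeps one float scale per vector so values can be approximately restored."""
    scale = max(abs(x) for x in vector) / 127.0 or 1.0
    quantized = [round(x / scale) for x in vector]  # ints in [-127, 127]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

vec = [0.23, -0.11, 0.87, 0.45]
q, s = quantize_int8(vec)
approx = dequantize(q, s)
# Reconstruction is close but not exact: accuracy traded for memory.
print(q, approx)
```

At 1,024 dimensions, this turns ~4KB per vector into ~1KB plus a 4-byte scale.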

Embedding Drift

Language evolves. Products change. User vocabulary shifts. The embedding model that worked well at launch may degrade over time because the distribution of queries and documents drifts away from what the model was trained on.

Mitigation: Periodically re-evaluate retrieval quality. Track recall@k and nDCG@k against a judgment set. Fine-tune or swap models when metrics degrade.
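The recall@k metric itself is trivial to compute once you have a judgment set. A sketch, with a hypothetical query and invented document IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Hypothetical judgment set: docs a human judged relevant for the query.
relevant_docs = {"doc_a", "doc_b", "doc_c"}
retrieved_docs = ["doc_a", "doc_x", "doc_b", "doc_y"]

print(recall_at_k(retrieved_docs, relevant_docs, k=4))  # 2 of 3 found
```

Tracking this number per model version is what turns "the new embedding model feels better" into a measurable claim.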

Latency

ANN search adds latency compared to inverted-index lookups. On a well-tuned HNSW index, expect 10-50ms per vector query. Combined with BM25 execution, hybrid search latency typically falls in the 50-150ms range — acceptable for most applications, but something to monitor.

Chunking Strategy

For long documents, you typically can't embed the entire document as one vector (embedding models have token limits). You need a chunking strategy:

  • Fixed-size chunks (e.g., 512 tokens): Simple but may split concepts across chunks.
  • Semantic chunking: Use paragraph breaks, section headers, or sentence boundaries.
  • Sliding window: Overlapping chunks to avoid missing concepts at boundaries.

The chunking strategy directly impacts retrieval quality. Too large, and the embedding becomes diluted. Too small, and you lose context.
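The sliding-window strategy can be sketched over a token list (here a list of integers standing in for a tokenized document; a real pipeline would use the embedding model's own tokenizer):

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into overlapping fixed-size chunks so that
    concepts near a chunk boundary appear whole in at least one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(1200))  # stand-in for a 1,200-token document
chunks = sliding_window_chunks(tokens, chunk_size=512, overlap=64)
print([len(c) for c in chunks])  # each chunk shares 64 tokens with the next
```

Each chunk then gets its own embedding, and at query time you retrieve chunks, not whole documents.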

Vector Search + RAG

Vector search is the retrieval backbone of Retrieval-Augmented Generation (RAG) — the architecture pattern that grounds LLM responses in factual, indexed content.

In a RAG pipeline:

  1. User asks a question.
  2. The question is embedded into a vector.
  3. Vector search retrieves the top-k most relevant document chunks.
  4. The chunks are injected into the LLM's prompt as context.
  5. The LLM generates an answer grounded in the retrieved content.

The quality of the RAG output is directly constrained by the quality of the vector retrieval. Bad retrieval → irrelevant context → hallucinated or wrong answers. This is why I often say: RAG is a search problem, not an LLM problem.
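The five-step pipeline above fits in one function. This is a sketch only: `embed`, `generate`, and the in-memory `index` are stand-ins for a real embedding model, LLM, and vector store, with hand-made 2-dimensional vectors.

```python
def rag_answer(question, index, embed, generate, top_k=3):
    """Minimal RAG loop: embed the question, retrieve top-k chunks,
    inject them as context, and ask the LLM for a grounded answer."""
    query_vec = embed(question)

    def score(item):  # similarity via dot product
        return sum(a * b for a, b in zip(query_vec, item["vector"]))

    chunks = sorted(index, key=score, reverse=True)[:top_k]
    context = "\n".join(c["text"] for c in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

# Stub components for illustration only.
index = [
    {"text": "Returns are accepted within 30 days.", "vector": [0.9, 0.1]},
    {"text": "Our offices are closed on Sundays.",  "vector": [0.1, 0.9]},
]
embed = lambda text: [0.8, 0.2]                   # pretend query embedding
generate = lambda prompt: prompt.splitlines()[1]  # echo top chunk as "answer"
print(rag_answer("What is the refund policy?", index, embed, generate, top_k=1))
```

Notice that swapping the stub `generate` for a real LLM changes nothing about the retrieval half — which is exactly the half that determines whether the answer is grounded.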

Where to Start

If you're considering adding vector search to an existing search system:

  1. Don't replace — augment. Keep your BM25 pipeline. Add vector search as a parallel retrieval channel.
  2. Start with a general-purpose embedding model (BGE or E5). Evaluate against your real queries before investing in fine-tuning.
  3. Implement hybrid search with RRF. It's the safest approach and delivers reliable improvements without complex score calibration.
  4. Monitor retrieval quality. Track recall@k and nDCG@k with a judgment set. Vector search isn't magic — it requires the same measurement discipline as keyword search.
  5. Budget for memory. Vector search is significantly more resource-intensive than keyword search. Plan capacity before indexing.

The Bottom Line

Vector search isn't replacing keyword search — it's augmenting it. The future is hybrid.

The teams that will win at search relevance in the next decade are the ones that combine the precision of lexical matching with the understanding of semantic search. Neither alone is sufficient. Together, they cover the full spectrum of user intent.

Said Bouigherdaine