Stopwords are Not as Harmless as They Look

Published Mar 15, 2026

The Default That Breaks Things

Words like "the," "in," "at," "of," "to," "is" — search engines call them stopwords and most systems strip them out by default. The reasoning seems sound: these words appear in nearly every document, carry minimal semantic weight, and inflate index size. Removing them makes the index smaller and queries faster.

But in search relevance, removing them can break real queries in ways that are immediately visible to users and silently devastating to metrics.

When Stopwords Carry Meaning

The assumption behind stopword removal is that these words are informationally empty — noise that dilutes relevance scores. That assumption is wrong more often than most search engineers realize.

Example 1: "The Office"

  • With stopwords preserved: Matches the TV show The Office.
  • With stopwords removed: Becomes just "office" → matches office furniture, office supplies, coworking spaces, corporate dashboards.

The word "the" here is not filler. It's a disambiguator. It transforms a generic noun into a specific proper noun.

Example 2: "What to Expect When You're Expecting"

  • With stopwords preserved: Matches the book/movie title exactly.
  • With stopwords removed: Becomes "expect expecting" — which could match anything about project expectations, delivery timelines, or weather forecasts.

Example 3: "Not Waterproof"

  • With stopwords preserved: Correctly indicates a negation — the user wants products that are NOT waterproof, or is checking if something lacks waterproofing.
  • With stopwords removed: Becomes "waterproof" — the exact opposite of the user's intent.

Example 4: "The Who" / "The The" / "IT"

Band names and common abbreviations are particularly vulnerable. "IT" as an industry term gets stripped because "it" is on most stopword lists. "The Who" collapses to just "who," and "The The" disappears entirely, since both of its tokens are stopwords.

Example 5: Phrase Queries

Stopword removal is especially destructive for phrase queries. A user searching for the exact phrase "to be or not to be" loses the entire query to stopword removal — every single word is on most stopword lists.
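To make this concrete, here is a minimal sketch of a whitespace tokenizer with a stopword filter, using a hypothetical subset of a typical English stopword list. Every token of the Hamlet query is on the list, so the analyzed query is empty:

```python
# Hypothetical subset of a typical English analyzer's stopword list.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "an", "of", "in", "at", "is"}

def analyze(query: str) -> list[str]:
    """Lowercase, split on whitespace, drop stopwords."""
    return [t for t in query.lower().split() if t not in STOPWORDS]

print(analyze("to be or not to be"))   # -> [] (the entire query is consumed)
print(analyze("the office"))           # -> ['office'] (disambiguator lost)
```

An empty analyzed query typically degrades to a match-all or a zero-result response, depending on the engine's configuration, and neither is what the user asked for.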

The Problem at Scale

These aren't edge cases. In a large-scale search system serving millions of queries per day, even a 1% failure rate from stopword removal represents tens of thousands of broken user experiences daily.

Here's what the data typically shows when you audit stopword-related query failures:

Query Pattern | Failure Mode | Impact
Proper nouns with articles ("The Mandalorian") | Disambiguation lost | Wrong results
Negations ("not included") | Meaning inverted | Dangerously wrong results
Prepositions as differentiators ("in" vs. "on") | Spatial/contextual meaning lost | Irrelevant results
Song/book/movie titles | Complete title destroyed | Zero or wrong results
Technical abbreviations ("IT", "OR", "AS") | Terms stripped entirely | Missing results

How Scoring Models Handle Stopwords

Even without explicit removal, scoring models like BM25 naturally de-weight stopwords through their Inverse Document Frequency (IDF) component.

The IDF formula:

IDF(term) = log(1 + (N - n + 0.5) / (n + 0.5))

Where N is the total number of documents and n is the number of documents containing the term.

A stopword like "the" appears in nearly every document, so n ≈ N, making IDF close to zero. This means BM25 already assigns near-zero weight to stopwords — without removing them.

This is a crucial insight: BM25 already solves the scoring problem that stopword removal was designed to fix. The remaining justification is storage and performance optimization, not relevance.
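You can verify this directly. A small sketch of the Lucene-style IDF above, plugged with hypothetical document counts, shows a near-ubiquitous term scoring close to zero while a rare term scores high:

```python
import math

def bm25_idf(N: int, n: int) -> float:
    """Lucene-style BM25 IDF: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 1_000_000                        # total documents (hypothetical index)
print(bm25_idf(N, n=950_000))        # "the", in 95% of docs -> ~0.05
print(bm25_idf(N, n=50))             # a rare term -> ~9.9
```

The stopword contributes almost nothing to the score even though it remains in the index and stays available for phrase matching.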

The Performance Argument

The traditional argument for stopword removal is performance:

  1. Index size: Stopwords appear in nearly every document, so their posting lists are enormous. Removing them reduces index size significantly (20-30% in some cases).
  2. Query speed: Evaluating a term that matches 90% of documents is expensive. Removing it from the query avoids that cost.

These were compelling arguments when hardware was expensive and indexes were stored on spinning disks. In 2026, with modern SSDs, ample RAM, and efficient compression algorithms (like Lucene's block-based encoding), the performance benefits of stopword removal are much smaller than they used to be.

Modern mitigations:

  • LZ4/DEFLATE compression on posting lists dramatically reduces the storage overhead of high-frequency terms.
  • Block-max WAND (used in Lucene 9+) skips over low-scoring documents efficiently, so evaluating a stopword term doesn't scan the entire posting list.
  • Phrase queries with slop can use positional data to match phrases accurately, but only if stopwords are preserved in the index with positions.

The Right Approach: Nuanced Stopword Handling

Instead of the binary choice of "remove all stopwords" or "keep all stopwords," use a nuanced, domain-specific strategy.

1. Customize Your Stopword List

Don't use the default stopword list from your search engine. Build a domain-specific list:

  • Remove genuinely meaningless words for your domain.
  • Keep words that carry semantic weight in your context.
  • Review the list against your real query logs quarterly.

For an entertainment search engine, remove "the" from the stopword list entirely — it's too often part of titles. For a technical documentation search, remove "IT," "OR," "NOT" from the stopword list — they're meaningful terms.
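One pragmatic way to build that custom list is to subtract every term that appears in your own catalog's titles or entity names from the engine's default list. A sketch, with hypothetical catalog data:

```python
# Hypothetical subset of a default English stopword list.
DEFAULT_STOPWORDS = {"the", "a", "an", "of", "to", "in", "at", "is", "it", "not", "or"}

# Titles/entities from your own catalog (hypothetical examples).
catalog_titles = ["The Office", "IT", "What to Expect When You're Expecting"]

# Any word that occurs inside a title is too meaningful to strip.
title_terms = {tok.lower() for title in catalog_titles for tok in title.split()}
custom_stopwords = DEFAULT_STOPWORDS - title_terms

print(sorted(custom_stopwords))   # "the", "it", "to" are no longer stopwords
```

This errs on the side of keeping words, which is the safer direction: BM25 will de-weight anything you keep, but nothing can recover a word you removed at index time.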

2. Use Conditional Stopword Handling

Some search engines support treating stopwords as optional rather than removing them:

In Solr (eDisMax):

<filter class="solr.StopFilterFactory" words="stopwords.txt"/>  <!-- field analyzer -->
<str name="mm">100%</str>  <!-- eDisMax: require all terms -->

With mm=100%, stopwords are preserved in Boolean evaluation but won't dominate scoring because BM25's IDF naturally de-weights them.

In Elasticsearch: You can use minimum_should_match to control how optional terms (including stopwords) affect matching:

{
  "match": {
    "title": {
      "query": "the office",
      "minimum_should_match": "100%"
    }
  }
}

3. Index Stopwords But Make Them Optional in Queries

A balanced approach:

  1. Don't use a stopword filter at index time. Index everything, including stopwords. This preserves phrase matching capability and keeps the full text available.
  2. Use common grams or shingles to create paired tokens that include stopwords (e.g., "the_office", "not_waterproof"). This gives phrase queries high-quality matches without the overhead of full positional matching.
  3. At query time, let BM25's IDF handle de-weighting naturally. Stopwords will have minimal impact on scoring but will be available for phrase matching.
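The common-grams idea from step 2 can be illustrated in a few lines of plain Python. This is only a sketch of what a common-grams token filter (such as the one in Elasticsearch/Lucene) does, not its actual implementation:

```python
STOPWORDS = {"the", "not", "to", "of"}

def common_grams(tokens: list[str]) -> list[str]:
    """Emit the original tokens plus a word_word bigram whenever
    either member of an adjacent pair is a stopword."""
    out = list(tokens)
    for a, b in zip(tokens, tokens[1:]):
        if a in STOPWORDS or b in STOPWORDS:
            out.append(f"{a}_{b}")
    return out

print(common_grams(["the", "office"]))       # ['the', 'office', 'the_office']
print(common_grams(["not", "waterproof"]))   # ['not', 'waterproof', 'not_waterproof']
```

A query for "the office" can then match the fused token "the_office" directly, giving the precision of a phrase match without consulting positional data.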

4. The Multi-Field Approach

For maximum flexibility, index the same content into two fields:

  • Field A: Analyzed with stopword removal (for broad recall and efficient scoring).
  • Field B: Analyzed without stopword removal (for phrase matching and precision).

Boost Field B higher for phrase queries and Field A for individual keyword queries. This gives you the best of both worlds.
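The two-field query can be sketched as an Elasticsearch-style request body built in Python. The field names ("title" for the stopword-stripped field, "title.exact" for the preserved one) and the boost value are hypothetical, not names your mapping necessarily uses:

```python
def build_query(user_query: str) -> dict:
    """Combine broad recall (Field A) with boosted phrase precision (Field B)."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Field A: analyzed with stopword removal, broad recall.
                    {"match": {"title": user_query}},
                    # Field B: stopwords preserved, boosted for phrase matches.
                    {"match_phrase": {
                        "title.exact": {"query": user_query, "boost": 2.0}
                    }},
                ]
            }
        }
    }

body = build_query("the office")
```

A document titled "The Office" matches both clauses and outranks generic office-supply documents, which only match the first.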

Cross-Language Considerations

Stopword handling becomes even more nuanced in multilingual search:

French

French has more articles and prepositions that carry grammatical meaning. The word "à" (at/to) changes the meaning of a phrase completely: "café à emporter" (takeaway coffee) vs. "café emporter" (nonsensical). French stopword lists must be more conservative.

German

German compound words are a specific challenge. Decompounding filters split compounds into their parts: "Arbeitgeber" (employer) becomes "Arbeit" (work) + "geber" (giver). A stopword filter applied after decompounding can strip subtokens it should not, and standard stopword removal doesn't interact with decompounders properly in all configurations.

Arabic

Arabic is a root-based language where prefixed articles (like "ال" — "al") are integral to word identity. Aggressive stopword removal can strip these prefixes and fundamentally change the meaning of terms.

CJK (Chinese, Japanese, Korean)

These languages typically don't have traditional stopwords in the Western sense. "Function words" serve grammatical purposes but are handled differently — through specialized tokenizers and bigram analysis rather than stopword lists.

Auditing Stopword Impact

If you have an existing search system, here's how to assess whether stopword removal is hurting your relevance:

Step 1: Extract Stopword-Containing Queries

Pull queries from your search logs that contain common stopwords. Filter for queries where stopwords are likely meaningful — proper nouns, negations, titles.
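The extraction step is a simple filter over the query log. A minimal sketch, with a hypothetical stopword set and log sample:

```python
STOPWORDS = {"the", "not", "to", "in", "at", "of", "it", "or", "is"}

def stopword_queries(log_lines: list[str]) -> list[str]:
    """Keep only queries that contain at least one stopword token."""
    return [q for q in log_lines
            if any(tok in STOPWORDS for tok in q.lower().split())]

logs = ["the office", "wireless mouse", "not waterproof", "usb hub"]
print(stopword_queries(logs))   # -> ['the office', 'not waterproof']
```

From that subset, hand-review a sample to flag the queries where the stopword is load-bearing (titles, negations, abbreviations); those are your audit candidates.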

Step 2: Compare Results With and Without Stopwords

Run each query against your current index (with stopword removal) and a test index (without stopword removal). Compare:

  • Result rankings (position of expected results).
  • Zero-result rates.
  • Relevance judgments (have domain experts review both result sets).

Step 3: Measure the Impact

Track:

  • Zero-result rate change: Did preserving stopwords reduce zero-result queries?
  • nDCG@10 change: Did ranking quality improve for stopword-containing queries?
  • Latency change: Did preserving stopwords increase query times? (Usually minimal with modern engines.)
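The nDCG@10 comparison can be computed with a few lines of Python. The relevance judgments below are hypothetical, standing in for what your domain experts would produce for one audited query:

```python
import math

def dcg(rels: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top k results."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_10(rels: list[float]) -> float:
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Hypothetical judgments (1 = relevant) for the same query on both indexes:
with_removal    = [0, 0, 1, 0, 1]   # expected results buried
without_removal = [1, 1, 0, 0, 1]   # expected results on top

print(ndcg_at_10(with_removal), ndcg_at_10(without_removal))
```

Average the per-query deltas over the whole audited set; a consistent lift on stopword-containing queries is the signal that removal was hurting you.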

Step 4: Iterate on Your Stopword List

Based on the audit, build a custom stopword list. Start by removing obvious noise words, then add back any word that appears in queries where removal degrades relevance.

The Bottom Line

Stopwords are guilty until proven innocent — and too many search teams execute without a trial.

The default behavior of stripping stopwords was designed for a hardware-constrained era. Modern search engines handle high-frequency terms efficiently, and scoring models like BM25 naturally de-weight them.

Before applying stopword removal, ask: Does removing this word ever change the meaning of a real query? If the answer is yes — and it almost always is — keep it and let the scoring model do its job.

Said Bouigherdaine