Stopwords are Not as Harmless as They Look
The Default That Breaks Things
Words like "the," "in," "at," "of," "to," "is" — search engines call them stopwords and most systems strip them out by default. The reasoning seems sound: these words appear in nearly every document, carry minimal semantic weight, and inflate index size. Removing them makes the index smaller and queries faster.
But in search relevance, removing them can break real queries in ways that are immediately visible to users and silently devastating to metrics.
When Stopwords Carry Meaning
The assumption behind stopword removal is that these words are informationally empty — noise that dilutes relevance scores. That assumption is wrong more often than most search engineers realize.
Example 1: "The Office"
- With stopwords preserved: Matches the TV show The Office.
- With stopwords removed: Becomes just "office" → matches office furniture, office supplies, coworking spaces, corporate dashboards.
The word "the" here is not filler. It's a disambiguator. It transforms a generic noun into a specific proper noun.
Example 2: "What to Expect When You're Expecting"
- With stopwords preserved: Matches the book/movie title exactly.
- With stopwords removed: Becomes "expect expecting" — which could match anything about project expectations, delivery timelines, or weather forecasts.
Example 3: "Not Waterproof"
- With stopwords preserved: Correctly indicates a negation — the user wants products that are NOT waterproof, or is checking if something lacks waterproofing.
- With stopwords removed: Becomes "waterproof" — the exact opposite of the user's intent.
Example 4: "The Who" / "The The" / "IT"
Band names and common abbreviations are particularly vulnerable. "IT" as an industry term gets stripped because "it" is on most stopword lists. "The Who" becomes just "Who."
Example 5: Phrase Queries
Stopword removal is especially destructive for phrase queries. A user searching for the exact phrase "to be or not to be" loses the entire query to stopword removal — every single word is on most stopword lists.
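The failure mode is easy to demonstrate. A minimal sketch of classic stopword filtering, using a small illustrative subset of a typical English stopword list (not any particular engine's defaults):

```python
# Illustrative subset of a generic English stopword list.
STOPWORDS = {"to", "be", "or", "not", "the", "in", "at", "of", "is", "a"}

def strip_stopwords(query: str) -> list:
    """Lowercase, whitespace-tokenize, and drop stopwords."""
    return [t for t in query.lower().split() if t not in STOPWORDS]

print(strip_stopwords("to be or not to be"))  # every token is a stopword -> []
print(strip_stopwords("The Office"))          # -> ['office']
```

The famous phrase query is reduced to nothing at all, and "The Office" degrades to a generic noun before the index ever sees it.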
The Problem at Scale
These aren't edge cases. In a large-scale search system serving millions of queries, even a 1% failure rate from stopword removal represents thousands of broken user experiences per day.
Here's what the data typically shows when you audit stopword-related query failures:
| Query Pattern | Failure Mode | Impact |
|---|---|---|
| Proper nouns with articles ("The Mandalorian") | Disambiguation lost | Wrong results |
| Negations ("not included") | Meaning inverted | Dangerously wrong results |
| Prepositions as differentiators ("in" vs. "on") | Spatial/contextual meaning lost | Irrelevant results |
| Song/book/movie titles | Complete title destroyed | Zero or wrong results |
| Technical abbreviations ("IT", "OR", "AS") | Terms stripped entirely | Missing results |
How Scoring Models Handle Stopwords
Even without explicit removal, scoring models like BM25 naturally de-weight stopwords through their Inverse Document Frequency (IDF) component.
The IDF formula:
IDF(term) = log(1 + (N - n + 0.5) / (n + 0.5))
Where N is the total number of documents and n is the number of documents containing the term.
A stopword like "the" appears in nearly every document, so n ≈ N, making IDF close to zero. This means BM25 already assigns near-zero weight to stopwords — without removing them.
This is a crucial insight: BM25 already solves the scoring problem that stopword removal was designed to fix. The remaining justification is storage and performance optimization, not relevance.
The Performance Argument
The traditional argument for stopword removal is performance:
- Index size: Stopwords appear in nearly every document, so their posting lists are enormous. Removing them reduces index size significantly (20-30% in some cases).
- Query speed: Evaluating a term that matches 90% of documents is expensive. Removing it from the query avoids that cost.
These were compelling arguments when hardware was expensive and indexes were stored on spinning disks. In 2026, with modern SSDs, ample RAM, and efficient compression algorithms (like Lucene's block-based encoding), the performance benefits of stopword removal are much smaller than they used to be.
Modern mitigations:
- LZ4/DEFLATE compression on posting lists dramatically reduces the storage overhead of high-frequency terms.
- Block-max WAND (used in Lucene 9+) skips over low-scoring documents efficiently, so evaluating a stopword term doesn't scan the entire posting list.
- Phrase queries with slop can use positional data to match phrases accurately, but only if stopwords are preserved in the index with positions.
The Right Approach: Nuanced Stopword Handling
Instead of the binary choice of "remove all stopwords" or "keep all stopwords," use a nuanced, domain-specific strategy.
1. Customize Your Stopword List
Don't use the default stopword list from your search engine. Build a domain-specific list:
- Remove genuinely meaningless words for your domain.
- Keep words that carry semantic weight in your context.
- Review the list against your real query logs quarterly.
For an entertainment search engine, remove "the" from the stopword list entirely — it's too often part of titles. For a technical documentation search, remove "IT," "OR," "NOT" from the stopword list — they're meaningful terms.
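One way to maintain such a list is to start from a generic base and subtract domain exceptions, so the exceptions are explicit and reviewable. A sketch with illustrative word sets (neither is any engine's actual default):

```python
# Illustrative generic English stopword list.
GENERIC_STOPWORDS = {"the", "a", "an", "of", "in", "at", "to", "is",
                     "it", "or", "not", "as"}

# Words that carry meaning in a technical-documentation domain
# ("IT", boolean "OR"/"NOT", SQL "AS"), so they must stay indexed.
TECH_DOCS_EXCEPTIONS = {"it", "or", "not", "as"}

domain_stopwords = GENERIC_STOPWORDS - TECH_DOCS_EXCEPTIONS
print(sorted(domain_stopwords))
```

Keeping the exception set in version control next to the query-log audit makes the quarterly review a diff, not a rewrite.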
2. Use Conditional Stopword Handling
Some search engines support treating stopwords as optional rather than removing them:
In Solr (eDisMax):
```xml
<str name="stopwords">stopwords.txt</str>
<str name="mm">100%</str> <!-- require all terms -->
```
With mm=100%, stopwords are preserved in Boolean evaluation but won't dominate scoring because BM25's IDF naturally de-weights them.
In Elasticsearch:
You can use minimum_should_match to control how optional terms (including stopwords) affect matching:
```json
{
  "match": {
    "title": {
      "query": "the office",
      "minimum_should_match": "100%"
    }
  }
}
```
3. Index Stopwords But Make Them Optional in Queries
A balanced approach:
- Don't use a stopword filter at index time. Index everything, including stopwords. This preserves phrase matching capability and keeps the full text available.
- Use common grams or shingles to create paired tokens that include stopwords (e.g., "the_office", "not_waterproof"). This gives phrase queries high-quality matches without the overhead of full positional matching.
- At query time, let BM25's IDF handle de-weighting naturally. Stopwords will have minimal impact on scoring but will be available for phrase matching.
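The common-grams idea above can be sketched in a few lines: whenever a stopword is adjacent to another token, emit a paired token alongside the plain ones. This is a simplified illustration of the technique, not Lucene's actual `CommonGramsFilter` implementation:

```python
STOPWORDS = {"the", "not", "of", "to"}

def common_grams(tokens):
    """Emit plain tokens plus stopword-adjacent pair tokens."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in STOPWORDS or tokens[i + 1] in STOPWORDS):
            out.append(f"{tok}_{tokens[i + 1]}")
    return out

print(common_grams(["the", "office"]))      # ['the', 'the_office', 'office']
print(common_grams(["not", "waterproof"]))  # ['not', 'not_waterproof', 'waterproof']
```

A phrase query for "the office" can then match the single high-selectivity token `the_office` instead of intersecting two posting lists positionally.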
4. The Multi-Field Approach
For maximum flexibility, index the same content into two fields:
- Field A: Analyzed with stopword removal (for broad recall and efficient scoring).
- Field B: Analyzed without stopword removal (for phrase matching and precision).
Boost Field B higher for phrase queries and Field A for individual keyword queries. This gives you the best of both worlds.
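A sketch of how the query side of this might look, building an Elasticsearch `multi_match` body. The field names (`title_light` with stopwords removed, `title_exact` without) and boost values are hypothetical:

```python
def build_query(user_query: str, is_phrase: bool) -> dict:
    """Boost the stopword-preserving field for phrase queries,
    the stopword-stripped field for loose keyword queries."""
    if is_phrase:
        fields = ["title_exact^3", "title_light"]
        match_type = "phrase"
    else:
        fields = ["title_light^2", "title_exact"]
        match_type = "best_fields"
    return {"multi_match": {"query": user_query,
                            "fields": fields,
                            "type": match_type}}

print(build_query("the office", is_phrase=True))
```

The routing decision (is this a phrase query?) can come from explicit quotes in the user input or a lightweight query classifier.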
Cross-Language Considerations
Stopword handling becomes even more nuanced in multilingual search:
French
French has more articles and prepositions that carry grammatical meaning. The word "à" (at/to) changes the meaning of a phrase completely: "café à emporter" (takeaway coffee) vs. "café emporter" (nonsensical). French stopword lists must be more conservative.
German
German compound words are a specific challenge. Stopwords embedded in compound words can't be removed without breaking the word: "Arbeitgeber" (employer) contains "Arbeit" (work) + "geber" (giver). Standard stopword removal doesn't interact with decompounders properly in all configurations.
Arabic
Arabic is a root-based language where prefixed articles (like "ال" — "al") are integral to word identity. Aggressive stopword removal can strip these prefixes and fundamentally change the meaning of terms.
CJK (Chinese, Japanese, Korean)
These languages typically don't have traditional stopwords in the Western sense. "Function words" serve grammatical purposes but are handled differently — through specialized tokenizers and bigram analysis rather than stopword lists.
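The bigram approach for CJK text can be sketched simply: emit overlapping character 2-grams, so function words dissolve into the n-gram stream instead of being stripped. This mirrors the idea behind analyzers like Lucene's CJK bigram handling, in simplified form:

```python
def cjk_bigrams(text: str):
    """Overlapping character 2-grams; single chars pass through."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```

Every adjacent character pair becomes an indexable unit, so matching works without a word-boundary dictionary or a stopword list.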
Auditing Stopword Impact
If you have an existing search system, here's how to assess whether stopword removal is hurting your relevance:
Step 1: Extract Stopword-Containing Queries
Pull queries from your search logs that contain common stopwords. Filter for queries where stopwords are likely meaningful — proper nouns, negations, titles.
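A minimal sketch of this extraction step, assuming one query per log line; the stopword set is an illustrative sample:

```python
import re

STOPWORDS = {"the", "not", "in", "on", "to", "of", "it", "or"}

def stopword_queries(log_lines):
    """Yield queries containing at least one stopword token."""
    for line in log_lines:
        query = line.strip().lower()
        tokens = re.findall(r"[a-z0-9']+", query)
        if any(t in STOPWORDS for t in tokens):
            yield query

logs = ["The Office", "wireless mouse", "not waterproof jacket"]
print(list(stopword_queries(logs)))  # ['the office', 'not waterproof jacket']
```

In practice you would then sample these queries and hand-label which ones use the stopword meaningfully (title, negation, preposition) versus incidentally.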
Step 2: Compare Results With and Without Stopwords
Run each query against your current index (with stopword removal) and a test index (without stopword removal). Compare:
- Result rankings (position of expected results).
- Zero-result rates.
- Relevance judgments (have domain experts review both result sets).
Step 3: Measure the Impact
Track:
- Zero-result rate change: Did preserving stopwords reduce zero-result queries?
- nDCG@10 change: Did ranking quality improve for stopword-containing queries?
- Latency change: Did preserving stopwords increase query times? (Usually minimal with modern engines.)
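The nDCG@10 metric mentioned above is straightforward to compute from graded relevance judgments. A self-contained sketch with toy judgment data:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_10(relevances):
    """nDCG@10: DCG of the top 10 divided by the ideal ordering's DCG."""
    top = relevances[:10]
    ideal = sorted(relevances, reverse=True)[:10]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy comparison: same judged documents, two rankings of them.
with_stopwords = [3, 2, 2, 1, 0]     # expected result ranked first
without_stopwords = [0, 1, 2, 2, 3]  # expected result pushed to the bottom
print(ndcg_at_10(with_stopwords), ndcg_at_10(without_stopwords))
```

Computing this per query, then averaging over the stopword-containing subset of the log, gives a single number to track across index configurations.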
Step 4: Iterate on Your Stopword List
Based on the audit, build a custom stopword list. Start by removing obvious noise words, then add back any word that appears in queries where removal degrades relevance.
The Bottom Line
Stopwords are guilty until proven innocent — and too many search teams execute without a trial.
The default behavior of stripping stopwords was designed for a hardware-constrained era. Modern search engines handle high-frequency terms efficiently, and scoring models like BM25 naturally de-weight them.
Before applying stopword removal, ask: Does removing this word ever change the meaning of a real query? If the answer is yes — and it almost always is — keep it and let the scoring model do its job.