Wolf Garbe, CEO and co-founder of SeekStorm

N-gram index for faster phrase search: latency vs. size

TL;DR

N-gram indexing improved mean phrase query latency by a factor of 2.18 (118%),
p99.9 tail latency by a factor of 7.63 (663%),
and some phrase queries by up to 3 orders of magnitude.

The SeekStorm sub-millisecond lexical search library & multi-tenancy server is implemented in Rust,
open source licensed under the Apache License 2.0, and available on GitHub.

Not all queries and data are made equal

If you strive for the best possible performance, there is no one-size-fits-all approach.

Depending on the query type (union, intersection, phrase) and on the term frequency (posting list length) of your data,
many optimizations can be applied: WAND, MAXSCORE, roaring bitmaps, SIMD, galloping. Another interesting optimization is N-gram indexing.

What are N-grams

In general, N-grams are contiguous sequences of n adjacent items (words, characters, or symbols) extracted from a text.

In SeekStorm, N-grams are contiguous sequences of two or three frequent and/or rare terms as opposed to character sequences (see below for a list of supported N-gram types), extracted from a document or query.

What are frequent terms?

SeekStorm allows different N-gram types, each defined as a combination of two or three frequent and/or rare terms. But what is the distinction between a frequent and a rare term? It is the number of times a term occurs within the index. Rarer terms have a higher discriminatory value, which is also honored by the BM25 formula. Frequent terms, on the other hand, have a lower discriminatory value, but due to their frequent occurrence, they take up a lot of index space and computing resources during intersection. Therefore, they are sometimes even excluded from indexing as stop words. But this is no perfect solution, as there are valid queries, like “The Who” or “The Doors”, consisting of or containing frequent terms.

But where exactly is the frontier where a rare term becomes a frequent term, and vice versa? There is no general definition. It’s up to the user to define the frequent terms; all other terms are treated as rare terms. In SeekStorm, this is done by providing a list of user-defined frequent terms during index creation.

When defining frequent terms, you can simply take what would be considered stop words in the language of the documents you are indexing. The more terms you add to the frequent terms list, the more queries will benefit from the latency advantage of N-grams, at the cost of a higher index size. And the more frequent a term on that list actually is, the more pronounced its latency benefit is relative to its index size cost.
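
For illustration, here is a minimal sketch (not SeekStorm's actual API) of how a tokenizer might classify terms against such a user-defined frequent-terms list:

```rust
use std::collections::HashSet;

// Hypothetical sketch: classify each token as frequent or rare against a
// user-defined frequent-terms list supplied at index creation.
fn classify<'a>(tokens: &[&'a str], frequent: &HashSet<&str>) -> Vec<(&'a str, bool)> {
    tokens.iter().map(|&t| (t, frequent.contains(t))).collect()
}

fn main() {
    // A tiny frequent-terms list, e.g. taken from an English stop word list.
    let frequent: HashSet<&str> =
        ["the", "who", "is", "to", "be", "or", "not", "it"].into_iter().collect();
    for (term, is_frequent) in classify(&["the", "doors", "who", "is"], &frequent) {
        println!("{term}: {}", if is_frequent { "frequent" } else { "rare" });
    }
}
```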

Frequent terms are rare - i.e., there are only a few terms that appear in almost every document. We prefer combinations of frequent terms in N-grams for two reasons:

  1. Frequent terms have long posting lists, and their conventional intersection would be expensive. By using N-grams instead, we save the most where intersection would otherwise cost the most.
  2. Because there are only a few frequent terms, the number of combinations with frequent terms and thus their additional index size cost is limited too. We optimize the fat head instead of the long tail.

How about automatic detection of frequent terms during indexing?

Benefits

  • language independent,
  • no pre-definition of frequent terms required

Drawbacks

  • When indexing starts, we have no information about term frequency. We either need to index the first block without frequent terms and N-grams, or we need to reindex it once we have term frequencies collected.
  • Term frequencies might also change during indexing. Then we either need to reindex everything from scratch, or have different frequent terms per level, which would also require a different query rewriting to N-grams per block.

In SeekStorm, we forgo automatic detection of frequent terms, which would make both indexing and querying more complex, in favor of a predefined set of frequent terms specified during index definition.

How does N-gram phrase search work?

The intersection of frequent terms is challenging because their posting lists are very long, and intersecting them takes a lot of time. This is especially true when very frequent terms (aka stopwords) are involved, which occur in almost every document and can contribute significantly to tail latencies.

Examples of those queries are “The Who”, “Take That”, “Who is Who”, “Let it be”, “To be or not to be”, “End of Days”, “The The”, “What might have been”.

One way to significantly reduce the query latency is by indexing N-grams in addition to single terms, essentially moving the computation of intersection and phrase matching from query time to indexing time.

When we use N-gram indexing also for phrase search of rare terms besides frequent terms, or even for intersection (AND) by removing the condition that the N-gram terms have to be adjacent, this significantly increases the index size.

Such an approach has been used in the distributed web search engine FAROO. To keep the index size manageable, highly discriminative keys (HDK), noun-only indexing, aggressive stemming, and other keyword reduction techniques have been used.

But when N-grams are restricted only to adjacent frequent terms, then the additional index cost in terms of space and time is negligible. On the other hand, those queries containing multiple adjacent frequent terms might be accelerated by 3 orders of magnitude,
because the posting list of the N-gram is much shorter than those of the original single terms. Also, for phrase queries of two or three frequent terms, no intersection and phrase matching is necessary anymore.
The phrase matching of consecutive N-grams (sequences containing frequent terms) and single terms (rare terms) is accelerated as well, as it becomes the intersection between a very short N-gram posting list and the other terms. This enables the use of additional accelerating techniques like galloping.
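
To illustrate that last point, here is a minimal sketch (not SeekStorm's actual code) of intersecting a short N-gram posting list with a much longer single-term posting list via galloping search:

```rust
// Galloping (exponential) intersection sketch: for each docid in the short
// N-gram posting list, probe exponentially growing steps in the long list,
// then binary search inside the bracketed window. Complexity is roughly
// O(|short| * log |long|) instead of a linear merge over both lists.
fn gallop_intersect(short: &[u32], long: &[u32]) -> Vec<u32> {
    let mut result = Vec::new();
    let mut lo = 0usize;
    for &doc in short {
        // Double the step until the probe reaches a docid >= doc (or the end).
        let mut step = 1usize;
        while lo + step < long.len() && long[lo + step] < doc {
            step *= 2;
        }
        let hi = (lo + step + 1).min(long.len());
        // Binary search within the bracketed window [lo, hi).
        match long[lo..hi].binary_search(&doc) {
            Ok(i) => {
                result.push(doc);
                lo += i + 1;
            }
            Err(i) => lo += i,
        }
        if lo >= long.len() {
            break;
        }
    }
    result
}
```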

While BM25 is not ideal for phrase search, as it doesn’t honor term proximity, we might still want to use it for comparability, so that the BM25 score and the top-k results are identical whether we are using N-gram indexing (for performance) or not.

To make N-gram indexing compatible with BM25 ranking, we have to implement additional measures. We store the posting_counts (DF) and the positions_count of the partial terms (TF) of the N-gram for BM25 calculation. While we could have used the positions_count stored in the posting lists of single terms, it would be expensive to find the docid within the posting lists (docid/pointer list) of single terms via custom binary search. That would nullify the speed advantage of N-gram indexing. Therefore, we sacrifice some space to store the positions_counts of the single term together with the positions of the N-gram.
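
Schematically, and with illustrative field names rather than SeekStorm's actual structs, an N-gram posting then carries the following extra payload:

```rust
// Illustrative layout (names are assumptions, not SeekStorm's code):
// a per-document N-gram posting carries the per-partial-term TFs, so BM25
// can be computed without touching the single-term posting lists.
struct NgramPosting {
    docid: u32,
    positions: Vec<u32>,   // positions of the whole N-gram (for phrase checks)
    partial_tfs: [u32; 3], // positions_count (TF) of each partial term in this doc
}

// Stored once per N-gram term: the document frequencies of the partial terms.
struct NgramTermStats {
    partial_dfs: [u64; 3], // posting_count (DF) of each partial term
}
```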

N-gram indexing is orthogonal to other acceleration techniques mentioned above.

A traditional phrase search consists of two steps:

  1. First, we intersect two single-term posting lists (e.g., by merging and galloping).
  2. Then, for each intersection match, we have to do a phrase check, by checking whether or not the term positions in the document are adjacent in the order of the phrase query terms. For long posting lists that can be expensive.
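
A minimal sketch of this two-step procedure for a two-term phrase (simplified; not SeekStorm's implementation):

```rust
use std::collections::HashMap;

// docid -> sorted positions of the term within that document
type Postings = HashMap<u32, Vec<u32>>;

// Step 1 intersects the two posting lists (here via hash lookups); step 2
// performs the phrase check, i.e., tests whether term B occurs directly
// after some occurrence of term A in the matching document.
fn phrase_search(a: &Postings, b: &Postings) -> Vec<u32> {
    let mut hits = Vec::new();
    for (doc, pos_a) in a {
        if let Some(pos_b) = b.get(doc) {
            let adjacent = pos_a.iter().any(|p| pos_b.binary_search(&(p + 1)).is_ok());
            if adjacent {
                hits.push(*doc);
            }
        }
    }
    hits
}
```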

Both steps - intersection and phrase check - can be eliminated if we already precalculate N-grams (bigram and trigram phrase matches) at index time. Then, at query time, for two- and three-term queries, the result list is directly available by accessing the appropriate n-gram posting lists.

For phrase queries longer than three terms, the queries can be transformed to queries with fewer terms, consisting of n-grams.
After rewriting the query to N-grams, we have fewer terms, each with shorter posting lists, reducing the effort for intersection and the subsequent phrase check.

N-grams and tokenizer

N-grams are generated in the tokenizer, both for indexing and search.

For indexing, all possible variants of overlapping N-grams are generated. For searching, a single variant of non-overlapping N-grams is chosen.
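
A simplified sketch of the indexing side, under the assumption that all bigram and trigram types listed further below are enabled (SeekStorm's tokenizer additionally records positions and handles many edge cases):

```rust
// Emit all single terms plus every overlapping bigram/trigram of a supported
// type. Supported bigrams (FF, FR, RF) contain at least one frequent term;
// the supported trigrams (FFF, RFF, FFR, FRF) all contain at least two.
fn index_terms(tokens: &[&str], is_frequent: &dyn Fn(&str) -> bool) -> Vec<String> {
    let freq = |s: &[&str]| s.iter().filter(|&&t| is_frequent(t)).count();
    let mut terms: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
    for w in tokens.windows(2) {
        if freq(w) >= 1 {
            terms.push(w.join("_"));
        }
    }
    for w in tokens.windows(3) {
        if freq(w) >= 2 {
            terms.push(w.join("_"));
        }
    }
    terms
}
```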

We currently use a greedy algorithm for selecting the best N-grams for query rewriting.
We have also experimented with dynamic programming, and I adapted my Fast Word Segmentation algorithm (already ported from C# to Rust for the Chinese word segmentation in SeekStorm) for N-gram query rewriting using the Viterbi algorithm (Naive Bayes). But according to my benchmarks, the additional effort for Viterbi over Greedy was not always justified.
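
A greedy sketch of the query side (again simplified, assuming all N-gram types are enabled): at each position, take the longest supported N-gram and move on.

```rust
// Greedy N-gram query rewriting: prefer a supported trigram, then a supported
// bigram, then fall back to the single term. With "to", "be", "or", "not"
// as frequent terms, "to be or not to be" becomes "to_be_or not_to_be",
// matching the rewritings shown in the benchmarks below.
fn rewrite_query(query: &[&str], is_frequent: &dyn Fn(&str) -> bool) -> Vec<String> {
    let freq = |s: &[&str]| s.iter().filter(|&&t| is_frequent(t)).count();
    let mut out = Vec::new();
    let mut i = 0;
    while i < query.len() {
        if i + 3 <= query.len() && freq(&query[i..i + 3]) >= 2 {
            out.push(query[i..i + 3].join("_")); // FFF/RFF/FFR/FRF trigram
            i += 3;
        } else if i + 2 <= query.len() && freq(&query[i..i + 2]) >= 1 {
            out.push(query[i..i + 2].join("_")); // FF/FR/RF bigram
            i += 2;
        } else {
            out.push(query[i].to_string()); // rare single term
            i += 1;
        }
    }
    out
}
```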

N-grams and BM25

BM25 scores (SimilarityType::Bm25f) are almost identical for both n-gram and single-term indexing. There are only small differences for phrase search resulting from normalization (32bit->8bit->32bit lossy logarithmic compression/decompression) that is used for posting_count_ngram1/2/3, but not for single term posting_counts.

This is achieved as we store for every partial term of an n-gram its document frequency (DF), i.e. the number of documents a term occurs in. We also store for every partial term of an n-gram its term frequency (TF), i.e., the number of positions within a document.
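
Given those stored statistics, scoring an N-gram hit reduces to summing standard BM25 contributions of its partial terms; a sketch (Bm25f field weighting omitted):

```rust
// BM25 over the stored per-partial-term statistics: `df` is the partial
// term's document frequency, `tf` its position count within this document,
// `dl`/`avgdl` the document length and average document length, and
// `n_docs` the total number of documents in the index.
fn bm25_ngram(n_docs: f64, dl: f64, avgdl: f64, partials: &[(u64, u32)]) -> f64 {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    partials
        .iter()
        .map(|&(df, tf)| {
            let idf = ((n_docs - df as f64 + 0.5) / (df as f64 + 0.5) + 1.0).ln();
            let tf = tf as f64;
            idf * (tf * (K1 + 1.0)) / (tf + K1 * (1.0 - B + B * dl / avgdl))
        })
        .sum()
}
```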

For the alternative BM25 proximity score (SimilarityType::Bm25fProximity) we use DF and TF of the N-gram, as opposed to the DF and TF of its partial terms. This way, the proximity of terms within an N-gram is honored for scoring.
But then we can no longer score the posting list length and position count of the individual two or three N-gram terms independently,
as we store only the posting list length and position count of the N-gram itself, not of its partial terms.

Example query: “tallest trees in the world”

| Top-10 DocID | SingleTerm Bm25f | Top-10 DocID | faithful N-gram Bm25f | Top-10 DocID | N-gram Bm25fProximity |
|---|---|---|---|---|---|
| 2456001 | 16.940866 | 2456001 | 17.131088 | 891973 | 8.5893530 |
| 1615029 | 16.919497 | 1615029 | 17.113970 | 2456001 | 8.5521690 |
| 891973 | 16.165619 | 891973 | 16.351793 | 1615029 | 7.5108185 |
| 2094543 | 13.590414 | 2094543 | 13.776010 | 2094543 | 6.4961680 |
| 1162534 | 13.225990 | 1162534 | 13.413112 | 1162534 | 6.2484310 |
| 2676541 | 7.4227843 | 2676541 | 7.5982113 | 3494456 | 4.9255943 |
| 3494456 | 7.3537903 | 3494456 | 7.5184870 | 3308553 | 4.3953430 |
| 3308553 | 7.0942807 | 3308553 | 7.2625637 | 958861 | 4.1408840 |
| 958861 | 6.9975886 | 958861 | 7.1672870 | 2676541 | 2.5677156 |
| 3513862 | 3.1011164 | 3513862 | 3.2637115 | 3513862 | 1.6718745 |


The small differences between SingleTerm Bm25f and faithful N-gram Bm25f are caused by lossy logarithmic compression/decompression of posting_count_ngram1/2/3, but the order of the top-10 DocID is mostly preserved.

SeekStorm N-gram indexing

SeekStorm had bigram indexing of two adjacent frequent terms from the very beginning.
This allows for very fast phrase search of frequent terms, e.g., “The Who” (the rock band).
And bigrams of frequent terms deliver ‘the best bang for the buck’ in terms of the latency improvement vs. index size increase trade-off, as we will see below.

But starting with SeekStorm v0.13.0, we go a step further with a more generalized approach to N-gram indexing.
Instead of only indexing two frequent terms as a bigram in addition to the single terms, we allow indexing the following term combinations as N-grams:

  • SingleTerm
  • NgramFF : frequent frequent
  • NgramFR : frequent rare
  • NgramRF : rare frequent
  • NgramFFF : frequent frequent frequent
  • NgramRFF : rare frequent frequent
  • NgramFFR : frequent frequent rare
  • NgramFRF : frequent rare frequent

During index creation, the user can define any combination of the above N-gram types to be supported during indexing and search. SingleTerm indexing is always enabled.

N-gram indexing reduces the latency, but only for phrase search. Some phrase queries are 3 orders of magnitude faster.
But N-gram indexing comes at the cost of increasing both index size and indexing time.

While it’s not easy to go faster if you are already using the fastest processor, it is easy to go faster with N-gram indexing by sacrificing some storage space, as you usually haven’t maxed out disk capacity yet.
Also, upgrading storage is usually much more affordable than upgrading to a faster processor.

With N-gram indexing, SeekStorm follows the successful SymSpell pattern: trade index space and indexing time for query speed - as an option.

We provide benchmarks for latency, indexing speed, and index size to allow the user to make an informed decision about when to use N-gram indexing.

Every use case is different, with different data and different trade-off preferences. With SeekStorm v0.13.0, you now have a choice.

How about rewriting AND queries to Phrase queries?

Unfortunately, the use of explicit phrase queries by typing quotation marks (“”) is not internalized by the average user, even if their intent was obviously to perform a phrase query.

How about rewriting AND queries to Phrase queries? This would allow AND queries to profit from the faster speed of N-gram phrase queries.
But it comes at a cost. If the AND query was not intended as a phrase query, we might get insufficient results with bad recall.

There are some heuristics. If the user is searching for two frequent terms or a rare and a frequent term, it is most likely intended as a phrase query, just because a frequent term outside a phrase doesn’t add much discriminatory value to the query.

Another way is to start with rewriting the query as a phrase query. This will not only speed up the query but also lead to more relevant results.
In the case that not enough results are found, we could fall back to a normal AND/OR query.
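
As a control-flow sketch (function names and the result threshold are illustrative, not SeekStorm's API):

```rust
// Two-tier fallback: try the fast, more precise phrase interpretation first;
// only fall back to a conventional AND query when recall is insufficient.
const MIN_RESULTS: usize = 10;

fn search_with_fallback(query: &str) -> Vec<u32> {
    let hits = search_phrase(query);
    if hits.len() >= MIN_RESULTS {
        hits
    } else {
        search_and(query)
    }
}

// Illustrative stubs standing in for the real search paths.
fn search_phrase(_query: &str) -> Vec<u32> {
    Vec::new()
}
fn search_and(_query: &str) -> Vec<u32> {
    vec![1, 2, 3]
}

fn main() {
    println!("{:?}", search_with_fallback("the who"));
}
```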

Traditional Phrase search is always slower than AND search, because Phrase search = AND search + additional Phrase check.
But for N-gram Phrase search, usually the opposite is true. That’s why rewriting AND queries to N-gram Phrase queries is worth considering.

Could we go further and use N-grams to answer arbitrary AND queries directly? Not exactly. Because N-grams are consecutive occurrences of terms in a document, they are suitable for phrase search only. In AND search, the terms of a query might occur anywhere within a document, and thus wouldn’t be covered by N-grams.

But there is something slightly similar: Keyword combination indexing. We could derive all keyword combinations that occur within a document and index those combinations in addition to the single terms and N-grams. This would be essentially a precalculation of AND search at indexing time, and offer the same speed benefits to AND search that N-Gram indexing offers for Phrase search.

There is only one problem: the associated cost of naive keyword combination indexing in terms of index size and indexing time is unbearable. The number of combinations without repetition is ∑ᵢ₌₁..ₖ C(n, i) = ∑ᵢ₌₁..ₖ n! / (i!(n−i)!), where n is the number of unique terms in a document and k is the maximum keyword combination length, i.e., the maximum number of terms in an AND query we want to support = the maximum number of terms in a keyword combination we need to index.

E.g., for a document with 200 unique terms and a maximum query length of 5, that is ≈ 2,600,000,000 combinations we need to index!
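
The arithmetic behind that example, as a small sketch:

```rust
// Sum of C(n, i) for i = 1..=k, computed incrementally and exactly:
// C(n, j + 1) = C(n, j) * (n - j) / (j + 1).
fn combinations_up_to(n: u64, k: u64) -> u64 {
    (1..=k)
        .map(|i| (0..i).fold(1u64, |c, j| c * (n - j) / (j + 1)))
        .sum()
}

fn main() {
    // 200 unique terms, AND queries of up to 5 terms:
    println!("{}", combinations_up_to(200, 5)); // 2,601,668,490 ≈ 2.6 billion
}
```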

Luckily, there are several compression methods to significantly reduce the number of keyword combinations per document:

  • Stopword removal
  • Stemming
  • Highly discriminative keys (HDK)
  • Indexing title only
  • Indexing highly relevant documents only
  • Verb removal - indexing nouns only

HDK (highly discriminative keys) is an interesting concept where keyword combinations are only formed from frequent terms and term combinations until the resulting term combination becomes intrinsically rare.
In a second step, within those intrinsically rare posting lists, a full check of all query terms is done within bloom filters of terms associated with every document in the intrinsically rare posting list.
For details, see Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys.

Some of the above methods reduce recall but increase relevancy. This could be useful in a two-tier approach, where we first try to find results with fast keyword combination search, and only fall back to slow conventional intersection if that fails.

I used keyword combination indexing successfully 20 years ago when implementing the FAROO decentralized peer-to-peer web search engine. We used a DHT (distributed hash table) to store the posting lists at different peers (with a 20-fold redundancy to fight churn). As it was infeasible to transfer long posting lists between peers and users in real time at search time, we had to find a solution for intersection without posting list transfer. That’s why we chose keyword combination indexing back then.

Benchmarks

Benchmark results highly depend on the selected data and query set, as well as on the hardware and software environment.

We are using the English Wikipedia data and queries derived from the AOL query dataset, both from Tantivy’s search-benchmark-game. The benchmark has been executed in single-thread, single-shard mode without caching.

The results were generated using the following benchmark environment:

  • ThinkPad P1 Gen 6
  • Processor Intel(R) Core(TM) i7-13700H
  • SK hynix 2TB SSD PCIe 4.0 NVMe M.2
  • Ubuntu 24.04 LTS
  • Rust 1.88
  • OpenJDK 17

Comparison of different N-gram type combinations in terms of phrase query latency

[Figure: phrase search tail latencies]



| Library | N-gram type | mean (µs) | factor | p50 (µs) | factor | p95 (µs) | factor | p99 (µs) | factor | p99.9 (µs) | factor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SeekStorm 0.13.0 | Single Terms | 1,090 | 1.00 | 320 | 1.00 | 2,990 | 1.00 | 16,910 | 1.00 | 60,490 | 1.00 |
| SeekStorm 0.13.0 | Frequent Bigrams | 890 | 1.22 | 320 | 1.00 | 3,180 | 0.94 | 10,230 | 1.65 | 26,070 | 2.32 |
| SeekStorm 0.13.0 | Frequent Bigrams/Frequent Trigrams | 880 | 1.24 | 320 | 1.00 | 2,850 | 1.05 | 8,950 | 1.89 | 24,430 | 2.48 |
| SeekStorm 0.13.0 | Frequent Bigrams/Mixed Trigrams | 820 | 1.33 | 310 | 1.03 | 2,890 | 1.03 | 8,440 | 2.00 | 24,840 | 2.44 |
| SeekStorm 0.13.0 | Mixed Bigrams | 510 | 2.14 | 260 | 1.23 | 1,050 | 2.84 | 3,800 | 4.45 | 8,050 | 7.51 |
| SeekStorm 0.13.0 | Mixed Bigrams/Frequent Trigrams | 500 | 2.18 | 260 | 1.23 | 1,890 | 1.58 | 3,820 | 4.43 | 7,930 | 7.63 |
| SeekStorm 0.13.0 | Mixed Bigrams/Mixed Trigrams | 510 | 2.14 | 270 | 1.19 | 1,790 | 1.67 | 3,850 | 4.39 | 8,180 | 7.39 |


Of the 903 queries in the benchmark, 7 benefit from N-gram indexing with Frequent Bigrams, vs. 51 with Mixed Bigrams + Mixed Trigrams. Those are usually the most challenging ones, causing high tail latencies.

N-gram indexing can help to drastically reduce tail latencies.

For more information on tail latencies see our blog post Tail latencies and percentiles.

The cost of N-grams: index size and indexing time

| Library | N-gram type | Index size (bytes) | factor | Indexing time (s) | factor |
|---|---|---|---|---|---|
| SeekStorm 0.13.0 | Single Terms | 3,880,714,411 | 1.00 | 558 | 1.00 |
| SeekStorm 0.13.0 | Frequent Bigrams | 4,476,185,397 | 1.15 | 605 | 1.08 |
| SeekStorm 0.13.0 | Frequent Bigrams/Frequent Trigrams | 4,613,052,374 | 1.19 | 636 | 1.14 |
| SeekStorm 0.13.0 | Frequent Bigrams/Mixed Trigrams | 9,680,952,871 | 2.49 | 2,203 | 3.95 |
| SeekStorm 0.13.0 | Mixed Bigrams | 11,314,218,483 | 2.92 | 2,138 | 3.83 |
| SeekStorm 0.13.0 | Mixed Bigrams/Frequent Trigrams | 11,531,297,587 | 2.97 | 2,204 | 3.95 |
| SeekStorm 0.13.0 | Mixed Bigrams/Mixed Trigrams | 16,599,113,374 | 4.28 | 4,164 | 7.46 |


Some challenging phrase queries containing frequent terms:

“the who”

| Library | N-gram type | latency (µs) | factor | N-gram query rewriting |
|---|---|---|---|---|
| SeekStorm 0.13.0 | Single Terms | 151,850 | 1.00 | the who |
| SeekStorm 0.13.0 | Frequent Bigrams | 120 | 1,265.00 | the_who |
| SeekStorm 0.13.0 | Frequent Bigrams/Frequent Trigrams | 140 | 1,084.64 | the_who |
| SeekStorm 0.13.0 | Frequent Bigrams/Mixed Trigrams | 130 | 1,178.08 | the_who |
| SeekStorm 0.13.0 | Mixed Bigrams | 90 | 1,687.00 | the_who |
| SeekStorm 0.13.0 | Mixed Bigrams/Frequent Trigrams | 80 | 1,898.12 | the_who |
| SeekStorm 0.13.0 | Mixed Bigrams/Mixed Trigrams | 140 | 1,084.64 | the_who |


“the doors”

| Library | N-gram type | latency (µs) | factor | N-gram query rewriting |
|---|---|---|---|---|
| SeekStorm 0.13.0 | Single Terms | 8,160 | 1.00 | the doors |
| SeekStorm 0.13.0 | Frequent Bigrams | 9,500 | 0.86 | the doors |
| SeekStorm 0.13.0 | Frequent Bigrams/Frequent Trigrams | 7,480 | 1.09 | the doors |
| SeekStorm 0.13.0 | Frequent Bigrams/Mixed Trigrams | 8,890 | 0.92 | the doors |
| SeekStorm 0.13.0 | Mixed Bigrams | 130 | 62.77 | the_doors |
| SeekStorm 0.13.0 | Mixed Bigrams/Frequent Trigrams | 140 | 58.29 | the_doors |
| SeekStorm 0.13.0 | Mixed Bigrams/Mixed Trigrams | 140 | 58.29 | the_doors |


“who is who”

| Library | N-gram type | latency (µs) | factor | N-gram query rewriting |
|---|---|---|---|---|
| SeekStorm 0.13.0 | Single Terms | 65,220 | 1.00 | who is who |
| SeekStorm 0.13.0 | Frequent Bigrams | 8,280 | 7.88 | who_is who |
| SeekStorm 0.13.0 | Frequent Bigrams/Frequent Trigrams | 70 | 931.71 | who_is_who |
| SeekStorm 0.13.0 | Frequent Bigrams/Mixed Trigrams | 80 | 815.25 | who_is_who |
| SeekStorm 0.13.0 | Mixed Bigrams | 9,940 | 6.56 | who_is who |
| SeekStorm 0.13.0 | Mixed Bigrams/Frequent Trigrams | 80 | 815.25 | who_is_who |
| SeekStorm 0.13.0 | Mixed Bigrams/Mixed Trigrams | 60 | 1,087.00 | who_is_who |


“to be or not to be”

| Library | N-gram type | latency (µs) | factor | N-gram query rewriting |
|---|---|---|---|---|
| SeekStorm 0.13.0 | Single Terms | 60,880 | 1.00 | to be or not to be |
| SeekStorm 0.13.0 | Frequent Bigrams | 2,190 | 27.80 | to_be or_not to_be |
| SeekStorm 0.13.0 | Frequent Bigrams/Frequent Trigrams | 220 | 276.73 | to_be_or not_to_be |
| SeekStorm 0.13.0 | Frequent Bigrams/Mixed Trigrams | 240 | 253.67 | to_be_or not_to_be |
| SeekStorm 0.13.0 | Mixed Bigrams | 4,080 | 14.92 | to_be or_not to_be |
| SeekStorm 0.13.0 | Mixed Bigrams/Frequent Trigrams | 250 | 243.52 | to_be_or not_to_be |
| SeekStorm 0.13.0 | Mixed Bigrams/Mixed Trigrams | 170 | 358.12 | to_be_or not_to_be |


“pump it up”

| Library | N-gram type | latency (µs) | factor | N-gram query rewriting |
|---|---|---|---|---|
| SeekStorm 0.13.0 | Single Terms | 2,220 | 1.00 | pump it up |
| SeekStorm 0.13.0 | Frequent Bigrams | 200 | 11.10 | pump it_up |
| SeekStorm 0.13.0 | Frequent Bigrams/Frequent Trigrams | 210 | 10.57 | pump it_up |
| SeekStorm 0.13.0 | Frequent Bigrams/Mixed Trigrams | 50 | 44.40 | pump_it_up |
| SeekStorm 0.13.0 | Mixed Bigrams | 310 | 7.16 | pump_it up |
| SeekStorm 0.13.0 | Mixed Bigrams/Frequent Trigrams | 310 | 7.16 | pump_it up |
| SeekStorm 0.13.0 | Mixed Bigrams/Mixed Trigrams | 70 | 31.71 | pump_it_up |


Not everything that is possible makes sense for everybody and every workload. Here we propose sensible trade-offs between query performance improvement and index size cost, while avoiding diminishing returns.

Looking at the benchmark results above and the associated index size cost, at first glance, it looks unreasonable to accept a 3x higher index size.
But in order to reduce tail latency by a factor of 8, it is much easier and less expensive to increase disk capacity by a factor of 3 than to get 8x faster processors or to increase the number of nodes by a factor of 8.
Furthermore, for medium-sized indices (<100 GB), index size is usually no concern at all, but (tail) latency is.

When minimum index size is more important than phrase query latency, we recommend Single Terms:

```rust
NgramSet::SingleTerm as u8
```

For a good balance of latency and index size cost, we recommend Single Terms + Frequent Bigrams + Frequent Trigrams:

```rust
NgramSet::SingleTerm as u8 | NgramSet::NgramFF as u8 | NgramSet::NgramFFF as u8
```

When minimal phrase query latency is more important than low index size, we recommend Single Terms + Mixed Bigrams + Frequent Trigrams:

```rust
NgramSet::SingleTerm as u8
    | NgramSet::NgramFF as u8
    | NgramSet::NgramFR as u8
    | NgramSet::NgramRF as u8
    | NgramSet::NgramFFF as u8
```

Summary

N-Gram indexing improved latency for some phrase queries, at the cost of higher index size and indexing time. The improvements for tail latencies are even higher than for mean query latency.

Using N-gram indexing, the following phrase search improvements have been measured:

  • mean query latency by factor 2.18 (118%)
  • p99.9 tail latency by factor 7.63 (663%)
  • Some queries are up to 3 orders of magnitude faster: e.g., “the who” by a factor of 1687 (168,600%)

To implement N-gram indexing for faster phrase search, 30% of 25K LoC of the SeekStorm codebase have been added, removed, or changed.

Further improvements

So far, we have indexed the N-grams that occurred in the documents, independent of the likelihood that they will appear in queries. With a probabilistic approach (i.e., via query log file analysis), we can judge how likely certain N-grams are to appear in queries, and we can abstain from indexing less likely N-grams.

The more likely N-grams can be stored in a Bloom filter, which is consulted during indexing and search to determine whether or not to combine terms into N-grams. This will significantly reduce the index size and indexing time overhead of N-gram indexing, while mostly maintaining its query latency benefits.
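
A self-contained sketch of such a gate (the Bloom filter here is a toy implementation for illustration; any production-grade filter would do):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Minimal Bloom filter sketch: k probe positions per key, derived from two
// std hashes via double hashing. Illustrative only, not production quality.
struct Bloom {
    bits: Vec<u64>,
    m: u64, // number of bits
    k: u64, // number of probes per key
}

impl Bloom {
    fn new(m: u64, k: u64) -> Self {
        Bloom { bits: vec![0; ((m + 63) / 64) as usize], m, k }
    }

    fn probes(&self, key: &str) -> Vec<u64> {
        let mut a = DefaultHasher::new();
        key.hash(&mut a);
        let h1 = a.finish();
        let mut b = DefaultHasher::new();
        (key, 1u8).hash(&mut b);
        let h2 = b.finish() | 1;
        (0..self.k).map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % self.m).collect()
    }

    fn set(&mut self, key: &str) {
        for i in self.probes(key) {
            self.bits[(i / 64) as usize] |= 1 << (i % 64);
        }
    }

    fn contains(&self, key: &str) -> bool {
        self.probes(key)
            .iter()
            .all(|&i| (self.bits[(i / 64) as usize] >> (i % 64)) & 1 == 1)
    }
}

fn main() {
    // Populate with N-grams deemed likely from query log analysis.
    let mut likely = Bloom::new(1 << 20, 7);
    likely.set("the_who");
    likely.set("to_be_or");

    // At indexing (and search) time, only materialize N-grams the filter accepts.
    for candidate in ["the_who", "the_doors"] {
        if likely.contains(candidate) {
            println!("index N-gram {candidate}");
        } else {
            println!("skip N-gram {candidate}, index single terms only");
        }
    }
}
```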

The achieved space savings can be used to increase the number of terms deemed to be frequent, and thus further improve (tail) latencies for more phrase queries.

Another improvement is N-grams with negative offsets: instead of rewriting “who is who” to “who_is who”, we could rewrite it to “who_is is_who”, where the second bigram would have an offset of -1 for the phrase check. This would lead to a shorter posting list for the second bigram ‘is_who’ compared to the posting list of the single term ‘who’, reducing latency.
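
A sketch of the corresponding phrase check, which allows an arbitrary (possibly negative) relative offset between two posting lists:

```rust
// Check whether some position of N-gram A has a matching position of N-gram B
// at A + offset. For "who_is is_who", B starts one token after A (offset 1),
// which is one position earlier (-1) than the slot the trailing single term
// "who" would occupy in the rewriting "who_is who".
fn offset_match(pos_a: &[u32], pos_b: &[u32], offset: i64) -> bool {
    pos_a.iter().any(|&p| {
        let target = p as i64 + offset;
        target >= 0 && pos_b.binary_search(&(target as u32)).is_ok()
    })
}

fn main() {
    // "who_is" occurs at token positions 7 and 40; "is_who" at position 8.
    assert!(offset_match(&[7, 40], &[8], 1));
}
```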

Source code

The SeekStorm sub-millisecond lexical search library & multi-tenancy server is implemented in Rust,
open source licensed under the Apache License 2.0, and available on GitHub.
