Typo-tolerant Query auto-completion (QAC) - derived from indexed documents

— work in progress, live reporting from the lab 😀 —
TL;DR: Typo-tolerant Query auto-completion (QAC), derived from indexed documents.
Providing the user with query completions while typing has many benefits: it saves time, prevents and corrects spelling mistakes, ensures queries are aligned with index content, and helps users formulate queries if they are unsure about the terms, their spelling, or the content of the index.
But implementing QAC at scale also comes with many challenges: how to collect and select the completions with low memory requirements, low performance impact, and high relevance, and how to rank and serve them to the user with low latency, even for large indices and many concurrent users.
In this blog post we analyze those challenges, explore the solution space, and present our open-source implementation: both our standalone library SymSpellComplete and its integration into our SeekStorm search library and server.
This blog post describes two components of our recent work to build a typo-tolerant Query auto-completion (QAC) system, where the completions are derived from indexed documents as opposed to query logs. We discuss our research into prior art and related work, algorithms, data structures, challenges in scaling and resource consumption, experiments, and our final solution.
SymSpellComplete, a typo-tolerant autocomplete library in Rust.
Like a combination of SymSpell (repo) and PruningRadixTrie (repo), but better:
- Query terms might contain spelling errors, even in the first letter of an incomplete term:
  blu kura -> blue curacao
- Handles missing and mistakenly inserted spaces:
  modernart -> modern art
- Supports differing order between query and completion terms:
  university dre -> dresden university of technology
- Allows out-of-vocabulary terms (not all query terms are present in completions):
  teal merino cardi -> merino cardigan
- The SymSpellComplete library is implemented in Rust, and available as open source under the MIT license.
Typo-tolerant Query Auto-Completion (QAC) in the SeekStorm full-text search library & multi-tenancy server, using SymSpellComplete.
- The completions are automatically derived in real-time from indexed documents, not from a query log.
- works even if no query log is available, e.g. for domain-specific or newly created indices, or indices with few users.
- works for new or domain-specific terms.
- allows out-of-the-box domain specific suggestions
- prevents inconsistencies between completions and index content: query logs contain invalid queries, e.g. because they were sourced from a different index or because users don't know the content of your index.
- works for the long tail of queries that never made it into a log.
- suggestions tailored per index.
- completion prevents spelling errors and saves time
- language-, content- and domain-independent
- additionally allows the use of hand-crafted completion files
- Ghosting: highlighting the suggested text within the search box in the UI
- The SeekStorm lexical search library and server are implemented in Rust, and available as open-source under the MIT license.
Requirements
- prefix or substring lookup
- typo-tolerant completion
- deal with spelling errors in both complete and incomplete terms, including the first letter
- word-order tolerant completion
- identify completions where all the query terms appear as prefixes of some of the terms of the completion, regardless of their order.
- score completions by word and n-gram frequency, derived from the corpus
- low resource impact of completion generation, storage, and lookup: RAM, disk, processor, IO
- fast lookup time
- dynamic, self-learning from a stream of documents as opposed to immutable one-time batch generation
- scaling: preventing unlimited growth of completions lists and resource consumption as the index grows
- Integration of completions into the UI
Completion data sources
There are three approaches to obtain and distill query completion data:
From query logs
Deriving completions from a query log is the most popular approach, with both pros and cons:
- user curated: real users with quantifiable interest in those queries
- logs with invalid queries lead to invalid completions: misspelled terms, or queries that return no results because no matching documents are indexed.
- outdated query logs lead to missing completions: query popularity shifts, and new or domain-specific terms and queries are absent. Often suggestions are created once, and updated not at all or only with delay when new documents with new terms are indexed.
- difficult to obtain: representative and statistically sufficient query logs are hard to come by, especially for new and domain-specific indices with limited prior usage.
- privacy concerns: see https://en.wikipedia.org/wiki/AOL_search_log_release
- divergence between completions and indexed content: completions might be suggested for which no matching documents can be found in the index.
- due to the long-tail nature of queries, most queries have never been seen before. This also means that, for most queries, completion based solely on query logs won't work.
See AmazonQAC: A Large-Scale, Naturalistic Query Autocomplete Dataset
From document corpus
Deriving completions from indexed documents is an alternative approach, also with both pros and cons:
- The completions are automatically derived in real-time from indexed documents, not from a query log:
- works even if no query log is available, e.g. for domain-specific or newly created indices, or indices with few users.
- works for new or domain-specific terms.
- allows out-of-the-box domain specific suggestions
- prevents inconsistencies between completions and index content: query logs contain invalid queries, e.g. because they were sourced from a different index or because users don't know the content of your index.
- works for the long tail of queries that never made it into a log.
There is an interesting paper, Query Suggestions in the Absence of Query Logs by Sumit Bhatia, Debapriyo Majumdar, and Prasenjit Mitra, that introduces the general idea of deriving query completions by extracting uni-grams, bi-grams and tri-grams from the document corpus instead of from the usual query log. Using the extracted set of completion n-grams, the paper also investigates the ranking and precision of query completions generated by different query suggestion methods.
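To make the paper's core idea concrete, the n-gram extraction step can be sketched in a few lines of Rust. This is an illustrative minimal version with a naive tokenizer and no frequency thresholds or capping; the function name and signature are ours, not the paper's or SeekStorm's:

```rust
use std::collections::HashMap;

/// Accumulate word uni-, bi- and tri-gram frequencies from one document
/// into `counts`; calling this once per document supports a document stream.
fn extract_ngrams(text: &str, counts: &mut HashMap<String, u32>) {
    // Naive tokenizer: lowercase, split on non-alphanumeric characters.
    let lower = text.to_lowercase();
    let tokens: Vec<&str> = lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .collect();
    // Count all n-grams for n = 1..=3.
    for n in 1..=3 {
        for window in tokens.windows(n) {
            *counts.entry(window.join(" ")).or_insert(0) += 1;
        }
    }
}

fn main() {
    let mut counts = HashMap::new();
    extract_ngrams("Query auto-completion saves time", &mut counts);
    extract_ngrams("query auto completion", &mut counts);
    // The bi-gram "query auto" was seen in both documents.
    println!("{:?}", counts.get("query auto"));
}
```

The resulting frequencies can serve directly as completion scores; the production concerns the paper leaves open (thresholds, capping, compression, typo tolerance) would be layered on top of this step.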
Unfortunately, many considerations required for production readiness are missing:
- there are no considerations about scaling the approach to large indices,
- RAM, disk and processor consumption and limits,
- n-gram extraction performance,
- lookup performance,
- completeness of suggestions,
- effective storage and compression of completions,
- measures to reduce resource consumption (term frequency thresholds, capping the number of stored completions) while achieving a good cost (processor, disk space, RAM consumption) vs. precision and recall ratio,
- real-time updates and availability of suggestions from a document stream, as opposed to one-time batch processing leading to static, immutable completions,
- how to deal with non-uniformly distributed terms and n-grams, e.g. new products occurring late, or only within a specific range of documents,
- typo-tolerance, especially when typos at the beginning of the word prevent the use of tries.
Additional considerations
- By specifying from which fields of the documents the completions should be derived, e.g. the title or product name field, we can increase the relevancy (precision) of completions at the cost of completeness (recall).
From result documents (SERP)
Another approach is extracting completions from previous search results, given that we are using instant search, where we return results for every keystroke, or at least for every term that has been completely entered. If we search for “data encryption algor…”, there has previously been the query “data encryption”. In the search results of that query we can analyze which terms follow the matches of the query terms within each returned document. In effect, we extract completions on-the-fly from previously returned documents.
Again, there are pros and cons:
- no extraction preprocessing of query log or corpus required
- no storage of completion lists required
- for small result sets the extracted suggestions will be limited. The relevance of search results doesn’t ensure the relevance and breadth of completions.
- for more complete and relevant suggestions we need to increase the result set of previous queries, introducing extra query latency and resource consumption.
- extraction of completions from a large number of large result documents introduces extra suggestion latency.
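A minimal sketch of this on-the-fly extraction, assuming a simple whitespace tokenizer and counting only the completed term and its immediate successor. The function name and scoring are hypothetical, for illustration only, and not SeekStorm's implementation:

```rust
use std::collections::HashMap;

/// SERP-based completion extraction: for each token in the result
/// documents that completes the user's incomplete last term, count the
/// completed token and the two-word phrase formed with its successor.
fn extract_from_results(incomplete: &str, docs: &[&str]) -> Vec<(String, u32)> {
    let mut counts: HashMap<String, u32> = HashMap::new();
    for doc in docs {
        let lower = doc.to_lowercase();
        let tokens: Vec<&str> = lower.split_whitespace().collect();
        for (i, token) in tokens.iter().enumerate() {
            if token.starts_with(incomplete) {
                *counts.entry(token.to_string()).or_insert(0) += 1;
                if let Some(next) = tokens.get(i + 1) {
                    *counts.entry(format!("{} {}", token, next)).or_insert(0) += 1;
                }
            }
        }
    }
    // Rank candidates by frequency, ties alphabetically.
    let mut ranked: Vec<(String, u32)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked
}

fn main() {
    let docs = [
        "Data encryption algorithms overview",
        "fast encryption algorithm design",
    ];
    println!("{:?}", extract_from_results("algor", &docs));
}
```

Because extraction runs per keystroke over the current result set, the latency and completeness caveats listed above apply directly to the size of `docs`.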
Completion lookup methods
Once we have derived the completions, either from query logs or from the document corpus, we need to look up the matching completions for a given prefix or substring of an incomplete query. There are many feasible approaches, differing in result type, compactness and lookup time:
prefix search
- Pruning radix trie
- Radix trie / patricia trie
- BTreeMap with ranges
- Ternary search tree
- other Tries
- Completion Trie (CT), Score-Decomposed Trie (SDT), and RMQ Trie (RT)
- DynSDT
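Among the structures listed above, the BTreeMap-with-ranges approach is the easiest to sketch: because the map is sorted, all completions sharing a prefix form one contiguous key range. A minimal illustrative version (scores and names are made up for the example):

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Included, Unbounded};

/// Top-k prefix lookup over a sorted map of completion -> score.
/// Seek to the prefix, then stop at the first key that no longer matches:
/// in a sorted map all prefix matches are contiguous.
fn prefix_lookup(map: &BTreeMap<String, u32>, prefix: &str, top_k: usize) -> Vec<(String, u32)> {
    let mut hits: Vec<(String, u32)> = map
        .range::<str, _>((Included(prefix), Unbounded))
        .take_while(|(key, _)| key.starts_with(prefix))
        .map(|(key, score)| (key.clone(), *score))
        .collect();
    // Rank by descending score, then truncate to top-k.
    hits.sort_by(|a, b| b.1.cmp(&a.1));
    hits.truncate(top_k);
    hits
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert("blue curacao".to_string(), 10);
    map.insert("blue cheese".to_string(), 25);
    map.insert("black tea".to_string(), 50);
    // A single typo in the prefix ("bku") already returns nothing,
    // illustrating the typo-intolerance of plain prefix structures.
    println!("{:?}", prefix_lookup(&map, "blue", 2));
}
```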
Drawbacks of all tree/trie-based prefix-search algorithms:
- Not typo-tolerant
- Not word-order tolerant
substring search
- Indexed regular expression search based on character-wise 3-grams (trigrams).
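The idea behind trigram-based substring search can be sketched as follows: index every 3-character window of each completion, intersect the candidate sets for the query's trigrams, then verify the survivors. This is a toy version; real implementations use compressed posting lists rather than hash sets:

```rust
use std::collections::{HashMap, HashSet};

/// Character-trigram index: each 3-character window maps to the set of
/// completion ids containing it.
struct TrigramIndex {
    completions: Vec<String>,
    trigrams: HashMap<String, HashSet<usize>>,
}

impl TrigramIndex {
    fn new() -> Self {
        TrigramIndex { completions: Vec::new(), trigrams: HashMap::new() }
    }

    fn insert(&mut self, completion: &str) {
        let id = self.completions.len();
        self.completions.push(completion.to_string());
        let chars: Vec<char> = completion.chars().collect();
        for window in chars.windows(3) {
            let tri: String = window.iter().collect();
            self.trigrams.entry(tri).or_default().insert(id);
        }
    }

    fn substring_search(&self, query: &str) -> Vec<&str> {
        let chars: Vec<char> = query.chars().collect();
        if chars.len() < 3 {
            // Too short for trigrams: fall back to a linear scan.
            return self.completions.iter()
                .filter(|c| c.contains(query))
                .map(String::as_str)
                .collect();
        }
        let mut candidates: Option<HashSet<usize>> = None;
        for window in chars.windows(3) {
            let tri: String = window.iter().collect();
            let ids = match self.trigrams.get(&tri) {
                Some(ids) => ids,
                None => return Vec::new(),
            };
            candidates = Some(match candidates {
                None => ids.clone(),
                Some(c) => c.intersection(ids).copied().collect(),
            });
        }
        // Trigram co-occurrence is necessary but not sufficient: verify.
        let mut hits: Vec<&str> = candidates.unwrap_or_default().into_iter()
            .map(|id| self.completions[id].as_str())
            .filter(|c| c.contains(query))
            .collect();
        hits.sort();
        hits
    }
}

fn main() {
    let mut index = TrigramIndex::new();
    index.insert("modern art");
    index.insert("art history");
    index.insert("postmodernism");
    println!("{:?}", index.substring_search("modern"));
}
```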
conjunctive search
Conjunctive search identifies completions where all the query terms appear as prefixes of some of the terms of the completion, regardless of their order.
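A simple greedy version of this check can be sketched as follows. This is a first-fit sketch of the general idea, not SeekStorm's implementation; a production matcher would additionally need typo tolerance and proper handling of ambiguous term assignments:

```rust
/// Check whether every query term is a prefix of some distinct
/// completion term, regardless of order: "university dre" matches
/// "dresden university of technology".
fn conjunctive_match(query: &str, completion: &str) -> bool {
    let comp_terms: Vec<&str> = completion.split_whitespace().collect();
    // Each completion term may satisfy at most one query term.
    let mut used = vec![false; comp_terms.len()];
    query.split_whitespace().all(|q| {
        match (0..comp_terms.len()).find(|&i| !used[i] && comp_terms[i].starts_with(q)) {
            Some(i) => { used[i] = true; true }
            None => false,
        }
    })
}

fn main() {
    // Word-order tolerant: the query terms match in any order.
    println!("{}", conjunctive_match("university dre", "dresden university of technology"));
}
```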
UI integration of Completions
Ghosting is the process of auto-completing a search recommendation by highlighting the suggested text inline, i.e. within the search box. Ghosting increases the acceptance of offered suggestions, reduces misspelled searches, and improves sales.
