Vector search vs. Keyword search - Data structures and algorithms
Photo by Mateusz Wacławek
This is part 2 of a 4-part blog post series on Vector vs. keyword search:
- War of the worlds vs. we come in peace
- Data structures and algorithms
- LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search
Vector vs. keyword search 2: Data structures and algorithms
Now lets’s take a closer look at the inner workings of both keyword search and vector search. Only if we take the time to really understand the core principles of both approaches we will be able to identify where exactly differences and similarities lie and how we could utilize synergies and combine technologies in a way that they are complimentary.
Linear search is the most straightforward and most inefficient way to search. All documents are sequentially scanned for the occurrence of the query keyword. That means that for each single query we have to evaluate each single byte of all the indexed documents. Query time doesn’t scale for a large number of documents, especially if the index size exceeds the RAM limit and we need to index the documents on HDD or SSD.
The benefits of linear search are the simple implementation of both indexing and search, a high sequential writing speed, as well as a minimal index storage size.
The inverted index is a set of posting lists - one for each keyword. A posting list is an ordered list of document id, that identify all documents where that specific keyword occurs. sometimes it also contains positional information, where in the document that keyword occurs to allow for phrase and proximity search.
The benefit of an inverted index is the high lookup speed, where a query is reduced to a single random access. That comes at the cost of increased storage size, and in the naive implementation with a slow index speed.
Log-structured merge tree
The Log-structured merge tree (LSM tree) has the function to increase the indexing speed. When a document is indexed and parsed into single terms, those terms arrive in a quasi-random order. But the inverted index requires that those terms and the associated document id are stored into the inverted index, specifically into the posting lists that are ordered by keywords. In the naive approach that would require for each term a random access to the posting list. As long as the index is small and fits into RAM that is no problem. But if we want that the index size can scale beyond RAM and we need to index to HDD or SSD than that is a challenge. Not only that writing speed into SSD is slower than into RAM, also the random access writing speed is by orders of magnitude slower than the sequential writing speed.
Here comes the LSM data structure into play. It allows to sequentially index the stream of keyword/docid pairs, but later transforms over several intermediate steps those data into random-access oriented posting list. That allows to index with sequential speed, but to search with random access.
And random access at query time is exactly what we want, as we don’t want to scan the whole index of several GB or TB for each query. And the slower random access speed doesn’t matter for a single access to the index, compared to billions of sequential accesses.
The benefit of an inverted index with LSM tree is the increased indexing speed, that allows to utilize slow HDD or SDD store to scale for large document numbers. It comes at the cost of a much more complex implementation, of an increased index storage size, and of write amplification that could reduce the lifetime of a SSD with its limited number of allowed writes.
- part of the transformation process to turn sequential writes into random-access oriented posting list
- part of a periodically routine, that restores index consistency. In an LSM deletes and updates are not directly performed in the posting list as this would require slow random access. instead for each new delete and update a new entry is sequentially writen into the LSM, and during search only the latest entry is taken into account. This causes that updated and deleted documents occur multiple times within the index, clutter the index, take up space and prolonge the query latency. Compaction has the function to regularly consolidate the index, to remove all tombstones and to integrate all occurrences of the same document into one.
- Size-tiered compaction
- Leveled compaction
- SeekStorms sharded compaction
- index compression
- document compression with zstandard
- posting list compression with Elias-Fano compression or other Posting list compression algorithms
- n-gram indexing
In vector search both documents and queries are transformed into vectors. Then the documents vectors are stored into Nearest Neighbor Indexes to allow for Similarity Search. There are different architectures of Nearest Neighbor Indexes. All of them try with K Nearest Neighbors search to return a number of top-k documents where the document vector is closest to the query vector according to a specific similarity metric:
There is a very nice overview of vector search solution by Dmitry Kan
Turning documents into vectors
Although today the terms vector search and similarity search are used synonymously, we should remember that the embedding step and the resulting semantic similarity search has not been part of vector search from the beginning. At first only vectors of word occurrences have been compared using cosine similarity. That already enabled a kind of similarity search, where not all query keywords were required to appear within the result documents. Only with the advent of machine learning an extra step was added, that turned term vectors into vectors of meanings (dimensions/features), and enabled a whole new perspective in search: searching for semantic similarity.
Today artificial intelligence (AI) methods like machine learning (ML), specifically deep learning (DL) are used to first capture the meaning of words by performing statistical co-occurrence analysis of huge document repositories. This is done by tokenizing the document text, and then calculating for each unique token a word embedding, which is a vector of a certain number (e.g. 768) of dimensions/features. Then, based on the precalculated word embedding data we can during indexing turn our documents of words into vectors of meanings. The whole process is usually an very expensive in terms of time and processing power, often with quadratic scaling properties in regards to the number of documents.
It is done in two steps:
- capture (learn) the meaning of words by statistical co-occurrence analysis of huge document repositories
- turn our documents of words into vectors of meanings, based on the previously precalculated word embedding data
Luckily we can save the first step, as the precalculated word embedding data is released as open-source on the web, e.g. for Google BERT.
Tokenizing text with WordPiece
- Word level embeddings from BERT
- Sentence level embeddings from BERT
Let’s have a look at the three most frequently used vector index methods:
- Linear search
- Hierarchical Navigable Small Worlds (HNSW)
- Index partitioning/clustering
Sometimes also called flat index, linear search is the most straightforward and most inefficient way to search. All document vectors are sequentially compared to the query vector.
The benefit of linear search is the straightforward implementation, a high sequential indexing speed and its perfect recall. The drawback is the limited scaling for large document numbers and slow query speed.
Hierarchical Navigable Small Worlds (HNSW)
We trade higher search speed for approximate search results with limited recall.
The index is partitioned into clusters of similar vectors. The lookup is divided into two steps. First we find the top-c clusters most similar to the query vector, and secondly we search only within those top-c clusters for the top-k documents most similar to the query vector.
The partitioning into clusters can be achieved by different methods
- k-medoids clustering
- k-medians clustering
- k-means/k-centroids clustering
- Locality Sensitive Hashing
- Support Vector machines
With clustering, we trade higher search speed for approximate search results with limited recall.
An example of index clustering with Voronoi diagrams (Dirichlet tessellation) is the FAISS.IndexIVFFlat (IVF = Inverted File Index).
An example of index clustering with Locality Sensitive Hashing is the FAISS.IndexLSH
Other vector search approaches
- document compression with zstandard
- vector quantization (VQ)
- Principal component analysis (PCA) is used for Dimensionality reduction to reduce vectors with large dimensions to smaller dimensions.
- Product Quantization (PQ) is used for compressing and storing vectors of large dimensions
- Optimized Product Quantization (OPQ)
- CPU or GPU hardware acceleration of vector arithmetics
- SIMD (Single instruction, multiple data) provides hardware support for performing an operation on multiple pieces of data, in parallel, using a single instruction. Variants of CPU hardware acceleration of vector arithmetics include Streaming SIMD Extensions SSE, SSE2, SSE3 SSSE3 and SSE4 and Advanced vector extensions AVX, AVX2 and AVX-512.
Speed vs. Accuracy
In IVF based algorithms there is a trade-of between recall vs. indextime vs. query time:
- for the same recall, we can trade-off between indextime vs query time
- additionally we can trade recall for both indexing speed and query speed
- nlist, nprobe and different clustering methods are parameters to balance this