SeekStorm's LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search
This is part 3 of a 4-part blog post series on Vector vs. keyword search:
- War of the worlds vs. we come in peace
- Data structures and algorithms
- LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search
- Benchmarking
Vector vs. keyword search 3: LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search
While the first part concentrated on the more philosophical aspects of evolving technology and choosing an appropriate architecture, and the second part took a closer look at the fundamentals of both vector search and keyword search, this third part is about how SeekStorm as a deep-tech start-up is committed to putting its money where its mouth is and delivering on our ideas. As always, uncompromised performance and scaling are paramount for us.
I am convinced that you will never fully understand the inner workings of a complex system, its challenges and bottlenecks, and the compromises to be made unless you implement it yourself from scratch. Only then will you be able to improve on it, rather than be limited by the glass ceiling set by legacy solutions.
When I started working on approximate string search in large dictionaries, I was always pointed to Peter Norvig's Python code for spelling correction. The code was very concise and the accompanying post very educational, but performance-wise I was somewhat disappointed. That was the inspiration for me to start SymSpell, an approximate string search and spelling correction library that is 1,000,000 times faster than Norvig's code for edit distance 3.
With the Pruning Radix Trie, we developed a novel data structure derived from the radix trie, but three orders of magnitude faster.
So, while I really do appreciate and respect the work that went into popular solutions, I have learned there is always room for improvement and I’m convinced that the future of search has just begun.
How about vector search? I have been following the advent of vector search, deep learning, and embeddings for quite some time, and I am really excited about the opportunities it enables. But in contrast to others, I'm not ready to dispose of the proven concept of the inverted index altogether. I see benefits in both approaches: inverted index search offers precise search with perfect recall and impressive performance and scaling properties, while vector search brings the very useful concept of semantic similarity to search. But currently, vector search is not yet competitive in terms of performance and scaling for billion-scale search and beyond.
That spurred my interest. Could we do better, and how? What are the challenges, and how can we address them? As always, I took the best state-of-the-art technology as a benchmark baseline for our own research and prototype implementation. Facebook's FAISS.IndexFlatIP, FAISS.IndexHNSWFlat, and FAISS.IndexLSH are a good starting point.
Challenges in vector search
What are the challenges in a vector search architecture?
- scaling to large document counts: either the index size or the indexing time required for “construction” or “learning” grows quadratically
- indexing / precalculation latency
- search latency
- recall in approximate nearest neighbor search (ANNS)
- expensive cosine similarity calculation for large vectors with many dimensions/features (see the sketch after this list)
- index size
- CRUD support and incremental indexing and updating: many indices are immutable and write-once; they require all documents of the corpus at once for the “learning” phase before indexing into the “learned” index can start, and they require re-learning/re-indexing of the whole index whenever something is changed, updated, or deleted
- true real-time search capability including concurrent indexing/updating and searching
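To illustrate the cosine-cost point from the list above: if all vectors are L2-normalized once at indexing time, cosine similarity reduces to a plain dot product, which SIMD-vectorized kernels (e.g., the BLAS behind NumPy) execute very efficiently. A minimal, illustrative sketch in Python, with sizes and names made up for the example:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # L2-normalize so that cosine similarity becomes a plain dot product
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

docs = normalize(np.random.rand(100_000, 768).astype(np.float32))  # BERT-sized embeddings
query = normalize(np.random.rand(768).astype(np.float32))

scores = docs @ query             # SIMD-accelerated matrix-vector product (BLAS)
top10 = np.argsort(-scores)[:10]  # ids of the 10 most cosine-similar documents
```

Even then, one exhaustive pass over a billion 768-dimensional float32 vectors touches roughly 3 TB of data, which is why the non-exhaustive methods discussed next matter.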
Apart from the embeddings themselves (e.g., BERT) being generated by deep learning, there is nothing very specific about vector search. It's all about the usual trade-offs one has to make when designing a high-performance search engine architecture (indexing performance vs. search performance, storage space vs. storage cost vs. storage speed) and all the usual suspects to deal with them: partitioning, hierarchies, levels, segments, LSM trees and tries, concurrency, locking (or lock-free designs), compression, hardware acceleration, RAM/SSD hybrid strategies, persistence, crash safety, recovery strategies, etc. We are right at home here. Being in full control of every bit of our architecture allowed us to integrate vector search as a first-class citizen on par with keyword search. SeekStorm adds vector search to its tool belt, not as an afterthought to keyword search, but as an equally genuine capability, with full performance and no restrictions.
SeekStorm's neural search combines AI and machine learning, specifically deep learning and embeddings, with high-performance, highly scalable vector search into a semantic search capability that understands meanings, concepts, similarity, and synonyms.
So let us introduce SeekStorm's LSMT-IVF, a new data structure for billion-scale approximate nearest neighbor search.
SeekStorm's new vector search algorithm LSMT-IVF is a combination of a log-structured merge tree (LSMT) and an inverted file index (IVF).
In Facebook's FAISS.IVF method, the vectors are partitioned into nlist clusters. At search time, we first identify the nprobe closest clusters, and then search only within those nprobe clusters for the topk closest vectors. By limiting the search to the nprobe closest clusters, we turn an exhaustive search into a non-exhaustive, approximate nearest neighbor search with recall < 1.
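To make nlist, nprobe, and topk concrete, this is roughly what the FAISS.IVF baseline looks like via the FAISS Python API (the parameter values here are illustrative, not tuned):

```python
import faiss  # pip install faiss-cpu
import numpy as np

d, nlist, nprobe, topk = 128, 1024, 16, 10             # illustrative values
xb = np.random.rand(1_000_000, d).astype(np.float32)   # corpus vectors
xq = np.random.rand(100, d).astype(np.float32)         # query vectors

quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer over the cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)  # partitions the vectors into nlist clusters
index.train(xb)                                  # k-means "learning" phase, needs the corpus upfront
index.add(xb)
index.nprobe = nprobe                            # probe only the nprobe closest clusters per query
D, I = index.search(xq, topk)                    # distances and ids of the approximate top-k
```

Raising nprobe improves recall at the cost of query latency; raising nlist makes each probed cluster smaller and queries faster, but training slower.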
Trade-offs in IVF-based algorithms:
- for the same recall, we can trade off indexing speed against query speed
- additionally, we can trade recall for both indexing speed and query speed
- nlist, nprobe, and different clustering methods are parameters to balance this trade-off
The log-structured merge tree (LSMT) data structure allows indexing to relatively slow memory (HDD, SSD) at high insert volume. It exploits the fact that both HDDs and SSDs have a much higher sequential write rate than random-access write rate. The algorithm rewrites the data sequentially over several levels until it is segmented in a way that allows random access at query time. SeekStorm uses sampling-based k-medoid clustering instead of Lloyd's k-means centroid clustering (Voronoi iteration or relaxation) as used in FAISS.IVF.
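SeekStorm's actual clustering code is not public, so the following is only a minimal, generic sketch of the sampling-based k-medoid idea: cluster a bounded random sample of the corpus, and pick actual data points (medoids) as cluster representatives instead of synthetic k-means centroids. All function names and parameter values are assumptions made for the example:

```python
import numpy as np

def sq_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
    return (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * (a @ b.T)

def sample_k_medoids(x: np.ndarray, k: int, sample_size: int = 5_000,
                     iters: int = 10, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    sample = x[rng.choice(len(x), min(sample_size, len(x)), replace=False)]
    medoids = sample[rng.choice(len(sample), k, replace=False)].copy()
    for _ in range(iters):
        assign = sq_dist(sample, medoids).argmin(axis=1)  # nearest medoid per sampled point
        for j in range(k):
            members = sample[assign == j]
            if len(members):
                # the new medoid is the member with minimal total distance to its cluster
                medoids[j] = members[sq_dist(members, members).sum(axis=1).argmin()]
    return medoids  # actual data points, usable as IVF cluster representatives
```

Because the clustering runs on a bounded sample rather than the full corpus, its cost stays independent of corpus size; medoids also work with arbitrary similarity metrics, since no averaging of vectors is required.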
Combining LSMT and IVF allows for:
- incremental training/indexing
- true real-time search
- a RAM/SSD hybrid index for large index sizes
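A toy sketch of how such a combination could work is shown below. All class and method names are hypothetical, not SeekStorm's actual API; persistence, document ids, concurrency, and background merging are omitted, and a trivial random clustering stands in for the k-medoid clustering described above:

```python
import heapq
import numpy as np

class LsmtIvfIndex:
    def __init__(self, dim: int, memtable_cap: int = 4_096,
                 nlist: int = 32, nprobe: int = 4):
        self.dim, self.memtable_cap = dim, memtable_cap
        self.nlist, self.nprobe = nlist, nprobe
        self.memtable: list[np.ndarray] = []  # mutable in-RAM segment, searched exhaustively
        self.segments: list[tuple[np.ndarray, list[np.ndarray]]] = []  # immutable IVF segments

    def add(self, v: np.ndarray) -> None:
        self.memtable.append(v.astype(np.float32))  # insert is a cheap in-RAM append
        if len(self.memtable) >= self.memtable_cap:
            self._flush()  # written sequentially to SSD in a real implementation

    def _flush(self) -> None:
        x = np.stack(self.memtable)
        reps = x[np.random.choice(len(x), self.nlist, replace=False)]  # stand-in clustering
        assign = np.linalg.norm(x[:, None] - reps[None], axis=-1).argmin(1)
        self.segments.append((reps, [x[assign == j] for j in range(self.nlist)]))
        self.memtable = []

    def search(self, q: np.ndarray, topk: int) -> list[float]:
        dists = [float(np.linalg.norm(q - v)) for v in self.memtable]  # real-time visibility
        for reps, lists in self.segments:  # probe the nprobe nearest clusters per segment
            for j in np.linalg.norm(reps - q, axis=-1).argsort()[: self.nprobe]:
                dists.extend(np.linalg.norm(lists[j] - q, axis=-1).tolist())
        return heapq.nsmallest(topk, dists)  # merged top-k distances across all segments
```

In a full LSM design, a background merge process would additionally compact many small segments into fewer large ones, re-clustering each merged segment, so that the number of segments a query has to probe stays bounded while inserts remain sequential, append-only writes.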
Benchmarking SeekStorm's new vector search algorithm LSMT-IVF against FAISS.IVF for 1 million documents on the SIFT1M dataset:
37x faster learning/indexing speed and 3x higher search speed (nlist=4096), or
3x faster learning/indexing speed and 4x higher search speed (nlist=1024),
at the same recall (97%), on the same hardware (CPU, multithreaded).
FAISS is written in C++ and compiled ahead-of-time to machine code, while LSMT-IVF is written in C# and compiled just-in-time by the .NET CLR.
Properties of SeekStorm’s LSMT-IVF:
- Hybrid RAM/SSD storage
- unlimited scaling for a large number of documents (billion-scale)
- incremental indexing (streaming, instead of requiring the whole corpus upfront for training)
- concurrent indexing, updating and searching
- full CRUD support (updatable index vs. an immutable index that requires re-training and re-indexing the whole corpus)
- full real-time persistence with ACID
- SIMD CPU hardware acceleration for dot product and cosine vector calculation
- high indexing speed, independent of corpus size (linear instead of quadratic construction/learning)
- low query latency (<10 ms) with high recall for billions of documents
- true real-time search
- faceted search filtering for vector search (value and range facets)
- integrated document and object store
- agnostic to the vector data source, with arbitrary dimensions/features
- selectable similarity metrics
- pluggable clustering algorithms
- pluggable dimensionality reduction and compression algorithms
SeekStorm's new vector index is currently experimental 🧪 and not yet in production.