Vector search vs. Keyword search - War of the worlds vs. we come in peace
Photo by Mateusz Wacławek
This is part 1 of a 4-part blog post series on Vector vs. keyword search:
- War of the worlds vs. we come in peace
- Data structures and algorithms
- LSMT-IVF for Billion-Scale Approximate Nearest Neighbor Search
Vector vs. keyword search 1: War of the worlds vs. we come in peace
Vector search lived in the shadows for quite some time. Only with the advent of machine learning and embeddings did it become popular, by making it possible to search for semantic meaning instead of keywords. But is semantic search (vector index) indeed a replacement or successor for the older keyword search (inverted index), as some suggest, or is it rather complementary?
Searching has been deeply engraved in human nature since the beginning of time. People have been searching for food, for water, for oil, for fish, for enemies, for treasures, for wealth and for love. Some things have changed, some things always stay true. Today we are also searching for data, information and meaning.
Luckily, the wisdom of the human species evolves over time, and new solutions are found and added to its toolbelt. Sometimes, in the joy and ecstasy over a new technology and all the possibilities that a shiny new toy offers, we tend to dismiss and abandon our trusty old tools. This excitement for new things shouldn’t lure us into overlooking their shortcomings, or fool us into dismissing the benefits of older, time-tested technology.
It is often a mistake to assume that old means outdated and obsolete. More often than not the contrary is true: a solution that was found early and has been used for a long time is THE obvious, natural, logical, time-tested and mature solution to a problem. Nobody would seriously dismiss the ingenuity and usefulness of the wheel in modern times only because it was invented around 4000 BC. On the other hand, how many novel and seemingly ingenious inventions have vanished again after a short time of glory? HD DVD, MiniDisc, Betamax, LaserDisc, Zip disk, pager anyone?
Instead of ‘newer is better than older’, often both solutions have their own right to exist individually, as both are optimal in their specific domain. And sometimes instead of ‘either/or’, the combination of both solutions is better than either alone.
As search consists of both exact and semantic aspects of information retrieval, it makes sense to combine both algorithms. Neither should be a mere add-on to the other; the architecture should support both use cases naturally and equally.
Let’s look at the two dominant search architectures - and well - schools of thought: keyword search (inverted index) and semantic search (vector search).
If you search for exact results like proper names, numbers, license plates, domain names, and phrases, then keyword search is your friend. Vector search, on the other hand, will bury the exact result that you are looking for among a myriad of results that are only loosely related. At the same time, if you don’t know the exact terms, or you are interested in a broader topic, meaning or synonym, no matter what exact terms are used, then keyword search will fail you.
- high indexing speed (for large document numbers)
- medium index size
- high query speed (for large document numbers)
- good scaling (for large document numbers)
- perfect precision (for exact keyword match)
- recall: perfect for exact keyword match, low for semantic meaning
- unable to capture meaning and similarity
- efficient and lossless for exact keyword and phrase search
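To make the trade-offs above concrete, here is a minimal sketch of an inverted index. It is not taken from any particular search engine; the function names (`tokenize`, `build_inverted_index`, `search_all_terms`) and the sample documents are assumptions chosen for illustration. It shows why exact terms are found precisely while synonyms are missed.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a real engine would do much more)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample documents.
docs = {
    1: "red car with license plate ABC123",
    2: "blue automobile for sale",
    3: "license renewal form",
}
index = build_inverted_index(docs)

print(search_all_terms(index, "ABC123"))         # exact match: only document 1
print(search_all_terms(index, "license plate"))  # both terms required: only document 1
print(search_all_terms(index, "automobile"))     # document 2 only; "car" in document 1 is missed
```

The last query illustrates the weak spot: the index knows nothing about "car" and "automobile" meaning the same thing, so recall for semantic meaning is low.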
Vector search is perfect if you don’t know the exact query terms, or you are interested in a broader topic, meaning or synonym, no matter what exact query terms are used. But if you are looking for exact terms, e.g. proper names, numbers, license plates, domain names, and phrases, then you should always use keyword search. Vector search will bury the exact result that you are looking for among a myriad of results that are only loosely related.
Vector search enables you to search not only for similar text, but also for similar images, faces or fingerprints, and it enables you to do magic things like queen - woman + man = king.
- slower indexing speed (for large document numbers)
- large index size
- slower query speed (for large document numbers)
- limited scaling (for large document numbers)
- lower precision (for exact keyword match)
- recall: high for semantic meaning (80/90%), medium for exact keyword match
- able to capture meaning and similarity
- inefficient and lossy for exact keyword and phrase search
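The vector arithmetic mentioned above can be sketched with a few hand-made toy vectors. Everything here is an assumption for illustration: the two dimensions (roughly "royalty" and "gender"), the values, and the helper names `cosine`, `nearest` and `analogy`. Real embeddings come from a trained model and have hundreds of dimensions. The search below is an exact brute-force scan; large vector indexes typically use approximate nearest neighbor search instead, which is where the recall below 100% comes from.

```python
import math

# Hand-made 2-d toy "embeddings": (royalty, gender). Purely illustrative values.
EMBEDDINGS = {
    "king": (0.9, 0.9),
    "queen": (0.9, -0.9),
    "man": (0.1, 0.9),
    "woman": (0.1, -0.9),
    "monarch": (1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, k=3, exclude=()):
    """Exact (brute-force) nearest neighbors by cosine similarity."""
    scored = [
        (word, cosine(query_vec, vec))
        for word, vec in EMBEDDINGS.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def analogy(a, b, c):
    """Compute vector(a) - vector(b) + vector(c)."""
    return tuple(
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )

target = analogy("queen", "woman", "man")
best_word, best_score = nearest(target, k=1, exclude=("queen", "woman", "man"))[0]
print(best_word)  # king
```

Note that every query returns the k closest vectors, whether or not they are a good match. That is why an exact term can end up buried among merely related results.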
Tip of the iceberg
Often when we discuss search we talk about diverse algorithms and architectures. But we overlook that the architecture is only the tip of the iceberg, and that the true hard work is in the implementation. As they say, success is 10 percent inspiration and 90 percent perspiration.
You can implement a search based on an inverted index within an hour. The same is true for vector search. Yes, building a search engine is easy, so there are a lot of them, and all are really fast … as long as you have only 1000 documents indexed and only a single concurrent searcher. The hard part is scaling: searching thousands of indices with billions of documents, with thousands of concurrent users, and still returning results within milliseconds on a single machine.
And both keyword search and vector search share the same challenges: query latency, indexing speed, and scaling to huge numbers of documents or concurrent users.
Also, a search engine consists of many more components than just the core index architecture: parsing, stemming, spelling correction, auto completion, query rewriting, near duplicate detection, instant search, real-time search, reliability, geographic distribution, result scoring, result sorting, result filtering, result grouping etc.
And not surprisingly, the solutions to overcome those challenges are almost identical for keyword search and vector search.
Combine and conquer
By combining keyword search (inverted index) and semantic approximate nearest neighbor search (vector index), we combine precision and completeness of results, speed of indexing and search, and unlimited scaling for large document repositories and concurrent user numbers with an understanding of meaning, concepts, similarity and synonyms. That allows us to cover all aspects of search and their diverse needs for different users, domains and circumstances.
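This post does not prescribe how the two result lists should be merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method meant here. The function name, the document ids and the constant `k=60` (a common default for RRF) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranked list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))

# Hypothetical result lists, best hit first.
keyword_hits = ["doc_plate", "doc_license"]
vector_hits = ["doc_car", "doc_plate", "doc_auto"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_plate', 'doc_car', 'doc_license', 'doc_auto']
```

The document found by both searches ranks first, while results found by only one of them are still kept, so neither exact matches nor semantically related hits are lost.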
Keyword & vector search can work together in peace and combine their strengths 💪!