Fallacies of index compression in search
Photo by Sergio Jimenez
Posting list compression is an important building block of every search engine. It reduces index size and query latency, as it reduces the amount of data to be loaded from disk or ram (memory bandwidth). Compression algorithms like roaring bitmaps compress the document ID and are effective on large posting lists with many docid.
But two things are often overlooked:
Word frequencies in a text or corpus are distributed according to Zipf’s law. That means they are long-tail with low-frequency terms are dominating the corpus. And those low-frequency terms with short posting lists with a single or only are few docid are barely compressible with traditional methods. Then suddenly the size of posting list metadata such as term, docid count, pointers, etc. becomes dominating, while being negligible for long posting lists. As short posting lists are dominating the index, efficient data structures for meta data are paramount and are heavily influencing the overall compression rate.
The size of position data, required for phrase intersection is usually at least as large, as the docid list. That’s intuitive because for every docid we need to store at least one position, often more. Position lists are relatively short, as the number of positions is limited by the number of words within a document, while docid posting lists are limited by the number of documents in the corpus. Position lists being short result in their low compressibility. Posting list consists of both docids and positions per docid, where only docids are reasonably compressible, but position data being at least of the same size but much less compressible. Therefore the impressive compression rates for docid lists published in academic papers don’t apply to the posting list as a whole.
When looking at compression efficacy as opposed to compression rate we have to distinguish:
- The index size and its compressibility are dominated by the long tail distribution of all terms in the index.
- For the query latency only the distribution of terms within the query is decisive, with the most frequent term with the longest posting list in the query dominating the query latency