The mifluz classes WordDBCompress and WordBitCompress do the compression and decompression work. From the list of keys stored in a page, WordDBCompress extracts several lists of numbers; each list has common statistical properties that allow good compression.
The WordDBCompress_compress_c and WordDBCompress_uncompress_c functions are C callbacks invoked by the page compression code in BerkeleyDB. These callbacks in turn call the WordDBCompress compress and uncompress methods. WordDBCompress creates a WordBitCompress object that acts as a buffer holding the compressed stream.
Compression algorithm.
Most DB pages contain redundant data because mifluz chose to store one word occurrence per entry. Because of this choice, the pages have a very simple structure.
Here is a real-world example of what a page can look like (key structure: word identifier + 4 numerical fields, one entry per line):

756 1 4482 1 10b
756 1 4482 1 142
756 1 4484 1 40
756 1 449f 1 11e
756 1 4545 1 11
756 1 45d3 1 545
756 1 45e0 1 7e5
756 1 45e2 1 830
756 1 45e8 1 545
756 1 45fe 1 ec
756 1 4616 1 395
756 1 461a 1 1eb
756 1 4631 1 49
756 1 4634 1 48
.... etc ....
To compress, we chose to code only the differences between adjacent entries. A flag is stored for each entry indicating which fields have changed. When a field differs from the previous entry's, the compression stores the difference, which is likely to be small since the entries are sorted.
The basic idea is to build columns of numbers, one per field, and then compress each column individually. The first and second columns compress very well since all their values are the same. The third column also compresses well since the differences between adjacent numbers are small, yielding a small set of distinct values.