8.1.2 Page compression in Mifluz

The mifluz classes WordDBCompress and WordBitCompress do the compression and decompression work. From the list of keys stored in a page, the compression code extracts several lists of numbers. Each list has common statistical properties that allow good compression.

The WordDBCompress_compress_c and WordDBCompress_uncompress_c functions are C callbacks that are called by the page compression code in Berkeley DB. The C callbacks then call the WordDBCompress compress/uncompress methods. WordDBCompress creates a WordBitCompress object that acts as a buffer holding the compressed stream.

Compression algorithm.

Most DB pages contain redundant data because mifluz stores one word occurrence per entry. As a result, the pages have a very simple structure.

Here is a real-world example of what a page can look like (key structure: word identifier followed by 4 numerical fields):

     756     1 4482    1  10b
     756     1 4482    1  142
     756     1 4484    1   40
     756     1 449f    1  11e
     756     1 4545    1   11
     756     1 45d3    1  545
     756     1 45e0    1  7e5
     756     1 45e2    1  830
     756     1 45e8    1  545
     756     1 45fe    1   ec
     756     1 4616    1  395
     756     1 461a    1  1eb
     756     1 4631    1   49
     756     1 4634    1   48
     .... etc ....

To compress a page, we chose to encode only the differences between adjacent entries. A flag stored for each entry indicates which fields have changed. When a field differs from its value in the previous entry, the compression stores the difference, which is likely to be small since the entries are sorted.
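
The flag-plus-difference scheme can be illustrated with a minimal Python sketch. It is not the mifluz implementation: the function names delta_encode and delta_decode are hypothetical, the page values are assumed to be hexadecimal, the flag is assumed to be a plain bitmask with one bit per field, and differences are kept as signed integers for simplicity (the actual on-disk encoding in WordDBCompress may differ).

```python
def delta_encode(entries):
    """Encode each entry as (flags, diffs) relative to the previous entry.

    flags is a bitmask: bit i is set when field i changed.
    diffs holds one signed difference per changed field, in field order.
    The first entry is returned unchanged so decoding has a starting point.
    """
    first = entries[0]
    encoded = []
    prev = first
    for cur in entries[1:]:
        flags = 0
        diffs = []
        for i, (p, c) in enumerate(zip(prev, cur)):
            if c != p:
                flags |= 1 << i
                diffs.append(c - p)
        encoded.append((flags, diffs))
        prev = cur
    return first, encoded


def delta_decode(first, encoded):
    """Rebuild the original entries from the output of delta_encode."""
    entries = [first]
    prev = list(first)
    for flags, diffs in encoded:
        it = iter(diffs)
        # Consume one stored difference for each field whose flag bit is set.
        cur = [p + next(it) if flags & (1 << i) else p
               for i, p in enumerate(prev)]
        entries.append(tuple(cur))
        prev = cur
    return entries


# First four entries of the example page (values assumed hexadecimal).
page = [
    (0x756, 0x1, 0x4482, 0x1, 0x10b),
    (0x756, 0x1, 0x4482, 0x1, 0x142),
    (0x756, 0x1, 0x4484, 0x1, 0x40),
    (0x756, 0x1, 0x449f, 0x1, 0x11e),
]

first, encoded = delta_encode(page)
restored = delta_decode(first, encoded)
```

In this sketch the second entry reduces to a flag with a single bit set and one small difference, since only the last field changed. Note that the last field can decrease when an earlier field increases, which is why the sketch keeps signed differences.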

The basic idea is to build columns of numbers, one for each field, and then compress them individually. One can see that the first and second columns will compress very well since all the values are the same. The third column will also compress well since the differences between successive values are small, so only small numbers need to be encoded.
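
The column view can be sketched in a few lines of Python, using the entries of the example page shown earlier. This is an illustration only: the helper names to_columns and column_deltas are hypothetical, the values are assumed to be hexadecimal, and the sketch stops at computing the differences rather than reproducing the bit-level coding done by WordBitCompress.

```python
def to_columns(entries):
    """Transpose a list of entries into one list of numbers per field."""
    return [list(col) for col in zip(*entries)]


def column_deltas(col):
    """Differences between successive values in one column."""
    return [b - a for a, b in zip(col, col[1:])]


# The example page from this section (values assumed hexadecimal).
page = [
    (0x756, 0x1, 0x4482, 0x1, 0x10b),
    (0x756, 0x1, 0x4482, 0x1, 0x142),
    (0x756, 0x1, 0x4484, 0x1, 0x40),
    (0x756, 0x1, 0x449f, 0x1, 0x11e),
    (0x756, 0x1, 0x4545, 0x1, 0x11),
    (0x756, 0x1, 0x45d3, 0x1, 0x545),
    (0x756, 0x1, 0x45e0, 0x1, 0x7e5),
    (0x756, 0x1, 0x45e2, 0x1, 0x830),
    (0x756, 0x1, 0x45e8, 0x1, 0x545),
    (0x756, 0x1, 0x45fe, 0x1, 0xec),
    (0x756, 0x1, 0x4616, 0x1, 0x395),
    (0x756, 0x1, 0x461a, 0x1, 0x1eb),
    (0x756, 0x1, 0x4631, 0x1, 0x49),
    (0x756, 0x1, 0x4634, 0x1, 0x48),
]

cols = to_columns(page)
deltas = [column_deltas(col) for col in cols]

# Bits needed for the largest raw value and the largest difference
# in the third column.
raw_bits = max(cols[2]).bit_length()
delta_bits = max(deltas[2]).bit_length()
```

For this page the differences in the first two columns are all zero, and every difference in the third column fits in far fewer bits than the raw values, which is the property the per-column compression exploits.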