First of all, mifluz is at beta stage.
This program is part of the GNU project, released under the aegis of GNU.
The purpose of mifluz is to provide a C++ library to store a full-text
inverted index. To put it briefly, it allows storage of occurrences of
words in such a way that they can later be searched. The basic idea of
an inverted index is to associate each unique word with a list of
the documents in which it appears. This list can then be searched to locate
the documents containing a specific word.
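
The idea of an inverted index can be sketched in a few lines of C++. This is
only an in-memory illustration built on the standard library, and it does not
use or reflect the mifluz API: the class name TinyInvertedIndex and its
functions add_document and lookup are hypothetical names chosen for this
example, and splitting words on whitespace is a simplifying assumption.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical illustration type; NOT part of the mifluz API.
class TinyInvertedIndex {
public:
    // Record each whitespace-separated word of `text` as occurring in `doc_id`.
    void add_document(int doc_id, const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            postings_[word].insert(doc_id);
        }
    }

    // Return the identifiers of the documents containing `word`
    // (an empty set if the word was never indexed).
    std::set<int> lookup(const std::string& word) const {
        auto it = postings_.find(word);
        if (it == postings_.end()) {
            return {};
        }
        return it->second;
    }

private:
    // Each unique word maps to the set of documents in which it appears.
    std::map<std::string, std::set<int>> postings_;
};
```

After indexing document 1 with "the quick fox" and document 2 with "the lazy
fox", lookup("fox") yields documents 1 and 2, while lookup("quick") yields
document 1 only. Keeping everything in memory like this is what stops working
at the scale discussed next.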
Implementing a library that manages an inverted index is a very easy
task when there is a small number of words and documents. It becomes a
lot harder when dealing with a large number of words and
documents. mifluz has been designed with the following upper limits
in mind: 500 million documents, 100 giga words, and 18 million document
updates per day. In its present state, mifluz can
store 100 giga words using 600 giga bytes. The best average insertion
rate observed as of today is 4000 keys/sec on a 1 giga byte index.
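
To put these figures in perspective, the arithmetic below derives two rough
numbers from them. It is a back-of-the-envelope sketch, not a measurement: it
assumes that "giga" means 10^9 and that updates are spread evenly over a
24-hour day, and the constant and function names are made up for this example.

```cpp
#include <cstdint>

// Figures quoted in the text; "giga" is taken as 10^9 (an assumption).
constexpr std::uint64_t kWords = 100'000'000'000ULL;       // 100 giga words
constexpr std::uint64_t kIndexBytes = 600'000'000'000ULL;  // 600 giga bytes
constexpr std::uint64_t kUpdatesPerDay = 18'000'000ULL;    // document updates per day
constexpr std::uint64_t kSecondsPerDay = 24ULL * 60ULL * 60ULL;

// Average index space consumed by one word occurrence, in bytes.
constexpr std::uint64_t bytes_per_word() {
    return kIndexBytes / kWords;
}

// Sustained document update rate implied by the daily limit,
// rounded down to a whole number of updates per second.
constexpr std::uint64_t updates_per_second() {
    return kUpdatesPerDay / kSecondsPerDay;
}

static_assert(bytes_per_word() == 6, "600 giga bytes / 100 giga words");
static_assert(updates_per_second() == 208, "18 million / 86400, rounded down");
```

Under these assumptions, each word occurrence costs about 6 bytes of index
space, and the daily update limit corresponds to roughly 208 document updates
per second.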
mifluz has two main characteristics: it is very simple (one
might say stupidly simple :-) and its index takes up 100% of the size of the
indexed text. It is simple because it provides only a few basic
functions. It does not contain document parsers (HTML, PDF,
etc.). It does not contain a full-text query parser. It does not
provide result display functions or other user-friendly features. It only
provides functions to store word occurrences and retrieve them. The fact
that the index uses 100% of the size of the indexed text is rather
atypical. Most well-known full-text indexing systems use only 30%. The
advantage mifluz has over most full-text indexing systems is that
it is fully dynamic (update, delete, insert), uses only a controlled
amount of memory while resolving a query, has higher upper limits, and has a
simple storage scheme. These advantages come at the cost of more disk space.