10.23.5 mifluz CONFIGURATION
The format of the configuration file read by WordContext::Initialize is:
keyword: value
Comments may be added on lines starting with a #. The default
configuration file is read from the file pointed to by the MIFLUZ_CONFIG
environment variable, or ~/.mifluz, or /etc/mifluz.conf, in this order.
If no configuration file is available, builtin defaults are used.
Here is an example configuration file:
wordlist_extend: true
wordlist_cache_size: 10485760
wordlist_page_size: 32768
wordlist_compress: 1
wordlist_wordrecord_description: NONE
wordlist_wordkey_description: Word/DocID 32/Flags 8/Location 16
wordlist_monitor: true
wordlist_monitor_period: 30
wordlist_monitor_output: monitor.out,rrd
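The lookup order and the keyword: value format can be illustrated with a short Python sketch. This is not mifluz code: the function names find_config and parse_config are hypothetical, and the handling of whitespace, malformed lines, and a MIFLUZ_CONFIG value that points to a missing file are assumptions made for the example.

```python
import os

def find_config(environ=None, exists=os.path.exists):
    """Return the first configuration file found, or None (builtin defaults)."""
    environ = os.environ if environ is None else environ
    candidates = []
    if environ.get("MIFLUZ_CONFIG"):
        candidates.append(environ["MIFLUZ_CONFIG"])
    candidates.append(os.path.expanduser("~/.mifluz"))
    candidates.append("/etc/mifluz.conf")
    for path in candidates:
        # Assumption: a candidate that does not exist is skipped.
        if exists(path):
            return path
    return None

def parse_config(text):
    """Parse 'keyword: value' lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        keyword, sep, value = stripped.partition(":")
        if not sep:
            continue  # Assumption: lines without a colon are ignored.
        settings[keyword.strip()] = value.strip()
    return settings

example = """\
# a few lines from the example configuration file
wordlist_extend: true
wordlist_cache_size: 10485760
wordlist_wordkey_description: Word/DocID 32/Flags 8/Location 16
"""
settings = parse_config(example)
```

All values are kept as strings here; converting them to booleans or integers is left to the caller.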
- ‘wordlist_allow_numbers {true|false} (default false)’
-
A digit is considered a valid character within a word if this
configuration parameter is set to true; otherwise it is an error to
insert a word containing digits. See the Normalize method for more
information.
- ‘wordlist_cache_inserts {true|false} (default false)’
-
If true, all Insert calls are cached in memory. When the WordList object
is closed or a different access method is called, the cached entries are
flushed to the inverted index.
- ‘wordlist_cache_max <bytes> (default 0)’
-
Maximum cumulative size of the cache files generated when doing bulk
insertion with the BatchStart() function. When this limit is reached,
the cache files are all merged into the inverted index. The value 0
means no size limit.
See WordList(3) for the rationale behind cache file handling.
- ‘wordlist_cache_size <bytes> (default 500K)’
-
Berkeley DB cache size (see Berkeley DB documentation). The cache makes
a huge difference in performance. It must be at least 2% of the expected
total data size. Note that if compression is activated, the data size is
eight times larger than the actual file size. In this case the cache
must be scaled to 2% of the data size, not 2% of the file size. See
Cache tuning in the mifluz guide for more hints.
See WordList(3) for the rationale behind cache file handling.
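The sizing rule above reduces to simple arithmetic. The following Python sketch works it through, taking the 2% floor and the factor of eight from the text; the function name min_cache_size is hypothetical and the result is only a lower bound, not a tuned value.

```python
COMPRESSION_RATIO = 8    # data size / file size when compression is activated
MIN_CACHE_PERCENT = 2    # cache must be at least 2% of the data size

def min_cache_size(file_size_bytes, compressed):
    """Smallest wordlist_cache_size (in bytes) allowed by the 2% rule."""
    data_size = file_size_bytes
    if compressed:
        # The 2% applies to the uncompressed data, not to the file on disk.
        data_size *= COMPRESSION_RATIO
    # Integer ceiling of data_size * 2 / 100.
    return (data_size * MIN_CACHE_PERCENT + 99) // 100

uncompressed_minimum = min_cache_size(1_000_000_000, compressed=False)
compressed_minimum = min_cache_size(1_000_000_000, compressed=True)
```

For a 1,000,000,000 byte index file this gives 20,000,000 bytes without compression and 160,000,000 bytes with compression.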
- ‘wordlist_compress {true|false} (default false)’
-
Activate compression of the index. The resulting index is eight times
smaller than the uncompressed index.
- ‘wordlist_env_dir <directory> (default .)’
-
Only valid if wordlist_env_share is set to true. Specifies the directory
in which the sharable environment will be created. All inverted indexes
specified with a non-absolute pathname will be created relative to this
directory.
- ‘wordlist_env_share {true|false} (default false)’
-
If true, a sharable environment is opened, or created if none exists.
- ‘wordlist_env_skip {true|false} (default false)’
-
If true, no environment is created at all. This must never be used if a
WordList object is created. It may be useful if only WordKey objects are
used, for instance.
- ‘wordlist_extend {true|false} (default false)’
-
If true, maintain a reference count of unique words. The Noccurrence
method gives access to this count.
- ‘wordlist_locale <locale> (default C)’
-
Set the locale of the program to locale. See setlocale(3) for more
information.
- ‘wordlist_lowercase {true|false} (default true)’
-
If a word contains upper case letters it is converted to lowercase
if this configuration parameter is true, otherwise it is left
untouched.
- ‘wordlist_maximum_word_length <number> (default 25)’
-
The maximum length of a word.
See the Normalize method for more information.
- ‘wordlist_minimum_word_length <number> (default 3)’
-
The minimum length of a word.
See the Normalize method for more information.
- ‘wordlist_monitor {true|false} (default false)’
-
If true, create a WordMonitor instance to gather statistics and build
reports.
- ‘wordlist_monitor_output <file>[,{rrd|readable}] (default stderr)’
-
Print reports to file instead of the default stderr. If the optional
type after the comma is set to rrd, the output is fit for the
benchmark-report script. Otherwise it is a (hardly :-) readable string.
- ‘wordlist_monitor_period <sec> (default 0)’
-
If the value sec is a positive integer, set a timer to print reports
every sec seconds. The timer is set using the ALRM signal and will fail
if the calling application already has a handler on that signal.
- ‘wordlist_page_size <bytes> (default 8192)’
-
Berkeley DB page size (see Berkeley DB documentation)
- ‘wordlist_truncate {true|false} (default true)’
-
If a word is longer than wordlist_maximum_word_length, it is truncated
if this configuration parameter is true; otherwise it is considered an
invalid word.
- ‘wordlist_valid_punctuation [characters] (default none)’
-
A list of punctuation characters that may appear in a word.
These characters will be removed from the word before insertion
in the index.
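The word-related parameters (digits, case, minimum and maximum length, truncation, punctuation) can be combined in a small Python sketch. This is an illustration only, not the Normalize method itself: the function name normalize is hypothetical, the order in which the rules are applied is an assumption, and rejecting words shorter than the minimum length is also an assumption. The default values are the ones documented in this section.

```python
def normalize(word, *, allow_numbers=False, lowercase=True,
              maximum_word_length=25, minimum_word_length=3,
              truncate=True, valid_punctuation=""):
    """Return the normalized word, or None if the word is rejected."""
    if lowercase:
        word = word.lower()
    # Punctuation allowed inside words is removed before insertion.
    word = "".join(c for c in word if c not in valid_punctuation)
    if not allow_numbers and any(c.isdigit() for c in word):
        return None  # digits are not valid unless explicitly allowed
    if len(word) > maximum_word_length:
        if not truncate:
            return None  # too long and truncation is disabled
        word = word[:maximum_word_length]
    if len(word) < minimum_word_length:
        return None  # assumption: too-short words are rejected
    return word

examples = {
    "plain": normalize("Hello"),
    "digits": normalize("abc123"),
    "punct": normalize("e-mail", valid_punctuation="-"),
    "long": normalize("a" * 30),
}
```

With the defaults, "Hello" becomes "hello", "abc123" is rejected, and a 30-character word is cut to 25 characters.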
- ‘wordlist_verbose <number> (default 0)’
-
Set the verbosity level of the WordList class.
1 walk logic
2 walk logic details
3 walk logic lots of details
- ‘wordlist_wordkey_description <desc> (no default)’
-
Describe the structure of the inverted index key. In the following
explanation of the <desc> format, Word is a mandatory literal, while
name and bits are values that must be replaced.
Word bits/name bits[/...]
The name is an alphanumerical symbolic name for the key field. The bits
is the number of bits required to store this field. Note that all values
are stored in unsigned integers (unsigned int).
Example:
Word 8/Document 16/Location 8
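A key description string can be taken apart with a few lines of Python. This is a sketch, not the parser used by the library: the function name parse_wordkey_description is hypothetical. It also accepts a field without a bit count, as in the Word/DocID 32/Flags 8/Location 16 line of the example configuration file; treating the bit count as optional is an assumption.

```python
def parse_wordkey_description(desc):
    """Split a key description into a list of (name, bits) pairs."""
    fields = []
    for part in desc.split("/"):
        tokens = part.split()
        if len(tokens) == 2:
            name, bits = tokens[0], int(tokens[1])
        elif len(tokens) == 1:
            # Assumption: a field given without a bit count is accepted.
            name, bits = tokens[0], None
        else:
            raise ValueError("malformed field: %r" % part)
        fields.append((name, bits))
    if fields[0][0] != "Word":
        raise ValueError("the description must start with the Word field")
    return fields

fields = parse_wordkey_description("Word 8/Document 16/Location 8")
total_bits = sum(bits for _, bits in fields if bits is not None)
```

For the example above, the three fields add up to 32 bits.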
- ‘wordlist_wordkey_document [field ...] (default none)’
-
A white space separated list of field numbers that define a document.
The field number list must not contain gaps. For instance 1 2 3 is
valid but 1 3 4 is not valid.
This configuration parameter is not used by the mifluz library
but may be used by a query application to define the semantics of
a document. In response to a query, the application will return a
list of results in which only distinct documents will be shown.
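The no-gaps rule for the field list is easy to check mechanically. The Python sketch below uses a hypothetical helper name, has_no_gaps, and assumes the field numbers are listed in ascending order.

```python
def has_no_gaps(value):
    """True if the white space separated field numbers are consecutive."""
    numbers = [int(token) for token in value.split()]
    # Each number must be exactly one more than the previous one.
    return all(b - a == 1 for a, b in zip(numbers, numbers[1:]))

valid = has_no_gaps("1 2 3")
invalid = has_no_gaps("1 3 4")
```

Here valid is True and invalid is False, matching the two examples given in the text. An empty list, which corresponds to the default, is accepted.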
- ‘wordlist_wordkey_location <field> (default none)’
-
A single field number that contains the position of a word in a
given document.
This configuration parameter is not used by the mifluz library
but may be used by a query application.
- ‘wordlist_wordrecord_description {NONE|DATA|STR} (no default)’
-
NONE: the record is empty
DATA: the record contains an integer (unsigned int)
STR: the record contains a string (String)