mnoGoSearch can use
a number of different formats (modes) to store
word information in the database, suitable for
different purposes. The available modes
are: single,
multi and blob.
The default mode is blob.
The mode can be selected using the
DBMode part of
the DBAddr command
in indexer.conf and
search.htm.
Examples:
DBAddr mysql://localhost/test/?DBMode=single
DBAddr mysql://localhost/test/?DBMode=multi
DBAddr mysql://localhost/test/?DBMode=blob
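As a rough illustration of how the DBMode parameter sits inside the DBAddr URL, the following Python sketch extracts it from a configuration line and falls back to blob when it is absent. The function name dbmode_from_dbaddr is hypothetical and this is not how mnoGoSearch itself parses its configuration; it only assumes that DBMode is an ordinary URL query parameter, as the examples above suggest.

```python
from urllib.parse import urlsplit, parse_qs

VALID_MODES = {"single", "multi", "blob"}

def dbmode_from_dbaddr(line: str) -> str:
    """Return the DBMode named in a 'DBAddr <url>' line (hypothetical helper).

    Falls back to 'blob', the documented default, when DBMode is not given.
    """
    command, _, url = line.strip().partition(" ")
    if command != "DBAddr":
        raise ValueError("not a DBAddr line")
    # DBMode is assumed to be a plain query-string parameter of the URL.
    query = urlsplit(url.strip()).query
    mode = parse_qs(query).get("DBMode", ["blob"])[-1]
    if mode not in VALID_MODES:
        raise ValueError("unknown DBMode: %s" % mode)
    return mode

print(dbmode_from_dbaddr("DBAddr mysql://localhost/test/?DBMode=multi"))
print(dbmode_from_dbaddr("DBAddr mysql://localhost/test/"))
```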
The single mode is suitable for a small site
with the total number of documents up to 5000.
When the single mode is specified,
all words are stored in a single table dict
with three columns (url_id,word,coord),
where url_id is the ID
of the document, referencing the rec_id
field in the table url, and coord
is a combination of the section ID and
position of the words in the section.
The word column has the varchar(32) SQL type.
Every appearance of the same word in a document produces a separate record in the table.
The advantage of the single mode is support for live
updates - a document updated by indexer
becomes immediately visible for searches with its new content.
In other words, crawling and
indexing are done at the same time,
for every document individually.
Another advantage of the single mode is its
simplicity and straightforward data format. You can use
mnoGoSearch as a
fulltext solution for your database-driven Web application.
For example, you may find it useful to create a simple search page
which will query the data collected by indexer
this way:
SELECT
url.url, count(*) AS hits
FROM
dict, url
WHERE
url.rec_id=dict.url_id
AND
dict.word IN ('some','words')
GROUP BY
url.url
ORDER BY
hits DESC;
and display the results of this search query.
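To try this kind of query without a full installation, here is a minimal Python sketch that runs it against an in-memory SQLite database. The two tables are reduced to the columns the query needs, which is a simplification and not the real mnoGoSearch schema; the URLs and coord values are made-up placeholders. The count alias is called hits here because RANK is a reserved word in MySQL 8.0 and later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the url and dict tables (assumed columns only).
conn.executescript("""
    CREATE TABLE url  (rec_id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE dict (url_id INTEGER, word VARCHAR(32), coord INTEGER);
""")
conn.executemany("INSERT INTO url VALUES (?, ?)", [
    (1, "http://example.com/a.html"),
    (2, "http://example.com/b.html"),
])
# One row per word occurrence; coord values are placeholders.
conn.executemany("INSERT INTO dict VALUES (?, ?, ?)", [
    (1, "some", 1), (1, "words", 2), (1, "some", 3),
    (2, "words", 1), (2, "other", 2),
])

QUERY = """
    SELECT url.url, count(*) AS hits
    FROM dict, url
    WHERE url.rec_id = dict.url_id
      AND dict.word IN (?, ?)
    GROUP BY url.url
    ORDER BY hits DESC
"""
rows = conn.execute(QUERY, ("some", "words")).fetchall()
for url, hits in rows:
    print(url, hits)
# prints:
# http://example.com/a.html 3
# http://example.com/b.html 1
```

Because every occurrence of a word is a separate row, count(*) acts as a crude relevance score: documents that mention the query words more often come first.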
The multi mode is suitable for a medium size Web space
with up to about 50000 documents. It can be
useful if your documents are updated very often.
If the multi mode is selected, word information
is distributed into 256 separate tables
dict00..dictFF using a hash
function for distribution. The structure of these
tables is close to the table dict used
in the single mode: (url_id,secno,word,coords).
The difference is that all positions of the same word (hits)
in a section of a document are grouped into a single binary array
coords, instead of producing multiple records.
Word information for different sections is stored in separate records.
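The grouping and distribution described above can be sketched in Python as follows. This is only an illustration under stated assumptions: CRC32 stands in for the hash function, which is not specified here, and the coords array is packed as little-endian 32-bit integers, which may differ from the real binary layout. The helper names table_for and group_hits are hypothetical.

```python
import struct
import zlib
from collections import defaultdict

def table_for(word):
    """Pick one of dict00..dictFF. CRC32 is a stand-in, not mnoGoSearch's hash."""
    return "dict%02X" % (zlib.crc32(word.encode("utf-8")) & 0xFF)

def group_hits(url_id, hits):
    """Group (secno, word, position) hits of one document into table rows.

    Returns {table_name: [(url_id, secno, word, coords_blob), ...]}, with one
    row per (section, word) pair instead of one row per occurrence.
    """
    grouped = defaultdict(list)
    for secno, word, pos in hits:
        grouped[(secno, word)].append(pos)
    tables = defaultdict(list)
    for (secno, word), positions in sorted(grouped.items()):
        # Assumed encoding: positions as little-endian unsigned 32-bit ints.
        coords = struct.pack("<%dI" % len(positions), *sorted(positions))
        tables[table_for(word)].append((url_id, secno, word, coords))
    return dict(tables)

demo = group_hits(7, [(1, "search", 1), (2, "search", 1),
                      (2, "search", 5), (2, "engine", 2)])
for table, rows in sorted(demo.items()):
    print(table, rows)
```

In this sketch the four hits collapse into three rows: "search" in section 2 becomes a single row holding both positions, while "search" in section 1 stays a separate row.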
Similar to the single mode,
the multi mode supports live updates.
That is, crawling and indexing are done at the same time.
A new document (or an updated document) becomes available
for search very soon after indexer
has crawled it.
When working in the multi mode,
indexer performs caching
of the word information in memory for better
crawling performance. The word cache is
flushed to the database as soon as it grows
to the size given in WordCacheSize,
which is 8 MB by default.
You can change WordCacheSize to
a bigger value for better crawling performance.
Note:
The disadvantage of too large a WordCacheSize
value is that if indexer
crashes or dies for any other reason, all cached information is lost.
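The trade-off can be pictured with a small Python sketch of a size-bounded write cache. It is a simplified model, not the actual indexer code: the class name WordCache, the per-entry size estimate, and the flush callback are all assumptions made for illustration.

```python
class WordCache:
    """Toy model of an in-memory word cache flushed at a size threshold."""

    def __init__(self, flush_fn, max_bytes=8 * 1024 * 1024):
        self.flush_fn = flush_fn      # called with the list of cached entries
        self.max_bytes = max_bytes    # plays the role of WordCacheSize
        self.entries = []
        self.size = 0

    def add(self, url_id, secno, word, pos):
        self.entries.append((url_id, secno, word, pos))
        # Rough per-entry estimate (assumption): word bytes + 12 bytes overhead.
        self.size += len(word.encode("utf-8")) + 12
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.entries:
            self.flush_fn(self.entries)
        self.entries = []
        self.size = 0

written = []                          # stands in for the database
cache = WordCache(written.extend, max_bytes=40)
for pos in range(1, 5):
    cache.add(1, 1, "abcd", pos)      # 16 estimated bytes per entry
print(len(written), len(cache.entries))   # prints: 3 1
```

After four insertions, three entries have reached the "database" and one is still held only in memory. A larger threshold means fewer, bigger writes, but also more entries in that second group, which is exactly what would be lost in a crash.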
Grouping word hits into the same record and distributing them
between multiple tables make the multi
mode much faster for both search and indexing compared
to the single mode.
The blob mode is the fastest mode currently
available in mnoGoSearch for both purposes: indexing and searching.
This mode can handle up to 1,000,000 - 2,000,000
documents on a single machine.
DBMode=blob is known to work fine with
DB2,
Mimer,
MS SQL,
MySQL,
PostgreSQL,
Oracle,
Sybase,
Firebird/Interbase,
SQLite3.
In the blob mode crawling and indexing are done separately.
Crawling is done by starting indexer without
any command line arguments. At crawling time indexer
collects word information into the table bdicti
with a structure optimized for crawling purposes, but not suitable for
search purposes.
After crawling is done, an extra step is required to
create the search index by launching
indexer -Eblob. When creating
the search index, indexer
loads information from the table bdicti,
groups all hits of the same word in different documents together
and writes the grouped data into the table bdict
with a structure optimized for search purposes.
The table bdict consists of three columns
(word, secno, intag),
where intag is
a binary array which includes information about all documents this
word appears in (using 32-bit IDs of the documents),
as well as positions of the word in every document (for phrase search).
The table bdict has an index on the column
word for fast look-up at search time.
Words from different sections (e.g. title
and body) are written in separate records.
Note:
Separate records for different sections are needed to optimize
searches with section limits, for example "find only in title".
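The conversion from the crawler table to the search table is essentially an inversion: per-document rows are regrouped into per-word rows. The Python sketch below illustrates the idea only. The input row shape (url_id, secno, word, positions) and the intag layout (document ID, hit count, then positions, all little-endian 32-bit) are assumptions, not the real bdicti or bdict formats, and build_bdict is a hypothetical name.

```python
import struct
from collections import defaultdict

def build_bdict(bdicti_rows):
    """Regroup per-document rows into one (word, secno, intag) row per word.

    bdicti_rows: iterable of (url_id, secno, word, positions).
    """
    grouped = defaultdict(list)
    for url_id, secno, word, positions in bdicti_rows:
        grouped[(word, secno)].append((url_id, sorted(positions)))
    out = []
    for (word, secno), docs in sorted(grouped.items()):
        parts = []
        for url_id, positions in sorted(docs):
            # Assumed layout: 32-bit doc ID, 32-bit hit count, then positions.
            parts.append(struct.pack("<II", url_id, len(positions)))
            parts.append(struct.pack("<%dI" % len(positions), *positions))
        out.append((word, secno, b"".join(parts)))
    return out

crawled = [
    (1, 1, "search", [1]),       # document 1, section 1 (e.g. title)
    (2, 2, "search", [4, 9]),    # document 2, section 2 (e.g. body)
    (1, 2, "search", [3]),       # document 1, section 2
]
index = build_bdict(crawled)
for word, secno, intag in index:
    print(word, secno, len(intag), "bytes")
```

With this layout a query for "search" restricted to section 1 reads a single row and never touches the body hits, which is the point of keeping sections in separate records.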
Also, additional arrays of data are written into the table
bdict:
#rec_id - a list of 32-bit document IDs
#last_mod_time - an array of 32-bit
Last-Modified values
(in Unix timestamp format) -
for fast limiting searches by date.
#pop_rank - an array of popularity rank values,
each in the 32-bit float format.
#site_id - an array of 32-bit
site IDs, for GroupBySite.
#limit#name - a list of document IDs
covered by a user defined limit with name "name".
A separate #limit#xxx record is created
for every user defined Limit configured
in indexer.conf.
#ts - the timestamp indicating when
indexer -Eblob was last executed,
in textual representation, using the Unix timestamp format.
This value is used for invalidating old queries stored
in the search result cache, as well as for searches
with live updates, described in
the Section called Live updates emulator with DBMode=blob.
#version - a string representing
the version ID of
indexer which created
the search index. For example, indexer
from mnoGoSearch 3.3.0 writes
the string "30300". This record
makes upgrades easier by letting
a newer version of search.cgi
recognize an older format.
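To make the role of these auxiliary records more concrete, here is a small Python sketch of how arrays of 32-bit values could be packed, unpacked, and used for a date limit. Several points are assumptions rather than documented facts: little-endian byte order, #rec_id and #last_mod_time being parallel arrays in the same document order, and the version scheme major*10000 + minor*100 + patch, which is inferred only from the single "30300" example. All function names are hypothetical.

```python
import struct

def pack_u32(values):
    """Pack integers as an array of little-endian 32-bit values (assumed order)."""
    return struct.pack("<%dI" % len(values), *values)

def unpack_u32(blob):
    """Inverse of pack_u32."""
    return list(struct.unpack("<%dI" % (len(blob) // 4), blob))

def ids_modified_since(rec_id_blob, last_mod_blob, since):
    """Date limit over two arrays assumed to be parallel (same document order)."""
    rec_ids = unpack_u32(rec_id_blob)
    times = unpack_u32(last_mod_blob)
    return [r for r, t in zip(rec_ids, times) if t >= since]

def version_id(major, minor, patch):
    """Version string in the style of the '30300' example (inferred scheme)."""
    return "%d%02d%02d" % (major, minor, patch)

rec_id = pack_u32([10, 11, 12])
last_mod = pack_u32([1200000000, 1300000000, 1400000000])
print(ids_modified_since(rec_id, last_mod, 1250000000))  # prints: [11, 12]
print(version_id(3, 3, 0))                               # prints: 30300
```

Keeping such values in compact arrays means a date limit can be applied by scanning one binary record instead of joining against the url table for every candidate document.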
Note that creating a fast search index is also possible for
the databases using DBMode=single and
DBMode=multi.
This is useful when you need to quickly switch to
DBMode=blob when
search performance with the other modes has become poor -
without even having to re-index
your Web space. Later you can completely
switch to DBMode=blob in both
indexer.conf and
search.htm, and run indexing
from the very beginning.
The disadvantage of DBMode=blob is that
it does not support live updates directly. New or updated documents,
crawled by indexer
are not visible for search until indexer -Eblob
is run again. Creating the search index takes about 6
minutes on a collection of 200,000
HTML documents with a total size of 10 GB
(on an Intel Core Duo 2.13GHz CPU),
which can be unacceptably long for some applications
(for example, on a news site,
or when using mnoGoSearch
as an external full-text engine for SQL
tables with the help of HTDB).
Starting from version 3.3.1,
mnoGoSearch emulates
live updates by reading
word information for the new or updated documents
directly from the crawler table bdicti.
This allows adding or updating up to about 10,000
documents without having to run indexer -Eblob.
To activate live updates,
add the LiveUpdates=yes parameter
to the DBAddr command in search.htm.
Example:
DBAddr mysql://root@localhost/test/?DBMode=blob&LiveUpdates=yes
Starting with version 3.3.0,
indexer -Eblob can be used in combination
with URL and Tag limits,
other limits described in the Section called Subsection control in Chapter 3,
as well as in combination with a user defined limit described
by a Limit command.
The limits allow generating a search index over a subset of
the documents collected by indexer
at crawling time.
Examples:
indexer -Eblob -u %/subdir/%
indexer -Eblob -t tag
indexer -Eblob --fl=limitname
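The % characters in the -u example look like SQL LIKE wildcards. Assuming that is how the pattern is interpreted, the following Python sketch shows which documents such a limit would select, using an in-memory SQLite table with made-up URLs. It only illustrates the matching rule and does not reproduce what indexer does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for the url table; the URLs are invented examples.
conn.execute("CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany("INSERT INTO url VALUES (?, ?)", [
    (1, "http://example.com/index.html"),
    (2, "http://example.com/subdir/page1.html"),
    (3, "http://example.com/other/page2.html"),
    (4, "http://example.org/docs/subdir/page3.html"),
])

def urls_matching(pattern):
    """Return the URLs selected by an SQL LIKE pattern, ordered by rec_id."""
    cur = conn.execute(
        "SELECT url FROM url WHERE url LIKE ? ORDER BY rec_id", (pattern,))
    return [row[0] for row in cur]

selected = urls_matching("%/subdir/%")
for u in selected:
    print(u)
# prints:
# http://example.com/subdir/page1.html
# http://example.org/docs/subdir/page3.html
```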
Starting with version 3.2.36,
an additional command is available: indexer -Erewriteurl.
When indexer is launched with this parameter
it rewrites URL data for DBMode=blob.
It can be useful to rewrite URL data quickly
without having to rebuild the entire search index, for example
if you added the Deflate=yes parameter to
DBAddr, or after running
indexer -n0 -R to update the
Popularity Rank.
Starting from version 3.3.0,
mnoGoSearch enumerates
word positions for every section separately and
can store information about up to 2 million
words per section.
In versions prior to 3.3.0
it was possible to store up to 64K words
from a single document.
The
single,
multi and
blob modes support substring search.
An SQL query containing
a LIKE predicate is executed
internally in order to do substring search. Substring
search is usually slower than searching for a full word,
especially for very short substrings.
You can use the SubstringMatchMinWordLength
command to set the minimum word length allowed for substring search.
Note:
When performing substring search in the multi
mode, search.cgi has to iterate
search queries through all 256 tables
dict00..dictFF,
which makes substring search especially slow. Using
substring search is not recommended with DBMode=multi.
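The cost of substring search in the multi mode can be seen in the following Python sketch, which uses an in-memory SQLite database with 256 empty-ish tables. It is a simplified model: the table columns follow the description of the multi mode, the sample words and their table placement are arbitrary, and the min_len argument merely mimics the intent of SubstringMatchMinWordLength. The real queries issued by search.cgi may differ.

```python
import sqlite3

# Table names as described for DBMode=multi: dict00..dictFF
TABLES = ["dict%02X" % i for i in range(256)]

def make_demo_db():
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(
            "CREATE TABLE %s (url_id INTEGER, secno INTEGER, "
            "word VARCHAR(32), coords BLOB)" % table)
    # Arbitrary placement; a real installation would use its hash function.
    conn.execute("INSERT INTO dict0A VALUES (?, ?, ?, ?)", (1, 1, "searching", b""))
    conn.execute("INSERT INTO dictF3 VALUES (?, ?, ?, ?)", (2, 1, "research", b""))
    conn.execute("INSERT INTO dict10 VALUES (?, ?, ?, ?)", (3, 1, "engine", b""))
    return conn

def substring_search(conn, fragment, min_len=3):
    """Return sorted document IDs containing a word that includes fragment.

    A full word hashes to one table, but a fragment can match words in any
    table, so all 256 tables have to be queried.
    Note: % and _ inside fragment are not escaped in this sketch.
    """
    if len(fragment) < min_len:
        return []
    pattern = "%" + fragment + "%"
    found = set()
    for table in TABLES:
        sql = "SELECT DISTINCT url_id FROM %s WHERE word LIKE ?" % table
        found.update(row[0] for row in conn.execute(sql, (pattern,)))
    return sorted(found)

conn = make_demo_db()
print(substring_search(conn, "search"))  # prints: [1, 2]
print(substring_search(conn, "se"))      # prints: []
```

Each call issues 256 LIKE queries, and a pattern with a leading % generally cannot use an index on word, so every one of those queries tends to be a table scan.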