mnoGoSearch 3.3.12 reference manual: Full-featured search engine software
Chapter 4. Extended indexing features

Table of Contents
News extensions
Creating an MP3 search engine
Indexing SQL tables (htdb:/ virtual URL scheme)
Indexing a program output (exec:/ and cgi:/ virtual URL schemes)
Mirroring
Dumping and restoring the search database

News extensions

Creating an MP3 search engine

mnoGoSearch has a built-in parser for MP3 files. It can extract the Album, the Artist, the Song as well as the Year MP3 tags from an MP3 file. You can create a full-featured MP3 search engine using mnoGoSearch.

MP3 indexer.conf commands

To activate indexing of MP3 tags, use the CheckMP3 and CheckMP3Only commands in indexer.conf, and enable processing of the MP3 sections (they are disabled by default). This is an example of an indexer.conf file with MP3-related commands:


Section MP3.Song               21    128
Section MP3.Album              22    128
Section MP3.Artist             23    128
Section MP3.Year               24    128
CheckMP3 *.mp3
Hrefonly *
With the above configuration, indexer will check all *.mp3 files for MP3 tags, and will collect new links from other file types without indexing.

When you use the CheckMP3 command, indexer downloads only 128 bytes from the files with the given extension(s) to detect and parse MP3 tags.

Note: indexer downloads MP3 files efficiently from FTP servers, as well as from HTTP servers supporting HTTP/1.1 protocol with the Range request header, to request partial content. Old HTTP servers not supporting the Range HTTP header may not work well together with mnoGoSearch.

Restricting search to a certain MP3 section

If you want to restrict searches by Author, Album, Song or Year, you can use the standard mnoGoSearch ways to restrict searches described in the Section called Changing weights of the different document parts at search time in Chapter 10 and the Section called Restricting search words to a section in Chapter 10. For example, if you want to restrict search by song and author name, you use the standard mnoGoSearch way to specify sections: Song: help Author:Beatles.

With the default sections given in indexer.conf-dist, you may find it useful to add this HTML form element to search.htm to restrict the search area:


Search in:
<SELECT NAME="wf">
  <OPTION VALUE="111100000000000000000000" SELECTED="$(wf)">All MP3 sections</OPTION>
  <OPTION VALUE="000100000000000000000000" SELECTED="$(wf)">MP3 Song name</OPTION>
  <OPTION VALUE="001000000000000000000000" SELECTED="$(wf)">MP3 Album</OPTION>
  <OPTION VALUE="010000000000000000000000" SELECTED="$(wf)">MP3 Artist</OPTION>
  <OPTION VALUE="100000000000000000000000" SELECTED="$(wf)">MP3 Year</OPTION>
</SELECT>
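
The wf values in the form above can also be generated rather than typed by hand. The following Python sketch assumes that the rightmost character of the wf string corresponds to section 1, which is what the values above imply for sections 21 to 24. The function name wf_mask and its parameters are hypothetical and are not part of mnoGoSearch.

```python
def wf_mask(sections, total=24, weight="1"):
    """Build a wf string with `weight` at each listed section number.

    Assumption: the rightmost character stands for section 1, so
    section N sits N characters from the right-hand end.
    """
    chars = ["0"] * total
    for number in sections:
        if not 1 <= number <= total:
            raise ValueError(f"section {number} is outside 1..{total}")
        chars[total - number] = weight  # count positions from the right
    return "".join(chars)


print(wf_mask([21]))              # MP3.Song only
print(wf_mask([21, 22, 23, 24]))  # all four MP3 sections
```

If you renumber the MP3 sections in indexer.conf, the values in the form have to be regenerated to match.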

Indexing SQL tables (htdb:/ virtual URL scheme)

mnoGoSearch can index SQL tables with long text columns with the help of the so-called htdb:/ virtual URL scheme.

Using the htdb:/ virtual scheme, you can build a full-text index for your SQL tables as well as index your database driven Web servers.

Note: The table you want to index with HTDB must have a PRIMARY KEY or a UNIQUE INDEX.

HTDB indexer.conf commands

HTDB is implemented using the following indexer.conf commands: HTDBAddr, HTDBList, HTDBLimit, HTDBDoc.

The purpose of the HTDBAddr command is to specify a database connection string. It uses the same syntax as DBAddr. If no HTDBAddr command is specified, the data will be fetched using the connection specified in DBAddr.

The HTDBList command is used to specify an SQL query which generates a list of documents using either absolute or relative URL notation, for example:


HTDBList "SELECT CONCAT('htdb:/',id) FROM messages"
or

HTDBList "SELECT id FROM messages"

Note: HTDBList can also return non-htdb URLs. This gives another way to use HTDB: you can store a list of "real URLs" (e.g. HTTP-style URLs) in the database and fetch them with the help of HTDB.


HTDBList "SELECT url FROM mytable"
Server urllist htdb:/
Realm page *

The SQL query given in HTDBList is used for all documents having the '/' sign at the end of the URL. This query is analogous to a file system directory listing.

The HTDBLimit command is used to specify the maximum number of records fetched by a single SELECT query given in the HTDBList command. HTDBLimit helps to reduce memory consumption when indexing large SQL tables. For example:


HTDBLimit 512

The HTDBDoc command specifies an SQL query to get a single document from the database using its PRIMARY KEY value. The HTDBDoc query is executed for all HTDB documents not having the '/' sign at the end of their URL.

An SQL query given in the HTDBDoc command must return a single-row result. If the HTDBDoc query returns an empty set or multiple records, the HTDB retrieval system generates an HTTP 404 Not Found response. This can happen at re-indexing time if the record was deleted from the table since the last re-indexing. You can use HoldBadHrefs 0 to remove the deleted records from the mnoGoSearch tables as well.

mnoGoSearch understands three types of HTDBDoc SQL queries.

  • A single-column result with a fully formatted HTTP response, including the standard HTTP response status line. See the Section called HTTP response codes mnoGoSearch understands in Chapter 3 to learn how indexer handles various HTTP status codes. An HTDBDoc SQL query can also optionally include HTTP headers understood by indexer, such as Content-Type, Last-Modified, Content-Encoding and other headers. So you can build a very flexible indexing system by returning different HTTP status codes and headers.

    Example:

    
HTDBDoc "SELECT CONCAT(\
    'HTTP/1.0 200 OK\\r\\n',\
    'Content-type: text/plain\\r\\n',\
    '\\r\\n',\
    msg) \
    FROM messages WHERE id='$1'"
    

  • A multiple-column result, with the status line starting with the "HTTP/" substring at the beginning of the first column. All columns are concatenated using Carriage-Return + New-Line (\r\n) delimiters to generate an HTTP-like response. The first column returning an empty string is treated as the delimiter between the headers and the content part of the HTTP response, and is replaced with "\r\n\r\n". This type of query is a simpler form of the previous type: it avoids concatenation operators and functions, as well as explicit "\r\n" header delimiters.

    Example:

    
HTDBDoc "SELECT 'HTTP/1.0 200 OK','Content-type: text/plain','',msg \
    FROM messages WHERE id='$1'"
    

  • A single- or a multiple-column result without the "HTTP/" header. This is the simplest HTDBDoc response type. The SQL column names returned by the query are associated with the Section names configured in indexer.conf.

    Example:

    
    Section body  1 256
    Section title 2 256
    HTDBDoc "SELECT title, body FROM messages WHERE id='$1'"
    

    In this example, the values of the columns title and body are associated with the sections title and body respectively.

    The columns named status and last_mod_time have a special meaning: the HTTP status code and the document modification time respectively. The status should be an integer code according to HTTP notation, and the modification time should be in Unix timestamp format, that is, the number of seconds since January 1, 1970.

    Example:

    
HTDBDoc "SELECT title, body, \
    CASE WHEN messages.deleted THEN 404 ELSE 200 END as status,\
    timestamp as last_mod_time FROM messages WHERE id='$1'"
    

    The above example demonstrates how to use the special columns. The SQL query will return status "404 Not found" for all documents marked as deleted, which will make indexer remove these documents from the search database when re-indexing the data. Also, this query makes indexer use the column timestamp as the document modification time.

    If a column contains data in HTML format, you can specify the html keyword in the corresponding Section command, which will make indexer apply the HTML parser to this column and therefore remove all HTML tags and comments:

    Example:

    
    Section title      1 256
    Section wiki_text  2 16000 html
    HTDBDoc "SELECT title, wiki_text FROM messages WHERE id='$1'"
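
The multiple-column HTDBDoc form described in the list above (the result whose first column starts with "HTTP/") can be pictured with a short Python sketch. It only illustrates the documented rule and is not mnoGoSearch's own code. The function name columns_to_response is hypothetical, and joining any columns that follow the empty one with "\r\n" is an assumption.

```python
def columns_to_response(columns):
    """Turn one result row (a list of strings) into an HTTP-like response.

    Columns are joined with CRLF. The first empty column marks the end
    of the headers and becomes the blank line before the content.
    Assumption: columns after the empty one are also joined with CRLF.
    """
    try:
        split = columns.index("")
    except ValueError:
        # No empty column: everything is joined as is.
        return "\r\n".join(columns)
    headers = "\r\n".join(columns[:split])
    body = "\r\n".join(columns[split + 1:])
    return headers + "\r\n\r\n" + body


row = ["HTTP/1.0 200 OK", "Content-type: text/plain", "", "Message text"]
print(repr(columns_to_response(row)))
```

The output has the same shape as the response built with CONCAT in the single-column example.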
    

HTDB variables

The path parts of a URL can be passed as parameters to the HTDBList and HTDBDoc SQL queries. The parts are referred to as $1, $2, ... $N, where the number represents the N-th path part, that is, the part of the URL after the N-th slash sign:


htdb:/part1/part2/part3/part4/part5
         $1    $2    $3    $4    $5

For example, you have this indexer.conf command:


HTDBList "SELECT id FROM catalog WHERE category='$1'"

When mnoGoSearch prepares to fetch a document with the URL htdb:/cars/, $1 will be replaced with "cars":


SELECT id FROM catalog WHERE category='cars'
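
The substitution can be mimicked with a few lines of Python. This is a sketch for checking what a query template expands to, not the way indexer implements it. The helper names are hypothetical, no SQL escaping is performed, and replacing a missing part with an empty string is an assumption.

```python
import re


def htdb_path_parts(url):
    """Split an htdb:/ URL into its path parts ($1, $2, ...)."""
    prefix = "htdb:/"
    if not url.startswith(prefix):
        raise ValueError("not an htdb:/ URL: " + url)
    # Drop empty pieces so a trailing '/' does not create an extra part.
    return [part for part in url[len(prefix):].split("/") if part]


def expand_query(template, url):
    """Replace $1, $2, ... in an SQL template with the URL path parts."""
    parts = htdb_path_parts(url)

    def substitute(match):
        index = int(match.group(1))
        # Assumption: a missing part expands to an empty string.
        return parts[index - 1] if 1 <= index <= len(parts) else ""

    return re.sub(r"\$(\d+)", substitute, template)


print(expand_query("SELECT id FROM catalog WHERE category='$1'", "htdb:/cars/"))
```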

You can use long URLs to pass multiple parameters into both HTDBList and HTDBDoc queries. For example:


HTDBList "SELECT column4 FROM table WHERE column1='$1' AND column2='$2' and column3='$3'"
HTDBDoc  "SELECT title, body FROM table WHERE column1='$1' AND column2='$2' AND column3='$3' AND column4='$4'"
Server htdb:/path1/path2/path3/
Using multiple parameters helps to refer to a certain record using parts of a compound PRIMARY KEY or UNIQUE INDEX.

Using multiple HTDB sources

It's possible to index multiple HTDB sources using multiple HTDBList, HTDBDoc and Server commands in the same indexer.conf.


Section body  1 256
Section title 2 256

HTDBList "SELECT id FROM t1"
HTDBDoc  "SELECT title, body FROM t1 WHERE id=$2"
Server htdb:/t1/

HTDBList "SELECT id FROM t2"
HTDBDoc  "SELECT title, body FROM t2 WHERE id=$2"
Server htdb:/t2/

HTDBList "SELECT id FROM t3"
HTDBDoc  "SELECT title, body FROM t3 WHERE id=$2"
Server htdb:/t3/

Using mnoGoSearch as an external SQL full-text engine

With the help of the htdb:/ scheme you can quickly create a full-text index and use it in your SQL application. Imagine you have a large SQL table which stores Web board messages in plain text format, and you want to add search functionality to your Web board. Say the messages are stored in the table messages with two columns, id and msg, where id is an integer PRIMARY KEY and msg is a long text column containing the messages. A usual SQL LIKE search may take a very long time to return a result:


SELECT id, msg FROM messages WHERE msg LIKE '%someword%'

With the help of the htdb:/ scheme provided by mnoGoSearch you can create a full-text index on the table messages. To do so, edit your indexer.conf as follows:


DBAddr mysql://foo:bar@localhost/mnogosearch/?dbmode=single

Section msg 1 256

HTDBAddr mysql://foofoo:barbar@localhost/database/
HTDBList "SELECT id FROM messages"
HTDBDoc "SELECT msg FROM messages WHERE id='$1'"
Server htdb:/

When started, indexer will insert the URL htdb:/ into the database and will execute the SQL query given in HTDBList, which will produce the values 1, 2, 3,..., N in the result. The values will be interpreted as links relative to htdb:/. A list of new URLs in the form htdb:/1, htdb:/2, ..., htdb:/N will be added into the database. Then the HTDBDoc SQL query will be executed for every added URL. HTDBDoc will return the column msg as the document content, which will be associated with the section msg and parsed. Word information will be stored in the table dict (assuming the single storage mode).

After indexing is done, you can use mnoGoSearch tables to perform search:


SELECT url.url 
FROM url,dict 
WHERE dict.url_id=url.rec_id 
AND dict.word='someword';

The table dict has an index on the column word, so the above query will be executed much faster than the queries using the LIKE operator on the table messages.

You can also use multiple words in search:


SELECT url.url, count(*) as c 
FROM url,dict
WHERE dict.url_id=url.rec_id 
AND dict.word IN ('some','word')
GROUP BY url.url
ORDER BY c DESC;

Both queries will return htdb:/XXX values from the url.url field. Then your application can cut the "htdb:/" prefix from the returned values to get the PRIMARY KEY values from the table messages.
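
On the application side, stripping the prefix takes only a few lines. The Python sketch below assumes an integer PRIMARY KEY, as in the messages table of this example. The function name htdb_url_to_key is hypothetical.

```python
def htdb_url_to_key(url, prefix="htdb:/"):
    """Convert a URL such as 'htdb:/42' back to the integer key 42."""
    if not url.startswith(prefix):
        raise ValueError("unexpected URL: " + url)
    return int(url[len(prefix):])


# URLs as they might come back from the url.url column.
found = ["htdb:/7", "htdb:/42", "htdb:/1000"]
print([htdb_url_to_key(u) for u in found])
```

The resulting keys can then be used in an ordinary SELECT against the table messages.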

Indexing a database driven Web server

You can also use HTDB to index your database driven Web server. It allows you to index your documents without having to invoke the Web server at indexing time, which should require less CPU resources than direct HTTP indexing and therefore should offload the Web server machine.

The main idea of indexing a database driven Web server is to map HTTP requests into HTDB requests at indexing time. So indexer will fetch the source data directly from the SQL database, while search.cgi will return real URLs in the usual HTTP notation. This can be achieved using the aliasing mechanisms provided by mnoGoSearch.

Take a look at the sample file doc/samples/htdb.conf, which is included in the mnoGoSearch source distribution. It is the indexer.conf file used to index the Web board at the mnoGoSearch site.

The HTDBList command generates URLs in the form:


http://www.mnogosearch.org/board/message.php?id=XXX

where XXX is a PRIMARY KEY value from the table messages.

For every PRIMARY KEY value a fully formatted HTTP response is generated, containing a text/html document with headers and this content:


<HTML>
<HEAD>
<TITLE>Subject goes here</TITLE>
<META NAME="Description" Content="Author name goes here">
</HEAD>
<BODY>
Message text goes here
</BODY>
</HTML>

At the end of doc/samples/htdb.conf you can find these commands:


Server htdb:/
Realm  http://www.mnogosearch.org/board/message.php?id=*
Alias  http://www.mnogosearch.org/board/message.php?id=  htdb:/

The first command tells indexer to execute the HTDBList query, which generates a list of messages in the form:


http://www.mnogosearch.org/board/message.php?id=XXX

The second command tells indexer to allow messages matching the given pattern using string match with the '*' wildcard at the end.

The third command replaces the substring http://www.mnogosearch.org/board/message.php?id= in the URL with htdb:/ before a message is downloaded, which forces indexer to use the SQL table as the data source for a document instead of sending an HTTP request to the Web server.

After indexing is done, search.cgi will display search result using the usual HTTP notation, for example: http://www.mnogosearch.org/board/message.php?id=1000
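
The effect of the Alias command in this configuration can be sketched in Python as a plain prefix replacement. This models only the behaviour described in this section and is not the actual implementation. The function name apply_alias is hypothetical, and matching only at the start of the URL is an assumption.

```python
def apply_alias(url, find, replace):
    """Return the URL indexer would fetch from, given one Alias rule.

    Assumption: the rule matches only at the start of the URL.
    The original URL is what gets stored and shown in search results.
    """
    if url.startswith(find):
        return replace + url[len(find):]
    return url


original = "http://www.mnogosearch.org/board/message.php?id=1000"
print(apply_alias(original,
                  "http://www.mnogosearch.org/board/message.php?id=",
                  "htdb:/"))
```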

Indexing a program output (exec:/ and cgi:/ virtual URL schemes)

mnoGoSearch offers the special virtual URL methods exec:/ and cgi:/. These methods allow using the output of an external program as a source for indexing. mnoGoSearch can work with any executable program that writes its result to STDOUT. The result must conform to the HTTP standard and include full HTTP response headers (the HTTP status line and at least the Content-Type HTTP response header) followed by the document content.

For example, when indexing both cgi:/usr/local/bin/myprog and exec:/usr/local/bin/myprog, indexer will execute the /usr/local/bin/myprog program.
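
As an illustration of the required output format, a minimal external program could look like the Python sketch below. The function name http_response and the sample text are hypothetical. The only parts taken from the description above are the status line, the Content-Type header and the blank line before the content.

```python
import sys


def http_response(body, content_type="text/plain"):
    """Build a complete HTTP-style response for indexer to read."""
    return (
        "HTTP/1.0 200 OK\r\n"
        "Content-Type: " + content_type + "\r\n"
        "\r\n"                      # blank line ends the headers
        + body
    )


# Everything written to STDOUT is what indexer would receive.
sys.stdout.write(http_response("Text produced by an external program\n"))
```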

Passing parameters to the cgi:/ virtual scheme

When executing a program given in a cgi:/ URL, indexer emulates the environment the program would run in under an HTTP server. It creates the REQUEST_METHOD=GET environment variable, and the QUERY_STRING variable according to the HTTP standards. For example, if cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e is being indexed, indexer creates QUERY_STRING with the value a=b&d=e.

The cgi:/ virtual URL scheme allows indexing your site without having to invoke a web server, even if you want to index CGI scripts. For example, suppose you have a web site with static documents under /usr/local/apache/htdocs/ and CGI scripts under /usr/local/apache/cgi-bin/. You can use the following configuration:


Server http://localhost/
Alias  http://localhost/cgi-bin/  cgi:/usr/local/apache/cgi-bin/
Alias  http://localhost/    file:///usr/local/apache/htdocs/

Passing parameters to the exec:/ virtual scheme

In the case of an exec:/ URL, indexer does not create the QUERY_STRING variable; instead, it passes all parameters on the command line. For example, when indexing exec:/usr/local/bin/myprog?a=b&d=e, this command will be executed:


/usr/local/bin/myprog "a=b&d=e" 
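
The difference between the two schemes can be summarised in a small Python sketch that maps a virtual URL to the command line and environment variables described above. It is a model of the documented behaviour only. The function name plan_invocation is hypothetical, and any environment variables not mentioned in this section are left out.

```python
def plan_invocation(url):
    """Return (argv, env) for a cgi:/ or exec:/ virtual URL."""
    scheme, _, rest = url.partition(":")
    if scheme not in ("cgi", "exec"):
        raise ValueError("not a cgi:/ or exec:/ URL: " + url)
    path, _, query = rest.partition("?")
    if scheme == "cgi":
        # cgi:/ passes parameters through the CGI environment.
        return [path], {"REQUEST_METHOD": "GET", "QUERY_STRING": query}
    # exec:/ passes everything after '?' as one command line argument.
    return [path] + ([query] if query else []), {}


print(plan_invocation("cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e"))
print(plan_invocation("exec:/usr/local/bin/myprog?a=b&d=e"))
```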

Using the exec:/ virtual scheme as an external retrieval system

The exec:/ virtual scheme can be used as an external retrieval system. It allows using protocols which are not supported natively by mnoGoSearch. For example, you can use the curl program, which is available from http://curl.haxx.se/, to index HTTPS sites when mnoGoSearch is compiled without built-in HTTPS support.

Save this short script as /usr/local/mnogosearch/etc/curl.sh, the path used in the Alias command below, and make it executable:


#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null

This script takes a URL given as a command line parameter and executes curl to download it. The -i argument tells curl to output the result together with the HTTP response headers.

Add these commands into indexer.conf:


Server https://some.https.site/
Alias  https://  exec:/usr/local/mnogosearch/etc/curl.sh?https://

When indexing https://some.https.site/path/to/page.html, indexer will translate this URL to


exec:/usr/local/mnogosearch/etc/curl.sh?https://some.https.site/path/to/page.html

then execute the curl.sh script:


/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/path/to/page.html"

and load its output for indexing.

Note: indexer loads up to MaxDocSize bytes when executing an exec:/ or cgi:/ URL.

Mirroring

Creating a mirror

mnoGoSearch supports some mirroring functionality. To enable mirroring, use the MirrorRoot command to specify the path where indexer will create the mirrors of your sites. For example:


MirrorRoot /path/to/mirror

You can also configure indexer to store HTTP headers on disk. This can be helpful if you want to use the local mirror for quick reindexing of the remote site. Use the MirrorHeadersRoot command to activate storing the HTTP headers. For example:


MirrorHeadersRoot /path/to/headers

Note: MirrorRoot and MirrorHeadersRoot can point to the same directory.

Note: indexer does not download more than MaxDocSize bytes of any document. If a document is larger, it will be only partially downloaded. Make sure that MaxDocSize is large enough if you want to use the mirror created by indexer as a real site mirror.

Using a mirror as a crawler cache

mnoGoSearch can use a previously created mirror as a crawler cache. It can be useful when you do experiments with mnoGoSearch to find the best configuration: you modify your indexer.conf, then clear the database and index the same sites again. To reduce Internet traffic you can activate loading documents from the mirror using the MirrorPeriod command. For example:


MirrorPeriod 2h

MirrorPeriod specifies the period of time during which indexer considers the local mirrored copy of a document valid. If indexer finds that the local mirrored copy is fresh enough, it will not download the same document again and will use the local copy instead. If the local copy is older than MirrorPeriod allows, indexer will download the document from its original location again and update the locally mirrored copy.

If MirrorHeadersRoot is not specified and therefore the original HTTP headers are not available, then indexer will detect Content-Type of a document using the AddType commands.

The parameter MirrorPeriod should be in the form: xxxA[yyyB[zzzC]], where xxx, yyy, zzz are numbers (can be negative!). Spaces are allowed between xxx and A and yyy and so on. A, B, C can be one of the following:


    s - second
    M - minute
    h - hour
    d - day
    m - month
    y - year

Note: The letters are similar to the descriptors understood by the strptime() and strftime() C functions.

Examples:


15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and 6 months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any characters, it is assumed that the time is given in seconds.
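
To check what a given period expression adds up to, the format can be modelled with a short Python sketch. This is not the parser used by indexer. The function name parse_period is hypothetical, and the lengths chosen for a month (30 days) and a year (365 days) are assumptions, since this section does not state them.

```python
import re

# Seconds per unit; the month and year lengths are assumptions.
UNIT_SECONDS = {
    "s": 1,
    "M": 60,
    "h": 3600,
    "d": 86400,
    "m": 30 * 86400,
    "y": 365 * 86400,
}

# One signed number followed by one unit letter, spaces allowed.
_TOKEN = re.compile(r"\s*([+-]?\d+)\s*([sMhdmy])")


def parse_period(text):
    """Convert a period such as '4h30M' or '1h-10M+1s' to seconds."""
    text = text.strip()
    if re.fullmatch(r"[+-]?\d+", text):
        return int(text)  # a bare number is taken as seconds
    position = 0
    total = 0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError("cannot parse period: " + text)
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0:
        raise ValueError("empty period")
    return total


for example in ("15s", "4h30M", "1h-10M+1s", "7200"):
    print(example, "=", parse_period(example), "seconds")
```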

Note: If you start mirroring with an already existing database, indexer will not create the mirror immediately because of the traffic optimization method described in the Section called Crawling time optimization in Chapter 3. You can run indexer -am once to turn off optimization, or clear the database using indexer -C and then run indexer without any arguments.

Dumping and restoring the search database

Dumping the search database

It is possible to dump and restore a mnoGoSearch SQL database using standard tools supplied with the database software, such as mysqldump or pg_dump. This approach works fine in case of a single SQL database.

However, if you use multiple SQL databases to store mnoGoSearch data, or use the mnoGoSearch cluster solution and want to redistribute data among more SQL databases (say, when adding a new machine to the cluster), or want to reduce the number of separate SQL databases (say, when removing a machine from the cluster), the standard method of dumping and restoring SQL data will not work because of conflicts in auto-generated values (auto_increment values, SEQUENCE values, IDENTITY values and so on).

Starting from version 3.3.9, mnoGoSearch includes dump and restore tools that work around this problem.

Note: As of version 3.3.9, the mnoGoSearch dump and restore tools work only with MySQL. Support for other databases will be added in future releases.

In order to create a dump of your mnoGoSearch database, you can run:

indexer -Edumpdata > dumpfile.sql
or pipe data to gzip:

indexer -Edumpdata | gzip > dumpfile.sql.gz
to reduce the dump size.

The dump file created by indexer -Edump is a usual SQL dump file which does not include auto-generated values. A fragment of a dump file for a MySQL database looks like this:


--seed=39
INSERT INTO url (...all columns except rec_id...) VALUES (...);
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'body','Modules Directives FAQ...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'CachedCopy','eNrtWc1v2zgWv+ev...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Charset','utf-8');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Language','en');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Type','text/html');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'title','Apache HTTP Server Ver...');
INSERT INTO bdicti VALUES(last_insert_id(),1,0x6B6F00011EC296170000726577726974696E6700017E4D,0...');
The dump file consists of chunks of INSERT instructions for every document. The structure of the dump file forces MySQL to assign a new auto-increment value for the column url.rec_id and use this value to insert data into the child tables urlinfo and bdicti at restore time.

Additionally, every chunk contains the comment --seed=xxx, which is used to distribute data properly among multiple databases at restore time.

By default, indexer -Edump dumps data from all databases specified in indexer.conf file. You can use the -D command line argument to dump data from a certain database only. For example:


indexer -Edump -D2
will dump data from the database described by the second DBAddr command in indexer.conf.

Restoring the search database

To restore a search database from a dump file, use:


indexer -Esql -v2 < dumpfile.sql
or in case of .gz file:

zcat dumpfile.sql.gz | indexer -Esql -v2
indexer will load the data back into the SQL database. If you have two or more DBAddr commands in the current indexer.conf file, indexer will also properly distribute the data among the corresponding SQL databases.