mnoGoSearch 3.3.12 reference manual

Full-featured search engine software


Table of Contents
1. Introduction
mnoGoSearch Features
Where to get mnoGoSearch.
Disclaimer
Authors
Contributors (in no particular order)
Frequently Asked Questions
2. Installing mnoGoSearch
SQL database requirements
Supported operating systems
Tools required for installation
Installing mnoGoSearch
Running search.cgi from inetd / xinetd
Possible installation problems
Creating a binary package
Installation registration
3. Indexing
Indexing in general
Configuration
Creating SQL table structure
Dropping SQL table structure
Running indexer
HTTP redirects
Crawling time optimization
Subsection control
How to clear the database
Database Statistics
Using indexer for site validation
Running multiple indexer instances for crawling
Running indexer with multiple threads
HTTP response codes mnoGoSearch understands
Content-Encoding support
indexer configuration
Specifying the Web space for indexing
Aliases
ServerTable
FlushServerTable
Using syslog
Disabling Apache logging
Cached copies
Configuring cached copies
Using cached copies at search time
Moving cached copies to another machine
Using the original document as a cached copy source
4. Extended indexing features
News extensions
Creating an MP3 search engine
MP3 indexer.conf commands
Restricting search to a certain MP3 section
Indexing SQL tables (htdb:/ virtual URL scheme)
HTDB indexer.conf commands
HTDB variables
Using multiple HTDB sources
Using mnoGoSearch as an external SQL full-text engine
Indexing a database driven Web server
Indexing a program output (exec:/ and cgi:/ virtual URL schemes)
Passing parameters to the cgi:/ virtual scheme
Passing parameters to the exec:/ virtual scheme
Using the exec:/ virtual scheme as an external retrieval system
Mirroring
Creating a mirror
Using a mirror as crawler cache.
Dumping and restoring the search database
Dumping the search database
Restoring the search database
5. mnoGoSearch HTML parser
Tag parser
HTML entities
META tags
Links
Comments
6. External parsers
Supported parser types
Setting up parsers
Preventing indexer from getting stuck on a parser execution
Pipes in a parser command line
Parsers and character sets
The UDM_URL environment variable
External parsers for the most common file types
MS Word (*.doc)
MS Excel (*.xls)
MS PowerPoint (*.ppt)
MS Word 2007 (*.docx)
Rich Text (*.rtf)
Adobe Acrobat (*.pdf)
PostScript (*.ps)
RPM
7. Storing mnoGoSearch data
Word modes with an SQL database
Various modes used to store words
Storage mode - single
Storage mode - multi
Storage mode - blob
Live updates emulator with DBMode=blob
Extended features with DBMode=blob
Maximum amount of words collected from a document
Substring search notes
Cache mode storage
mnoGoSearch performance issues
MySQL performance
Post-indexing optimization
Oracle notes
Introduction
Compilation, Installation and Configuration
IBM DB2 notes
8. Subsections
Categories
Tags
Adding tags
Using tags at search time
Using substring tag match
Multiple selections
Using tags with indexer
9. Multiple languages support
Character sets
Supported character sets
Multiple languages in the same database
Character set conversion
Character set conversion at search time
Character sets aliases
Document character set detection
Automatic character set guesser
The default character set
The default language
Search pages with multi-lingual interface
Installing a multi-lingual interface
How it works
Possible troubles
Segmenters for Chinese, Thai and Japanese languages
Japanese phrase segmenter
Chinese phrase segmenter
Thai phrase segmenter
The CJK phrase segmenter
Indexing multilingual servers
10. Searching documents
Using search front-ends
Performing search
Search parameters
Changing weights of the different document parts at search time
Changing importance of individual query words
Using search.cgi with SSI
Using multiple templates
Advanced Boolean search
Restricting search words to a section
Phrase search
Exact section match
How search handles expired documents
How to write search result templates
Template file structure
Template variable formats
Template sections
Includes in templates
Security issues
Designing search.htm
How the search results page is created
Your HTML
Forms considerations
Relative links in search.htm
Adding a small Search form to the other pages of your site
Template operators
Conditional operators
The SET operator
The COPY operator
Arithmetic operators
Loop operators
Miscellaneous operators and functions
Ranking documents
Commands affecting document score
Relevancy
Analyzing score values
Popularity rank
Crosswords
Tracking search queries
Search results cache
Fuzzy search
Ispell
Synonyms
Dehyphenation
Loading synonyms and word forms from the SQL database
Dumping Ispell data
Transliteration
Searching numbers
Accent insensitive search
11. mnoGoSearch cluster
Introduction
How it works
Operations done on the database machines
What a typical XML response looks like
Operations done on the front-end machine
Cluster types
Installing and configuring a "merge" cluster
Installing and configuring a "distributed" cluster
Using dump/restore tools to add cluster nodes
Cluster limitations
12. Miscellaneous
Environment variables
Using mnoGoSearch as an embedded search library
libmnogosearch
mnoGoSearch API
The udm-config script
MySQL fulltext parser plugin
Database schema
Reporting bugs
Currently known bugs
Core dump reports
I. Reference
I. mnoGoSearch command reference
AddType -- associates file names or extensions with mime types
AddEncoding -- associates file names or extensions with encoding types
Affix -- loads an Ispell affix file
Alias -- associates master and mirror sites
AliasProg -- calls an external URL rewrite program
Allow -- allows indexing documents with the given URL pattern
AlwaysFoundWord -- defines a word that is treated as found in any document
AuthBasic -- defines user name and password for basic HTTP authorization
BrowserCharset -- defines browser character set
Cache (obsolete) -- defines whether to enable search result cache
CaseFolding -- chooses an alternative case mapping
Category -- binds a group of documents to a category
CheckMP3 -- checks for MP3 meta information
CheckMP3Only -- checks for MP3 meta information
CheckOnly -- checks if a document exists
CollectLinks -- defines whether to store links between documents, for popularity rank.
ComplexSynonyms -- defines whether to use phrase-to-word and phrase-to-phrase synonyms
CrawlDelay -- defines the number of seconds to wait between requests to the same server
CrawlerThreads -- sets the number of indexer threads started for crawling
CrossWords -- specifies whether to use crosswords
CustomLog -- enables logging to STDOUT using the given format
CVSIgnore -- defines whether to index internal CVS files
DateFactor -- gives lower score to old documents
DateFormat -- defines date format
DBAddr -- sets the database connection string
DefaultContentType -- defines default Content-Type
Dehyphenate -- enables searching for dehyphenated forms of compound words
DefaultLang -- defines default language
DetectClones -- enables or disables clone detection
Disallow -- disallows indexing defined URLs
DocMemCacheSize -- this command is obsolete
DocSizeWeight -- changes document size impact on the document score
DocTimeOut -- defines maximum amount of time spent to download a document
ExcerptSize -- defines maximal excerpt length
ExcerptStopword -- defines whether to highlight stopwords.
ExcerptPadding -- defines excerpt context length
FlushServerTable -- puts the server.active value in sync with indexer.conf
FollowSymLinks -- defines whether to dereference symlinks
ForceIISCharset1251 -- assumes that Microsoft IIS servers return the windows-1251 character set
GuesserUseMeta -- defines whether to use meta tags for character set detection
GroupBySite -- enables grouping search results by site
HlBeg -- defines left search results highlighting code
HlEnd -- defines right search results highlighting code
HoldBadHrefs -- defines period of time to keep bad documents in the database
HrefOnly -- scans matching documents for links only
HTDBAddr -- describes a connection string to a remote SQL data source
HTDBDoc -- describes a query to fetch a document content from an SQL source
HTDBLimit -- limits the amount of document IDs fetched in a single HTDBList query
HTDBList -- describes a query to fetch document list from an SQL data source
HTTPHeader -- adds a desired header into HTTP requests
IDFFactor -- changes the effect of inverse document frequency
ImportEnv -- imports an environment variable
Include -- includes additional configuration file
Index -- defines whether the document content should be indexed
IndexIf -- allows indexing documents whose section matches the given pattern
IndexTime -- defines whether the Last-Modified HTTP header should be processed for date detection
IspellUsePrefixes -- allows using Ispell prefixes at search time
LangMapFile -- loads language map for character set and language guesser
LangMapUpdate -- activates updating of the loaded language maps
Limit -- describes a fast limit
LoadURLBasicInfo -- defines whether to load basic section values to display in search results
LoadChineseList -- loads a Chinese frequency dictionary
LoadTagInfo -- loads tag values to display in search results
LoadThaiList -- loads a Thai word frequency dictionary
LoadURLInfo -- loads extended section values to display in search results
LocalCharset -- defines local character set
Locale -- sets a desired locale
Log2Stderr -- defines whether to print messages to STDERR
LogLevel -- sets verbosity level
MaxDocSize -- defines maximal document size
MaxDocPerSite -- defines the maximum number of documents to pick up from each site
MaxHops -- defines the maximum path length in "mouse clicks"
MaxNetErrors -- defines the maximum number of network errors
MaxWordLength -- defines maximal word length
Mime -- defines external parser for given mime-type
MinCoordFactor -- gives more score to documents having query words closer to the beginning
MinWordLength -- defines minimal word length
MirrorHeadersRoot -- defines root directory for mirrored document headers
MirrorPeriod -- defines fresh period for mirrored files
MirrorRoot -- defines root directory for mirrored documents
NetErrorDelayTime -- defines document processing delay
NewsExtensions -- enables news extensions
NoIndexIf -- disallows indexing documents having a section matching a pattern.
NumSections -- tells the number of sections configured in indexer.conf
NumDistinctWordFactor -- gives more score to documents having more distinct words
NumWordFactor -- gives more score to documents having more found words
PagesPerScreen -- defines the number of search result page links
ParserTimeOut -- defines maximum allowed parser execution time
Period -- defines reindex period
PopRankFeedBack -- uses site weights when calculating Popularity Rank
PopRankShowCntRatio -- defines PopRankUseShowCnt threshold
PopRankShowCntWeight -- defines PopRankUseShowCnt strength
PopRankSkipSameSite -- skips links from same site
PopRankUseShowCnt -- PopRankUseShowCnt
PopRankUseTracking -- defines whether a site appearing in search results more often gets a higher Popularity Rank weight
Proxy -- defines HTTP proxy address
ProxyAuthBasic -- defines HTTP proxy user name and password
R0 - R9 -- sets random number range
ReadTimeOut -- defines stalled connections timeout
Realm -- describes Web-space for indexing, using regex/wild patterns
RemoteCharset -- defines default character set for Server or Realm
RemoteFileNameCharset -- defines default character set of file and directory names
ReplaceVar -- creates or modifies a variable
ResultContentType -- specifies the Content-Type header produced by search.cgi
ResultsLimit -- sets the maximum number of results displayed
ReverseAlias -- rewrites URL before inserting to the database
Robots -- defines whether to use robots.txt
SaveSectionSize -- defines whether to store section sizes for better relevancy quality
Section -- defines a document section
Server -- describes Web-space for indexing
ServerTable -- loads servers to index from the database
ServerWeight -- defines server weight for Popularity Rank calculation
Skip -- skips visiting the documents with URL matching the given pattern
SkipIf -- skips revisiting the documents with a section matching the given pattern
Spell -- loads an Ispell dictionary file
SQLWordForms -- loads synonyms or word forms from the database
StartHops -- defines Hops value for start URLs
StopwordFile -- loads stopwords file
StrictModeThreshold -- threshold to switch to a less strict search mode
StripAccents -- converts letters to their non-accented counterparts
Subnet -- Subnet
SubstringMatchMinWordLength -- defines minimal word length allowed for substring match
Suggest -- displays suggestions for misspelled search words
Synonym -- loads a synonym list from a file
SyslogFacility -- sets syslog facility
Tag -- assigns a generic grouping tag to a set of documents
URL -- inserts URL into database
UserCacheQuery -- stores a search result to the database using a user-defined SQL query
URLDataThreshold -- improves search performance for queries returning a small number of results
URLSelectCacheSize -- sets URL cache size for indexer
URLSelectSkipLock -- defines whether to skip locking URLs when fetching crawling targets from the database
UseCookie -- defines whether to use per-session cookies during indexing
UseLocalCachedCopy -- defines whether to use the original document as a source for excerpts and Cached Copy
UseCRC32URLId -- defines whether to use CRC32 for URL ID generation
UseNumericOperators -- defines whether to interpret numeric operators in a search query
UseRangeOperators -- defines whether to recognize range operators in a search query
UseRemoteContentType -- specifies whether to trust the Content-Type HTTP header from the remote servers
UserOrder -- specifies an SQL query for user defined ordering
UseSitemap -- defines whether to use Sitemap Protocol when crawling
UserScore -- specifies an SQL query to calculate user defined score for desired documents.
UserSiteScore -- specifies an SQL query to calculate user defined score for certain sites.
UserScoreFactor -- sets the effect of the UserScore command
VarDir -- defines mnoGoSearch working directory
VaryLang -- defines languages for multilingual indexing
wf -- sets the default weights for different document parts
WordCacheSize -- defines maximum allowed in-memory words cache size
WordDensityFactor -- gives more score to documents having higher word density
WordFormFactor -- gives more score to the original query word form (as opposed to Synonym or Ispell fuzzy forms)
WordDistanceWeight -- changes word distance impact on the document score
II. mnoGoSearch C API function reference
UdmEnvInit -- Allocates or initializes a search context variable
UdmEnvFree -- Closes a search context
UdmAgentInit -- Allocates or initializes a search session variable
UdmAgentFree -- Closes a search session
UdmAgentAddLine -- Adds a configuration command
UdmFind2 -- Executes a search query
UdmResultFree -- Frees a search result
A. mnoGoSearch change history
Changes in 3.3
Changes in 3.3.12 (December 15, 2011)
Changes in 3.3.11 (January 27, 2011)
Changes in 3.3.10 (November 23, 2010)
Changes in 3.3.9 (29 October 2009)
Changes in 3.3.8 (13 February 2009)
Changes in 3.3.7 (11 April 2008)
Changes in 3.3.6 (27 November 2007)
Changes in 3.3.5 (17 October 2007)
Changes in 3.3.4 (27 July 2007)
Changes in 3.3.3 (8 May 2007)
Changes in 3.3.2 (19 April 2007)
Changes in 3.3.1 (18 March 2007)
Changes in 3.3.0 (06 March 2007)
Index
List of Tables
3-1. Verbose levels
9-1. Supported character sets
9-2. Character set aliases
10-1. Available search parameters
12-1. Environment variables mnoGoSearch understands
12-2. server table schema
12-3. Server parameters in the table srvinfo.
List of Examples
1. UdmEnvInit example #1
2. UdmEnvInit example #2
1. UdmEnvFree example #1
2. UdmEnvFree example #2
1. UdmAgentInit example #1
2. UdmAgentInit example #2
1. UdmAgentFree example #1
2. UdmAgentFree example #2
1. UdmAgentAddLine example
1. UdmFind2 example
2. UdmFind2 - a complete search application example
3. Makefile example
1. UdmResultFree example #1