Chapter 9. Internationalization

Table of Contents
Character sets
Search pages with multi-lingual interface
Segmenters for Chinese, Thai and Japanese languages
Indexing multilingual servers

Character sets

Supported character sets

mnoGoSearch supports almost all known 8-bit character sets, as well as the most widely used multi-byte character sets, including Korean EUC-KR, Chinese Big5 and GB2312, Japanese Shift-JIS, EUC-JP and ISO-2022-JP, and UTF-8. Some multi-byte character sets are not supported by default because their conversion tables are large and increase the size of the executables. See the configure parameters to enable support for extra character sets.

mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.

Table 9-1. Supported character sets

Languages | Character sets
Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland
Eastern Europe: Croatian, Czech, Hungarian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian
Baltic: Latvian, Lithuanian, Estonian | CP1257, ISO 8859-4, ISO 8859-13
Cyrillic: Bulgarian, Belorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic
Arabic | CP864, CP1256, ISO 8859-6, MacArabic
Greek | CP869, CP1253, ISO 8859-7, MacGreek
Hebrew | CP1255, ISO 8859-8, MacHebrew
Turkish | CP857, CP1254, ISO 8859-9, MacTurkish
Japanese | Shift-JIS, EUC-JP, ISO-2022-JP
Simplified Chinese | GB2312
Traditional Chinese | Big5
Thai | CP874, TIS 620, MacThai
Indian | MacGujarati, TSCII
Unicode: over 650 languages | UTF-8

Multiple languages in the same database

mnoGoSearch allows you to index documents in different languages into the same database. The disk space required to store search data depends on the character set that mnoGoSearch uses to store it. This character set is specified with the LocalCharset command.

Character set conversion

indexer converts all documents to the character set specified by the LocalCharset command in indexer.conf. Internally, the conversion is implemented using Unicode.

mnoGoSearch performs character conversion in a lossless manner. Ordinarily, conversion between different character sets can lose data. For example, converting a text file from Greek cp1253 to Russian cp1251 would lose all Greek characters. To avoid data loss, mnoGoSearch stores every character that cannot be converted directly to LocalCharset using the &#nnn; notation, where nnn is the decimal Unicode code point of the character.
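
The fallback described above can be illustrated with a short Python sketch. Python is used only for illustration here; this is not mnoGoSearch code, and the helper name to_local_charset is hypothetical. Python's built-in xmlcharrefreplace error handler happens to produce the same decimal &#nnn; notation for characters that the target character set cannot represent.

```python
import html

def to_local_charset(text: str, local_charset: str) -> bytes:
    """Encode text, writing unrepresentable characters as &#nnn; references."""
    return text.encode(local_charset, errors="xmlcharrefreplace")

greek = "\u0391\u03b8\u03ae\u03bd\u03b1"            # "Αθήνα"
russian = "\u041c\u043e\u0441\u043a\u0432\u0430"    # "Москва"

# Greek letters do not exist in cp1251, so each one becomes a numeric reference.
stored = to_local_charset(greek, "cp1251")
print(stored)  # b'&#913;&#952;&#942;&#957;&#945;'

# Cyrillic text fits cp1251 directly: one byte per letter, no references.
print(len(to_local_charset(russian, "cp1251")))  # 6

# Decoding the references restores the original text, so nothing is lost.
restored = html.unescape(stored.decode("cp1251"))
print(restored == greek)  # True
```

The round trip at the end is the point of the notation: the stored bytes are valid cp1251, yet the Greek text can still be recovered from them.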

The &#nnn; sequences are bulky: each takes 6 to 8 bytes for a character in the Basic Multilingual Plane, compared with 1 to 3 bytes for the same character in UTF-8. To avoid wasting disk space on a large number of such sequences, it is important to choose a good value for LocalCharset. If your document collection contains documents in many scripts, such as Greek, Russian and German, UTF-8 is usually the best choice for LocalCharset.
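
To get a feel for the trade-off, the following Python sketch compares how many bytes a few sample words would occupy under different LocalCharset values. It is an approximation under stated assumptions, not a measurement of a real mnoGoSearch database: the helper stored_size and the sample words are hypothetical, and only the raw text size is counted.

```python
def stored_size(text: str, local_charset: str) -> int:
    """Bytes needed for text when unrepresentable characters become &#nnn;."""
    return len(text.encode(local_charset, errors="xmlcharrefreplace"))

samples = {
    "Greek": "\u0391\u03b8\u03ae\u03bd\u03b1",          # "Αθήνα"
    "Russian": "\u041c\u043e\u0441\u043a\u0432\u0430",  # "Москва"
    "German": "M\u00fcnchen",                            # "München"
}

# Print the per-language and total size for each candidate LocalCharset.
totals = {}
for charset in ("cp1251", "cp1253", "utf-8"):
    sizes = {name: stored_size(text, charset) for name, text in samples.items()}
    totals[charset] = sum(sizes.values())
    print(charset, sizes, "total:", totals[charset])
```

A single-byte character set is compact only for the script it covers. As soon as the collection mixes scripts, the numeric references outweigh the savings and UTF-8 yields the smallest total.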

Character set conversion at search time

Use the BrowserCharset command to choose the character set in which search results are displayed. If BrowserCharset and LocalCharset have different values, mnoGoSearch applies character set conversion. As at indexing time, characters that cannot be converted to BrowserCharset are displayed using the &#nnn; notation.

Character set aliases

Every character set is recognized by a number of aliases, because different web servers can report the same character set under different names. For example, ISO-8859-2, ISO8859-2 and latin2 are all names of the same character set. mnoGoSearch understands the following character set name aliases:

Table 9-2. Character set aliases

ISO-8859-1: CP819, CSISOLATIN1, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1
ISO-8859-10: CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6
ISO-8859-11: ISO-8859-11, TIS-620, TIS620, TACTIS
ISO-8859-13: ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7
ISO-8859-14: ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8
ISO-8859-15: ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998
ISO-8859-16: ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000
ISO-8859-2: CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2
ISO-8859-3: CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3
ISO-8859-4: CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4
ISO-8859-5: CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988
ISO-8859-6: ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987
ISO-8859-7: CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987
ISO-8859-8: CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988
ISO-8859-9: CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5
armscii-8: ARMSCII-8, ARMSCII8
cp1250: CP1250, MS-EE, WINDOWS-1250
cp1251: CP1251, MS-CYRL, WINDOWS-1251
cp1252: CP1252, MS-ANSI, WINDOWS-1252
cp1253: CP1253, MS-GREEK, WINDOWS-1253
cp1254: CP1254, MS-TURK, WINDOWS-1254
cp1255: CP1255, MS-HEBR, WINDOWS-1255
cp1256: CP1256, MS-ARAB, WINDOWS-1256
cp1257: CP1257, WINBALTRIM, WINDOWS-1257
cp1258: CP1258, WINDOWS-1258
cp437: 437, CP437, IBM437
cp850: 850, CP850, CSPC850MULTILINGUAL, IBM850
cp852: 852, CP852, IBM852
cp855: 855, CP855, IBM855
cp857: 857, CP857, IBM857
cp860: 860, CP860, IBM860
cp861: 861, CP861, IBM861
cp862: 862, CP862, IBM862
cp863: 863, CP863, IBM863
cp864: 864, CP864, IBM864
cp865: 865, CP865, IBM865
cp866: 866, CP866, CSIBM866, IBM866
cp869: 869, CP869, IBM869
cp874: CP874, WINDOWS-874
GB2312: CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58
koi8-r: CSKOI8R, KOI8-R, KOI8R
cp367: ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII
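
Conceptually, an alias table like the one above is a lookup from any known name to one canonical name. The Python sketch below shows the idea with three rows copied from Table 9-2. It is only an illustration, not the way mnoGoSearch implements the lookup; the names ALIASES, LOOKUP and canonical_charset are hypothetical, and case-insensitive matching is an assumption.

```python
# A few rows from Table 9-2: canonical name -> known aliases.
ALIASES = {
    "ISO-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "cp1251": ["CP1251", "MS-CYRL", "WINDOWS-1251"],
    "koi8-r": ["CSKOI8R", "KOI8-R", "KOI8R"],
}

# Invert the table once so that each alias points at its canonical name.
LOOKUP = {
    alias.upper(): canonical
    for canonical, names in ALIASES.items()
    for alias in names
}

def canonical_charset(name: str):
    """Return the canonical character set name, or None if it is unknown."""
    return LOOKUP.get(name.strip().upper())

print(canonical_charset("latin2"))        # ISO-8859-2
print(canonical_charset("Windows-1251"))  # cp1251
print(canonical_charset("x-unknown"))     # None
```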

Document character set detection

indexer determines the character set of a document by checking the following sources in this order:

  1. Content-Type: text/html; charset=xxx - HTTP response headers.

  2. <META NAME="Content-Type" CONTENT="text/html; charset=xxx"> (for HTML documents) or

    <?xml version="1.0" encoding="xxx"?> (for XML documents)

    Note: Processing of the meta tags can be switched off by adding GuesserUseMeta no to indexer.conf.

  3. The default value given by the RemoteCharset command that applies to the corresponding Server or Realm command.
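
The order above can be sketched in Python as follows. This is a simplified illustration, not the indexer's actual parser: the function name detect_charset and its parameters are hypothetical, the regular expressions cover only the common forms shown in the list, and it is an assumption that the XML declaration is still consulted when meta tag processing is switched off.

```python
import re

HEADER_RE = re.compile(r'charset\s*=\s*"?([\w.:-]+)', re.IGNORECASE)
META_RE = re.compile(
    r'<meta[^>]+content\s*=\s*["\'][^"\']*charset\s*=\s*([\w.:-]+)',
    re.IGNORECASE,
)
XML_RE = re.compile(
    r'<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)["\']', re.IGNORECASE
)

def detect_charset(content_type, body, remote_charset, use_meta=True):
    """Return the first character set found, following the documented order."""
    # 1. The charset parameter of the Content-Type HTTP response header.
    if content_type:
        match = HEADER_RE.search(content_type)
        if match:
            return match.group(1)
    # 2. The META tag (unless disabled, as with "GuesserUseMeta no") ...
    if use_meta:
        match = META_RE.search(body)
        if match:
            return match.group(1)
    #    ... or the XML declaration.
    match = XML_RE.search(body)
    if match:
        return match.group(1)
    # 3. The configured default (the RemoteCharset value).
    return remote_charset

page = '<META NAME="Content-Type" CONTENT="text/html; charset=windows-1251">'
print(detect_charset("text/html; charset=koi8-r", page, "iso-8859-1"))  # koi8-r
print(detect_charset("text/html", page, "iso-8859-1"))  # windows-1251
print(detect_charset("text/html", page, "iso-8859-1", use_meta=False))  # iso-8859-1
```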

Automatic character set guesser

Starting with version 3.2.0, mnoGoSearch includes an automatic character set and language guesser. It currently recognizes more than 100 character set and language combinations. Detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called language map files, one for every language-charset pair. By default they are installed in the /usr/local/mnogosearch/etc/langmap/ directory. Look in this directory for the list of character set and language pairs currently provided.

Note: The character set and language guesser works well for texts longer than 500 characters. Shorter texts may be guessed less reliably.
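
For readers unfamiliar with N-gram-based text categorization, the following Python sketch shows the general idea: build a ranked list of the most frequent character n-grams for each known language, then pick the language whose ranking is closest to that of the document. It is a minimal sketch of the published technique, not a reimplementation of mguesser. The function names, the profile size of 300, the tiny training sentences and the in-memory profiles are all assumptions made for illustration, and the language map file format is not reproduced.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the map get a penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )

def guess(text, maps):
    """Return the name of the map with the smallest out-of-place distance."""
    doc_profile = ngram_profile(text)
    return min(maps, key=lambda name: out_of_place(doc_profile, maps[name]))

# Toy training samples; real language maps are built from much larger texts.
EN = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
DE = "der schnelle braune fuchs springt ueber den faulen hund und dann schlaeft der hund"
maps = {"en": ngram_profile(EN), "de": ngram_profile(DE)}

print(guess(EN, maps))
print(guess(DE, maps))
```

With samples this small the statistics are thin, which mirrors the note above: short texts give the guesser little to work with.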

Building your own language maps

To build your own language map, use the mguesser utility. You will also need a set of text files containing sample texts (the models) in the desired language and character set. To create a new language map, run the following command:

mguesser -p -c charset -l language < FILENAME > language.charset.lm

You can also use mguesser to guess language and character set for a document using the existing language maps. Try the following command:

mguesser [-n maxhits] < FILENAME

You may want to create map files for several character sets of the same language. To convert a model file between character sets supported by mnoGoSearch, use the mconv utility, which is part of the mnoGoSearch distribution.

mconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile

By default, both the mguesser and mconv utilities are installed in the /usr/local/mnogosearch/sbin/ directory.

Starting with version 3.2.14, mnoGoSearch can update the existing language and character set maps automatically during indexing, provided that the remote server supplies pages with a correctly specified language and character set. To enable this feature, add the following command to your indexer.conf:

LangMapUpdate yes

The default character set

Use the RemoteCharset indexer.conf command to choose the default character set of the sites you index.

The default language

You can also set the default language for the sites you index with the DefaultLang indexer.conf command.

Note: You can restrict search results to a specific language by using the g query string variable. See the Section called Search parameters in Chapter 11 for details.