|
The custom dictionary functions are designed to help you manage dictionaries
that customize the stemming and tokenization in MarkLogic Server.
The custom dictionary function module is installed as the following file:
install_dir/Modules/MarkLogic/custom-dictionary.xqy
where install_dir is the directory in which
MarkLogic Server is installed.
To use the custom-dictionary.xqy module in your own XQuery modules,
include the following line in your XQuery prolog:
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
at "/MarkLogic/custom-dictionary.xqy";
The library namespace prefix cdict is not predefined in
the server.
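As a minimal sketch of how the import is typically used, the following module writes a small English custom dictionary with cdict:dictionary-write. The words "Furbies" and "Furby" are hypothetical sample data, and the sketch assumes the executing user has sufficient privileges to modify custom dictionaries.

```xquery
xquery version "1.0-ml";

import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: Hypothetical entry: stem the word "Furbies" to "Furby". :)
let $dict :=
  <cdict:dictionary>
    <cdict:entry>
      <cdict:word>Furbies</cdict:word>
      <cdict:stem>Furby</cdict:stem>
    </cdict:entry>
  </cdict:dictionary>
(: Write the dictionary for English ("en"). :)
return cdict:dictionary-write("en", $dict)
```

The cdict prefix resolves inside the element constructor because the import statement declares it for the whole module.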
Changes affect stemming and tokenization results immediately.
Queries started after a custom dictionary is written or deleted will use the
new behavior.
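One way to confirm that a change has taken effect is to call the built-in cts:stem and cts:tokenize functions after updating a dictionary. This is only a sketch: the word "Furbies" and the Japanese sample text are hypothetical, and the results depend on the dictionaries and language support installed on your server.

```xquery
xquery version "1.0-ml";

(
  (: Stems the server currently produces for a hypothetical English word. :)
  cts:stem("Furbies", "en"),
  (: Tokens the server currently produces for a hypothetical Japanese string. :)
  cts:tokenize("東京スカイツリー", "ja")
)
```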
Documents are not automatically reindexed after a custom dictionary change.
To get accurate results for stemmed searches, documents must be reindexed.
If it is not practical to reindex all documents, use this process to selectively
reindex affected documents:
- Collect the words which will be affected by the change. These are
the contents of the
word elements which will be added,
deleted, or have their stems changed.
- Search for the documents which contain these words and save the URIs.
- Update the custom dictionaries.
- Make an idempotent update to each of the documents in the list,
for example by adding an element to each document and then deleting it. This will
cause each document to be reindexed.
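The steps above can be sketched as two separate queries. This is an illustration under stated assumptions, not a prescribed procedure: the affected words, the scratch document URI, and the choice of xdmp:node-replace as the idempotent update are all hypothetical, and the second query assumes the affected documents are XML documents with a single root element.

Run the first query before updating the custom dictionaries, so that the search still uses the old stemming behavior:

```xquery
xquery version "1.0-ml";

(: Hypothetical list of words affected by the dictionary change. :)
let $words := ("Furbies", "Furby")
(: Find the documents that contain any of these words. :)
let $uris :=
  fn:distinct-values(
    for $doc in cts:search(fn:doc(), cts:word-query($words))
    return xdmp:node-uri($doc))
(: Save the URIs in a scratch document (hypothetical URI). :)
return
  xdmp:document-insert(
    "/tmp/cdict-affected-uris.xml",
    <uris>{ for $uri in $uris return <uri>{$uri}</uri> }</uris>)
```

After the custom dictionaries have been updated, run the second query to touch each saved document:

```xquery
xquery version "1.0-ml";

(: Replace each root element with itself; the content is unchanged,
   but the update causes the document to be reindexed. :)
for $uri in fn:doc("/tmp/cdict-affected-uris.xml")/uris/uri/fn:string()
return xdmp:node-replace(fn:doc($uri)/*, fn:doc($uri)/*)
```

For a large list of URIs, it may be preferable to split the second query into batches so that each transaction stays small.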
Japanese ("ja"), Simplified Chinese ("zh"), and Traditional
Chinese ("zh_Hant") use a linguistic tokenizer to divide text into
tokens (words and punctuation).
The custom dictionary affects the tokenizer for these languages. For Japanese,
it also affects the stemmer. For all of these languages,
a custom dictionary entry may have an optional
cdict:pos element to give the part of
speech for that word. The common codes for part of speech are as follows:
| cdict:pos Value | Part of Speech |
| Adj | Adjective |
| Adv | Adverb |
| Interj | Interjection |
| Nn | Noun |
| Nn-Prop | Proper noun (default value for pos) |
| Verb | Verb |
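As an illustration of the part-of-speech codes, the following sketch writes a Japanese dictionary whose single entry carries a cdict:pos element. The word is hypothetical sample data, and the element order shown (word, stem, pos) is an assumption; check the dictionary schema shipped with your server version before relying on it.

```xquery
xquery version "1.0-ml";

import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: Hypothetical entry: treat the word as a single proper-noun token. :)
let $dict :=
  <cdict:dictionary>
    <cdict:entry>
      <cdict:word>スカイツリー</cdict:word>
      <cdict:stem>スカイツリー</cdict:stem>
      <cdict:pos>Nn-Prop</cdict:pos>
    </cdict:entry>
  </cdict:dictionary>
(: Write the dictionary for Japanese ("ja"). :)
return cdict:dictionary-write("ja", $dict)
```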
Other supported languages tokenize based on spaces and punctuation.
For these languages, the custom dictionary only affects the stemmer.
If a pos element is provided, it is ignored.
|