This page was generated March 13, 2012, 4:49 AM
XQuery & XSLT Built-In & Modules Function Reference

Module: Custom Dictionary Management

The custom dictionary functions are designed to help you manage dictionaries that customize the stemming and tokenization in MarkLogic Server. The custom dictionary function module is installed as the following file:

  • install_dir/Modules/MarkLogic/custom-dictionary.xqy

where install_dir is the directory in which MarkLogic Server is installed.

To use the custom-dictionary.xqy module in your own XQuery modules, include the following line in your XQuery prolog:

import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" at "/MarkLogic/custom-dictionary.xqy";

The library namespace prefix cdict is not predefined in the server.

Changes affect stemming and tokenization results immediately. Queries started after a custom dictionary is written or deleted will use the new behavior.

Documents are not automatically reindexed after a custom dictionary change. To get accurate results for stemmed searches, documents must be reindexed. If it is not practical to reindex all documents, use this process to selectively reindex affected documents:

  1. Collect the words which will be affected by the change. These are the contents of the word elements which will be added, deleted, or have their stems changed.
  2. Search for the documents which contain these words and save the URIs.
  3. Update the custom dictionaries.
  4. Make an idempotent update to each of the documents in the list, for example by adding and then deleting an element in each one. This causes each document to be reindexed.
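
The four steps above can be sketched as one multi-statement request. This is a minimal sketch under several assumptions, not a definitive procedure: the dictionary file path and the URI-list document /tmp/cdict-reindex-uris.xml are hypothetical, the word list is a superset (every word in the old or new dictionary, not only the changed ones), replacing a root element with itself stands in for the idempotent update, and each semicolon-separated statement is assumed to count as a newly started query. If that last assumption does not hold in your environment, run the three statements as separate requests.

```xquery
xquery version "1.0-ml";
(: Statement 1 (steps 1 and 2): collect affected words, save matching URIs. :)
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
    at "/MarkLogic/custom-dictionary.xqy";

(: Hypothetical location of the new dictionary file. :)
let $new := xdmp:document-get("/var/tmp/cdict-en.xml")/*
let $old := cdict:dictionary-read("en")
(: Superset of affected words: every word in the old or new dictionary. :)
let $words :=
  fn:distinct-values(($old, $new)/cdict:entry/cdict:word/fn:string(.))
let $uris :=
  fn:distinct-values(
    for $doc in cts:search(fn:doc(), cts:word-query($words, "lang=en"))
    return xdmp:node-uri($doc))
return
  (: Hypothetical URI used to carry the list to the later statement. :)
  xdmp:document-insert("/tmp/cdict-reindex-uris.xml",
    <uris>{ for $u in $uris return <uri>{$u}</uri> }</uris>);

xquery version "1.0-ml";
(: Statement 2 (step 3): install the new dictionary. :)
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
    at "/MarkLogic/custom-dictionary.xqy";

cdict:dictionary-write("en", xdmp:document-get("/var/tmp/cdict-en.xml")/*);

xquery version "1.0-ml";
(: Statement 3 (step 4): touch each saved document so it is reindexed. :)
for $uri in fn:doc("/tmp/cdict-reindex-uris.xml")/uris/uri/fn:string(.)
let $root := fn:doc($uri)/*
where fn:exists($root)   (: skips documents with no root element :)
return xdmp:node-replace($root, $root)
```

On a large database the search in the first statement may return many documents, so a real run would probably need batching.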

Japanese ("ja"), Simplified Chinese ("zh"), and Traditional Chinese ("zh_Hant") use a linguistic tokenizer to divide text into tokens (words and punctuation). The custom dictionary affects the tokenizer for these languages. For Japanese, it also affects the stemmer. For all of these languages, a custom dictionary entry may have an optional cdict:pos element to give the part of speech for that word. The common codes for part of speech are as follows:

cdict:pos Value    Part of Speech
Adj                Adjective
Adv                Adverb
Interj             Interjection
Nn                 Noun
Nn-Prop            Proper noun (default value for pos)
Verb               Verb

Other supported languages tokenize based on spaces and punctuation. For these languages, the custom dictionary only affects the stemmer. If a pos element is provided, it is ignored.
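
As an illustration of a part-of-speech entry, the following sketch installs a one-entry Japanese dictionary that marks a compound name as a proper noun. The word itself is only an example, and the position of cdict:pos after cdict:word and cdict:stem is an assumption; check the dictionary schema before relying on this element order.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
    at "/MarkLogic/custom-dictionary.xqy";

let $dict :=
  <cdict:dictionary
      xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary"
      xml:lang="ja">
    <cdict:entry>
      <cdict:word>東京スカイツリー</cdict:word>
      <cdict:stem>東京スカイツリー</cdict:stem>
      <!-- Assumed placement: cdict:pos as the last child of the entry. -->
      <cdict:pos>Nn-Prop</cdict:pos>
    </cdict:entry>
  </cdict:dictionary>
return
  cdict:dictionary-write("ja", $dict)
```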

Function Summary
cdict:dictionary-delete : Delete the custom dictionary for $lang, an ISO language code for a licensed language.
cdict:dictionary-read : If $lang matches a licensed language with a custom dictionary, the custom dictionary from the local host is returned.
cdict:dictionary-write : Install the custom dictionary $dict for $lang, an ISO language code.
cdict:get-languages : Return the ISO language codes for all licensed languages.
Function Detail
cdict:dictionary-delete(
  $lang as xs:string
) as empty-sequence()
Summary:

Delete the custom dictionary for $lang, an ISO language code for a licensed language. Returns an empty sequence. Raises an XDMP-LANG error if $lang is not a licensed language.

Parameters:
$lang : The ISO language code of the dictionary.

Required Privilege:

http://marklogic.com/xdmp/privileges/custom-dictionary-admin

Example:
  xquery version "1.0-ml";
  import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
		  at "/MarkLogic/custom-dictionary.xqy";

  cdict:dictionary-delete("en")
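
Because the function raises XDMP-LANG for an unlicensed language, a caller may want to catch that case. The following is a sketch only: the language code "xx" is a hypothetical value chosen to exercise the error path, and the returned messages are illustrative.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
    at "/MarkLogic/custom-dictionary.xqy";
declare namespace error = "http://marklogic.com/xdmp/error";

(: Hypothetical language code, used only to trigger the error path. :)
let $lang := "xx"
return
  try {
    cdict:dictionary-delete($lang),
    fn:concat("Deleted custom dictionary for ", $lang)
  } catch ($e) {
    (: Report an unlicensed language; pass any other error through. :)
    if ($e/error:code = "XDMP-LANG")
    then fn:concat($lang, " is not a licensed language")
    else xdmp:rethrow()
  }
```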
  

cdict:dictionary-read(
  $lang as xs:string
) as element(cdict:dictionary)?
Summary:

If $lang matches a licensed language with a custom dictionary, the custom dictionary from the local host is returned. The dictionary will have an xml:lang attribute for the language. If there is no custom dictionary for that language, an empty sequence is returned. Raises an XDMP-LANG error if $lang is not a licensed language.

Parameters:
$lang : The ISO language code of the dictionary.

Required Privilege:

http://marklogic.com/xdmp/privileges/custom-dictionary-user

Example:
  xquery version "1.0-ml";
  import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
		  at "/MarkLogic/custom-dictionary.xqy";

  cdict:dictionary-read("en")

  => <cdict:dictionary 
          xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary"
          xml:lang="en">
       <cdict:entry>
         <cdict:word>Furbies</cdict:word>
         <cdict:stem>Furby</cdict:stem>
       </cdict:entry>
       <cdict:entry>
         <cdict:word>servlets</cdict:word>
         <cdict:stem>servlet</cdict:stem>
       </cdict:entry>
     </cdict:dictionary>
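
The returned element can be processed like any other XML. As a small sketch, the following lists each word with its stem, assuming the entry structure shown in the result above.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
    at "/MarkLogic/custom-dictionary.xqy";

(: Returns one "word -> stem" string per entry, or nothing if no
   custom dictionary exists for the language. :)
let $dict := cdict:dictionary-read("en")
for $entry in $dict/cdict:entry
order by fn:string($entry/cdict:word)
return fn:concat($entry/cdict:word, " -> ", $entry/cdict:stem)
```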
  

cdict:dictionary-write(
  $lang as xs:string,
  $dict as element(cdict:dictionary)
) as empty-sequence()
Summary:

$lang is an ISO language code and $dict is the custom dictionary to install. If $lang matches a licensed language and $dict validates, the custom dictionary is installed on the cluster. Returns an empty sequence. Raises an XDMP-LANG error if $lang is not a licensed language. Raises validation errors if the dictionary fails to validate.

Parameters:
$lang : The ISO language code of the dictionary.
$dict : A custom dictionary.

Required Privilege:

http://marklogic.com/xdmp/privileges/custom-dictionary-admin

Example:
  xquery version "1.0-ml";
  import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
      at "/MarkLogic/custom-dictionary.xqy";

  let $dict := xdmp:document-get("/var/tmp/cdict-en.xml")/*
  return 
    cdict:dictionary-write("en",$dict)
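
The dictionary can also be constructed inline instead of being loaded from a file. This sketch reuses the element structure from the cdict:dictionary-read example; whether the xml:lang attribute is required on write is not stated here, so it is included only to mirror that output.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
    at "/MarkLogic/custom-dictionary.xqy";

(: Build a one-entry dictionary in the query itself. :)
let $dict :=
  <cdict:dictionary
      xmlns:cdict="http://marklogic.com/xdmp/custom-dictionary"
      xml:lang="en">
    <cdict:entry>
      <cdict:word>Furbies</cdict:word>
      <cdict:stem>Furby</cdict:stem>
    </cdict:entry>
  </cdict:dictionary>
return
  cdict:dictionary-write("en", $dict)
```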
  

cdict:get-languages() as xs:string*
Summary:

Return the ISO language codes for all licensed languages.

Required Privilege:

http://marklogic.com/xdmp/privileges/custom-dictionary-user

Example:
  xquery version "1.0-ml";
  import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" 
		  at "/MarkLogic/custom-dictionary.xqy";

  cdict:get-languages()

  ==> ("en", "ja", "zh", "zh_Hant")
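
Combined with cdict:dictionary-read, this function can give a quick inventory of custom dictionaries. The following is a sketch; the report format is illustrative, and a language without a custom dictionary is reported with 0 entries.

```xquery
xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
    at "/MarkLogic/custom-dictionary.xqy";

(: One line per licensed language with its custom dictionary entry count. :)
for $lang in cdict:get-languages()
let $dict := cdict:dictionary-read($lang)
return fn:concat($lang, ": ", fn:count($dict/cdict:entry), " entries")
```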