This page was generated July 7, 2010, 4:03 PM.
XQuery Built-In and Modules Function Reference

Built-In: Classifier

The classifier built-in functions perform automatic classification of documents using training data. The classifiers that result from training are represented in XML. The classifier APIs and the XML output from cts:train conform to the classifier.xsd schema, located in the Config directory under the directory in which MarkLogic Server is installed.

Function Summary
cts:classify Classifies a sequence of nodes based on training data.
cts:thresholds Computes precision, recall, the F measure, and thresholds for the classes computed by the classifier, by comparing the computed labels with the known labels for the same set of documents.
cts:train Produces a set of classifiers from a list of labeled training documents.
Function Detail
cts:classify(
$data-nodes as node()*,
$classifier as element(cts:classifier),
$options as element()?,
$training-nodes as node()*
)  as  element(cts:label)*
Summary:

Classifies a sequence of nodes based on training data. The training data is in the form of a classifier specification, which is typically the output of cts:train. Returns one label for each input node, in the same order as the input nodes.

Parameters:
$data-nodes : The sequence of nodes to be classified.
$classifier : An element node containing the classifier specification. This is typically the output of cts:train, either run directly or saved in an XML document in the database.
$options :

An options element. The options for classification are passed automatically from cts:train to the cts:classifier specification as part of the classifier element so that they are consistent with the parameters used in training. The following option may be separately passed to cts:classify and is in the cts:classify namespace:

<thresholds>

A definition of the thresholds to use in classification. This is a complex element with one or more <threshold> children. You can specify both a global value and per-class values (as computed from cts:thresholds). The global value will apply to any classes for which a per-class value is not specified. For example:
   <options xmlns="cts:classify">
     <thresholds>
       <threshold>-1.0</threshold>
       <threshold class="Example 1">-2.42</threshold>
     </thresholds>
   </options>
   
$training-nodes : The sequence of training nodes used to train the classifier. Required if the supports form of the classifier is used; ignored if the weights form of the classifier is used.

Usage Notes:

cts:classify classifies a sequence of nodes using the output from cts:train. The $data-nodes and $classifier parameters are respectively the nodes to be classified and the specification output from cts:train. cts:classify can use either supports or weights forms of the $classifier output from cts:train (see Output Formats). If the supports form is used, the training nodes must be passed as the 4th parameter. The $options parameter is an options element in the cts:classify namespace.

The output is a sequence of label elements of the form:

<cts:label> 
  <cts:class name="Example 1" val="-0.003"/>
  <cts:class name="Example 2" val="1.4556"/>
  ...
</cts:label>

Each label corresponds to the data node in the corresponding position in the input sequence. There will be a <class> child for each class where the document passed the class threshold. The val attribute gives the class membership value for the data node in the given class. Values greater than zero indicate likely class membership; values less than zero indicate likely non-membership. Adjusting thresholds makes classification more or less selective: increasing the threshold makes classification more selective (that is, it decreases the likelihood of classification in the class), and decreasing the threshold makes it less selective.
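As a minimal sketch of how this output might be consumed, the following query picks the highest-valued class from each label. The helper local:best-class is hypothetical (it is not part of the classifier API), and the sample labels are made up for illustration. In practice they would come from cts:classify.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: returns the name of the class with the highest
   val in a label, or the empty sequence if the label has no classes
   (that is, no class passed its threshold). :)
declare function local:best-class($label as element(cts:label)) as xs:string?
{
  (for $c in $label/cts:class
   order by xs:double($c/@val) descending
   return fn:string($c/@name))[1]
};

(: Illustrative labels standing in for the output of cts:classify. :)
let $labels := (
  <cts:label>
    <cts:class name="Example 1" val="-0.003"/>
    <cts:class name="Example 2" val="1.4556"/>
  </cts:label>,
  <cts:label/>
)
for $label at $i in $labels
return fn:concat($i, ": ", (local:best-class($label), "(none)")[1])
```

For the sample labels, this yields "1: Example 2" and "2: (none)". The position variable $i lines up with the position of the corresponding node in the input sequence.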


Example:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $secondhalf := xdmp:directory("/shakespeare/plays/", "1")[20 to 37]
let $classifier :=  
  let $labels := for $x in $firsthalf 
         return
         <cts:label>
           <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                 //playtype/text()}"/>
         </cts:label>
  return
  cts:train($firsthalf, $labels, 
          <options xmlns="cts:train">
            <classifier-type>supports</classifier-type>
          </options>)
return
cts:classify($secondhalf, $classifier, 
             <options xmlns="cts:classify"/>,
             $firsthalf)

  => ( <label>...</label>,... )
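For comparison, here is a sketch of the same classification using the weights form of the classifier. It assumes the same /shakespeare/plays/ directory and playtype property as the example above. Because cts:train is called without options, the classifier type falls back to its default (weights), and an empty sequence is passed for the training nodes, which the weights form ignores.

```xquery
xquery version "1.0-ml";

let $plays := xdmp:directory("/shakespeare/plays/", "1")
(: fn:subsequence is used here in place of the range predicates above :)
let $firsthalf := fn:subsequence($plays, 1, 19)
let $secondhalf := fn:subsequence($plays, 20, 18)
let $labels :=
  for $x in $firsthalf
  return
    <cts:label>
      <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                        //playtype/text()}"/>
    </cts:label>
(: No options element: classifier-type defaults to weights. :)
let $classifier := cts:train($firsthalf, $labels)
return
  cts:classify($secondhalf, $classifier,
               <options xmlns="cts:classify"/>,
               ())
```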


cts:thresholds(
$computed-labels as element(cts:label)*,
$known-labels as element(cts:label)*,
[$recall-weight as xs:double]
)  as  element(cts:thresholds)?
Summary:

Computes precision, recall, the F measure, and thresholds for the classes computed by the classifier, by comparing the computed labels with the known labels for the same set of documents.

Parameters:
$computed-labels : A sequence of element nodes containing the labels from classification (the output from cts:classify) for a set of documents.
$known-labels : A sequence of element nodes containing the known labels for the same set of documents.
$recall-weight (optional): The factor to use in the calculation of the F measure. The number should be non-negative. A value of 0 means F is just precision and a value of +INF means F is just recall. The default is 1, which gives the harmonic mean of precision and recall.

Usage Notes:

You use the output of cts:thresholds to determine the best threshold values for your data, based on the first pass through the first part of your training data. The output of cts:thresholds provides you with precision and recall measurements at the calculated thresholds for each class. The following are the definitions of the attributes of the thresholds element returned by cts:thresholds:

name

The name of the class.

threshold

The threshold that is computed by the classifier to give the best results. The threshold is used by cts:classify when classifying documents, and is defined to be the positive or negative distance from the hyperplane which represents the edge of the class.

precision

A number which represents the fraction of nodes identified in a class that are actually in that class. As this approaches 1, there is a higher probability that you over-classified.

recall

A number which represents the fraction of nodes in a class that were identified by the classifier as being in that class. As this approaches 1, there is a higher probability that you under-classified.

F (the F-measure)

A measure which represents whether the classification at the given threshold is closer to recall or closer to precision. A value of 1 indicates that precision and recall have equal weight. A value of 0.5 indicates that precision is weighted 2x recall. A value of 2 indicates that recall is weighted 2x precision. A value of 0 indicates that the weighting is precision only, and a value of +INF (xs:double('+INF')) indicates that weighting is recall only.
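The weighting described above can be illustrated with the conventional weighted F formula, F = (1 + w²)·P·R / (w²·P + R). This is a sketch that assumes the recall weight plays the role of w in that formula; the source does not state the exact formula MarkLogic uses. The function local:f-measure is hypothetical and handles finite weights only (with +INF the expression evaluates to NaN rather than to recall).

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: conventional weighted F measure for a finite,
   non-negative recall weight. Assumes the standard formula. :)
declare function local:f-measure(
  $precision as xs:double,
  $recall as xs:double,
  $recall-weight as xs:double
) as xs:double
{
  let $w2 := $recall-weight * $recall-weight
  let $denominator := $w2 * $precision + $recall
  return
    if ($denominator eq 0) then 0e0
    else (1 + $w2) * $precision * $recall div $denominator
};

(
  local:f-measure(1e0, 0.666667e0, 1e0),   (: TRAGEDY row of the sample output :)
  local:f-measure(0.916667e0, 1e0, 1e0)    (: COMEDY row of the sample output :)
)
```

With the default weight of 1, the two calls evaluate to approximately 0.8 and 0.956522, which agrees with the f values for TRAGEDY and COMEDY in the sample cts:thresholds output in the example. With a weight of 0 the expression reduces to the precision.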

Example:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $secondhalf := xdmp:directory("/shakespeare/plays/", "1")[20 to 37]
let $firstlabels := for $x in $firsthalf 
        return
        <cts:label>
          <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                                     //playtype/text()}"/>
        </cts:label>
let $secondlabels := for $x in $secondhalf 
        return
        <cts:label>
          <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                                     //playtype/text()}"/>
        </cts:label>
let $classifier :=  
    cts:train($firsthalf, $firstlabels, 
      <options xmlns="cts:train">
        <classifier-type>supports</classifier-type>
      </options>)
let $classifysecond :=
  cts:classify($secondhalf, $classifier, 
        <options xmlns="cts:classify"/>,
        $firsthalf)
return
cts:thresholds($classifysecond, $secondlabels)
(: 
   This returns the computed thresholds for the second half of 
   the plays in a Shakespeare database, based on a classifier
   trained with the first half of the plays.  For example:

<thresholds xmlns="http://marklogic.com/cts">
  <class name="TRAGEDY" threshold="0.221948" precision="1" 
         recall="0.666667" f="0.8" count="3"/>
  <class name="COMEDY" threshold="0.114389" precision="0.916667" 
         recall="1" f="0.956522" count="11"/>
  <class name="HISTORY" threshold="0.567648" precision="1" 
         recall="1" f="1" count="4"/>
</thresholds>
:)
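As a sketch of the next step, the per-class thresholds reported by cts:thresholds can be copied into an options element for cts:classify. The helper local:threshold-options is hypothetical, and the sample input is copied from the output shown above.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: copies each per-class threshold reported by
   cts:thresholds into an options element in the cts:classify namespace. :)
declare function local:threshold-options(
  $thresholds as element(cts:thresholds)
) as element()
{
  <options xmlns="cts:classify">
    <thresholds>{
      for $class in $thresholds/cts:class
      return
        <threshold class="{$class/@name}">{
          fn:string($class/@threshold)
        }</threshold>
    }</thresholds>
  </options>
};

local:threshold-options(
  <thresholds xmlns="http://marklogic.com/cts">
    <class name="TRAGEDY" threshold="0.221948" precision="1"
           recall="0.666667" f="0.8" count="3"/>
    <class name="COMEDY" threshold="0.114389" precision="0.916667"
           recall="1" f="0.956522" count="11"/>
    <class name="HISTORY" threshold="0.567648" precision="1"
           recall="1" f="1" count="4"/>
  </thresholds>)
```

The resulting element has one <threshold class="..."> child per class and can be passed as the $options parameter of cts:classify in place of the empty options element used in the example.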

cts:train(
$training-nodes as node()*,
$labels as element(cts:label)*,
[$options as element()?]
)  as  element(cts:classifier)?
Summary:

Produces a set of classifiers from a list of labeled training documents.

Parameters:
$training-nodes : The sequence of training nodes. These are nodes that represent members of the classes.
$labels : A sequence of labels for the training nodes, in the order corresponding to the training nodes.
$options (optional): An XML representation of the options for defining the training parameters. The options node must be in the cts:train namespace. The following is a sample options node:
    <options xmlns="cts:train">
      <classifier-type>supports</classifier-type>
      <kernel>geodesic</kernel>
    </options> 

The cts:train options include:

<classifier-type>

A string defining the kind of classifier to produce, either weights or supports. The default is weights.

<kernel>

A string defining which function to use for comparing documents. The default is sqrt. Normalization (the values that end in -normalized) brings document vectors into the unit sphere, which may improve the mathematical properties of the calculations. Possible values are:

simple

Model documents as 1 or 0 for presence or absence of each term.

simple-normalized

Like simple, but normalized by square root of document length.

sqrt

Model documents using the square root of the term frequencies.

sqrt-normalized

Like sqrt, but normalized by the sum of the term frequencies.

linear-normalized

Model documents as the term frequencies normalized by the square root of the sum of the squares of the term frequencies.

gaussian

Compare documents using the Gaussian of the term frequencies. Requires a classifier-type of supports.

geodesic

Compare documents using the Riemann geodesic distance over term frequencies. Requires a classifier-type of supports.

<max-terms>

An integer defining the maximum number of terms to use to represent each document. If a positive number M is given, then the M most discriminating terms are used; other terms are dropped. The default is 0 (unlimited).

<max-support>

A double specifying the maximum influence a single training node can have. This parameter has a strong influence on performance. The default value of 1.0 should work well in most cases. Larger values mean greater sensitivity and may improve accuracy on small datasets, but give longer running times. Smaller values mean less sensitivity and better resistance to mis-classified documents, and shorter running times.

<min-weight>

A double specifying the minimum weight a term can have and still be considered for inclusion in the term vector. This parameter only applies to the term weight form of the classifier. Smaller values mean longer term vectors and as a consequence longer running times and greater memory consumption during classification, but may also improve accuracy. The default is 0.01.

<tolerance>

How close the final solutions to the constraint equations must be. Smaller values lead to a greater number of iterations and longer running times. Larger values lead to less precise classification. The default is 0.01.

<epsilon>

How close a value must be to 0 to be counted as equal to 0. Since double arithmetic is not precise, setting this value to exactly 0 will likely lead to non-convergence of the algorithm. Smaller values lead to a greater number of iterations and longer running times. Larger values lead to less precise classification. The default is 0.01.

<max-iterations>

The maximum number of iterations of the constraint satisfaction algorithm to run. The algorithm usually converges very quickly, so this parameter usually has no effect unless it is set very low. The default is 500.

<thresholds>

A definition of the thresholds to use in classification. This is a complex element with one or more <threshold> children. You can specify both a global value and per-class values (as computed from cts:thresholds). The global value will apply to any classes for which a per-class value is not specified. For example:
    <options xmlns="cts:train">
      <thresholds>
        <threshold>-1.0</threshold>
        <threshold class="Example 1">-2.42</threshold>
      </thresholds>
    </options>
    

For the initial tuning phase of training your data, leave this parameter at its default value, which is a very large negative number (-10E30). This allows you to accurately compute the threshold values when you run cts:thresholds on the initial training data. You can then use the calculated threshold values when you run the second pass through the second part of your training data.

The options element also includes indexing options in the http://marklogic.com/xdmp/database namespace. These control which terms to use. Note that the use of certain options, such as fast-case-sensitive-searches, will not impact final results unless the term vector size is limited with the max-terms option. Other options, such as phrase-throughs, will only generate terms if some other option is also enabled (in this case fast-phrase-searches).

These database options include the following (shown here with a db prefix to denote the different namespace, as declared in the example below):

<db:word-searches>

Include terms for the words in the node.

<db:stemmed-searches>

Include terms for the stems in the node.

<db:fast-case-sensitive-searches>

Include terms for case-sensitive variations of the words in the node.

<db:fast-diacritic-sensitive-searches>

Include terms for diacritic-sensitive variations of the words in the node.

<db:fast-phrase-searches>

Include terms for two-word phrases in the node.

<db:phrase-throughs>

If phrase terms are included, include terms for phrases that cross the given elements.

<db:phrase-arounds>

If phrase terms are included, include terms for phrases that skip over the given elements.

<db:fast-element-word-searches>

Include terms for words in particular elements.

<db:fast-element-phrase-searches>

Include terms for phrases in particular elements.

<db:element-word-query-throughs>

Include terms for words in sub-elements of the given elements.

<db:fast-element-character-searches>

Include terms for characters in particular elements.

<db:range-element-indexes>

Include terms for data values in specific elements.

<db:range-element-attribute-indexes>

Include terms for data values in specific attributes.

<db:one-character-searches>

Include terms for single characters.

<db:two-character-searches>

Include terms for two-character sequences.

<db:three-character-searches>

Include terms for three-character sequences.

<db:trailing-wildcard-searches>

Include terms for trailing wildcards.

<db:fast-element-trailing-wildcard-searches>

If trailing wildcard terms are included, include terms for trailing wildcards by element.

<db:fields>

Include terms for the defined fields.
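The following sketch shows how the db prefix might be declared on the options element and combined with the cts:train options. It assumes that the boolean-style database options take true or false as element content, as they do in the database configuration; that content format is an assumption and is not confirmed by this page. The max-terms value of 1000 is arbitrary.

```xquery
xquery version "1.0-ml";

let $firsthalf :=
  fn:subsequence(xdmp:directory("/shakespeare/plays/", "1"), 1, 19)
let $labels :=
  for $x in $firsthalf
  return
    <cts:label>
      <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                        //playtype/text()}"/>
    </cts:label>
return
  cts:train($firsthalf, $labels,
    <options xmlns="cts:train"
             xmlns:db="http://marklogic.com/xdmp/database">
      <classifier-type>weights</classifier-type>
      <!-- Arbitrary limit, so that the choice of terms affects results -->
      <max-terms>1000</max-terms>
      <!-- Assumed content format: true or false -->
      <db:word-searches>true</db:word-searches>
      <db:fast-case-sensitive-searches>true</db:fast-case-sensitive-searches>
      <db:fast-phrase-searches>true</db:fast-phrase-searches>
    </options>)
```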

Usage Notes:

The elements in the label sequence should match one for one with the nodes in the training node sequence. The first label element describes the first node in the training node sequence, the second label element describes the second node in the training node sequence, and so on. If there are more labels than training nodes or more training nodes than labels, an error is raised.

The format of each label element is:

  <cts:label name="Node1">
    <cts:class name="Example1"/>
    <cts:class name="Example2" val="-1"/>
        :   :
  </cts:label>

Each class listed indicates whether the corresponding node in the training sequence is in the given class. Examples are taken to be positive examples unless specified otherwise (with a val attribute of -1). The document is assumed to be a negative example of any classes that are not explicitly listed. The name attribute on the label element is an optional name for the labelled node. It is purely for human consumption to help in tuning the classification parameters.
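As an illustration of this format, the following sketch builds a label with one positive and one explicitly negative class. The helper local:make-label is hypothetical, and the names Node1, Example1, and Example2 are taken from the format shown above.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: builds one label from an optional display name,
   the classes the node belongs to, and the classes for which it is an
   explicit negative example (val="-1"). :)
declare function local:make-label(
  $name as xs:string,
  $positive as xs:string*,
  $negative as xs:string*
) as element(cts:label)
{
  <cts:label name="{$name}">{
    for $c in $positive return <cts:class name="{$c}"/>,
    for $c in $negative return <cts:class name="{$c}" val="-1"/>
  }</cts:label>
};

local:make-label("Node1", "Example1", "Example2")
```

This returns a cts:label named Node1 with a positive Example1 class and an Example2 class carrying val="-1". When building labels this way, produce exactly one label per training node, in the same order as the nodes.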

Output Formats

A linear classifier is defined by a weight vector w on terms, and an offset value b. The <weights/> node encodes the weight vector directly. Its children are the classes, and each class includes a list of terms. The term node uses an internal id to identify the term and a term weight:

<weights>
  <class name="Example1" offset="2.04">
    <term id="43587329645324245" val="0.3423432"/>
    <term id="47893427895432534" val="-0.12345556"/>
      :                           :
  </class>
      :
</weights>

The weight vector w is a linear combination of the documents themselves, and it may be more convenient to express the classifier in this way. For instance, if the number of terms is not limited, the <weights/> node will be extremely large. The weight vector form may not be used if the classifier kernel is non-linear, that is, with the Gaussian or geodesic kernel.

The support vector representation of the classifier includes a supports node that has <class/> children for each class. Here the class elements contain a list of doc elements which identify the specific training nodes using an internal key. This internal key is valid across queries only for nodes in the database. Each doc element has an attribute encoding the weight of that document and an error attribute which shows how well the document fit the classifier. Large positive or negative errors (with a magnitude greater than about 1.5) indicate potentially mis-classified documents.

<supports>
  <class name="Example1" offset="2.04">
    <doc id="155584958759" name="Node102" val="-0.00334163" err="1.4"/>
    <doc id="594064848864" name="Node57" val="0.025341234" err="-2.3"/>
      :                             :
  </class>
      :    
</supports>

Each class is identified by a unique name.
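The error attribute can be used to review training data. The following sketch lists the doc entries whose error magnitude exceeds 1.5. The helper local:suspect-docs and the suspect result element are hypothetical, and the name tests use the *: wildcard so that the sketch does not depend on the namespace of the classifier elements. The sample input is the supports fragment shown above.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: reports training nodes whose err attribute has a
   magnitude greater than 1.5, per class. :)
declare function local:suspect-docs($classifier as element()) as element()*
{
  for $class in $classifier/descendant-or-self::*:supports/*:class
  for $doc in $class/*:doc[fn:abs(xs:double(@err)) gt 1.5]
  return
    <suspect class="{$class/@name}" doc="{$doc/@name}" err="{$doc/@err}"/>
};

local:suspect-docs(
  <supports>
    <class name="Example1" offset="2.04">
      <doc id="155584958759" name="Node102" val="-0.00334163" err="1.4"/>
      <doc id="594064848864" name="Node57" val="0.025341234" err="-2.3"/>
    </class>
  </supports>)
```

For the sample input, only Node57 (err of -2.3) is reported; Node102 (err of 1.4) is below the cutoff. The doc name attribute is populated only if the corresponding label carried a name.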


Example:
let $firsthalf := xdmp:directory("/shakespeare/plays/", "1")[1 to 19]
let $labels := for $x in $firsthalf 
         return
         <cts:label>
           <cts:class name="{xdmp:document-properties(xdmp:node-uri($x))
                //playtype/text()}"/>
         </cts:label>
return
cts:train($firsthalf, $labels, 
       <options xmlns="cts:train">
         <classifier-type>supports</classifier-type>
       </options>)

  =>  <cts:classifier>...