Good XML design and performance

by Evan Lenz

MarkLogic has always tried to ensure that well-designed XML performs well "as is" in MarkLogic Server. For example, if your schema uses descriptive, unique element names, your application code will not only be clean and readable but fast too. On the other hand, if your schema contains a lot of generic element names (such as "item") used in multiple ways, your code (in XQuery or XSLT) will be harder to read, and you might also need to do some extra legwork to get the best performance.

For example, consider a schema that has a lot of elements named <group> (or <section> or <item> or some other generic name) that play very different roles, indicated in this case by the value of an attribute:

<doc>
  <group type="widget">
    <item type="sprocket">...</item>
    ...
  </group>
  <group type="employee">
    <item type="executive">...</item>
    ...
  </group>
  <group type="place">
    <item type="city">...</item>
    ...
  </group>
</doc>

Since MarkLogic indexes elements by their name, it is not automatically going to make a distinction between the various <group> elements you have, because they have the same name. That said, certain queries will still run maximally fast, such as when you want to restrict your results to a particular attribute value, using a simple XPath expression like this: //group[@type eq 'widget']. MarkLogic Server will use its Universal Index to avoid reading any documents that don't have a <group> element whose "type" attribute is equal to "widget". So we're okay so far.

But there are still a few issues here. For one thing, your code will not be very readable. This expression:

//group[@type eq 'widget']/item[@type eq 'sprocket']

is pretty noisy compared to, for example:

//widgets/sprocket

which is what your code would look like if you used more descriptive element names.
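
To make the comparison concrete, here is a minimal sketch that runs both expressions against small sample documents. It uses Python's standard-library ElementTree rather than MarkLogic, so it only illustrates readability and says nothing about index behavior. The sample content (s1, g1, e1) and the helper name texts are made up for illustration, and ElementTree's XPath subset writes the predicate with = instead of eq.

```python
import xml.etree.ElementTree as ET

# Sample documents in the two styles; the text content is invented for this sketch.
GENERIC = """
<doc>
  <group type="widget">
    <item type="sprocket">s1</item>
    <item type="gear">g1</item>
  </group>
  <group type="employee">
    <item type="executive">e1</item>
  </group>
</doc>
"""

DESCRIPTIVE = """
<doc>
  <widgets>
    <sprocket>s1</sprocket>
    <gear>g1</gear>
  </widgets>
  <employees>
    <executive>e1</executive>
  </employees>
</doc>
"""


def texts(xml_source, path):
    """Return the text of every element matched by an ElementTree path."""
    root = ET.fromstring(xml_source)
    return [element.text for element in root.findall(path)]


# The generic design needs an attribute predicate at every step.
generic_hits = texts(GENERIC, ".//group[@type='widget']/item[@type='sprocket']")

# The descriptive design selects the same data with plain element names.
descriptive_hits = texts(DESCRIPTIVE, ".//widgets/sprocket")

print(generic_hits, descriptive_hits)  # prints ['s1'] ['s1']
```

Both paths select the same sprocket; only the second one reads like the data it describes.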

The other issue is that you may run into some problems when you want to start doing more advanced things, like word search in subsets of your documents. Specifically, if you want to restrict your search results to all group elements except widget groups, that will be challenging. (Fields can help you do the converse, but in that case you may have to enumerate all the ones you are interested in getting results for.)

Another issue with the above design is that, despite the potential benefit of being data-driven and extensible, it is not possible to apply schema constraints that are unique to specific classes of <group> elements (at least in W3C XML Schemas). You can't, for example, restrict the content of a <group> element to <sprocket> and <gear> elements only when its type attribute is "widget". If you want different content models, then you need to use different element names. Starting off with generic <group> elements may lead you down a slippery slope. You'll find yourself using other generic names like "item", and even then you won't be able to effectively restrict the "type" values to only the applicable ones.

Here's what an arguably better (and more readable) schema design would look like:

<doc>
  <widgets>
    <sprocket>...</sprocket>
    ...
  </widgets>
  <employees>
    <executive>...</executive>
    ...
  </employees>
  <places>
    <city>...</city>
    ...
  </places>
</doc>
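
If you already have content in the generic form, one way to get to the descriptive form is a small conversion step before loading. The following is only a sketch under stated assumptions: it uses Python's ElementTree, it assumes every <group> and <item> carries a type attribute whose value is a legal element name, and the CONTAINER_NAMES mapping and the to_descriptive function are hypothetical names invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a group's type value to a plural container name.
CONTAINER_NAMES = {"widget": "widgets", "employee": "employees", "place": "places"}


def to_descriptive(root):
    """Rename generic <group>/<item> elements in place, using their type attributes."""
    # Collect the groups first so renaming does not disturb the iteration.
    for group in list(root.iter("group")):
        group_type = group.attrib.pop("type")
        for item in group.findall("item"):
            # The item's type value becomes its element name.
            item.tag = item.attrib.pop("type")
        group.tag = CONTAINER_NAMES[group_type]
    return root


source = ET.fromstring(
    '<doc>'
    '<group type="widget"><item type="sprocket">s1</item></group>'
    '<group type="place"><item type="city">c1</item></group>'
    '</doc>'
)
converted = ET.tostring(to_descriptive(source), encoding="unicode")
print(converted)
```

A type value missing from the mapping raises a KeyError here, which is probably what you want in a one-off migration: it tells you about a class of <group> you have not named yet.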

To conclude, there are lots of good reasons to use descriptive, unique element names whenever possible. Doing so plays nicely with human readers, XQuery, XSLT, XML Schemas, and MarkLogic Server.

Comments

  • Would someone be able to tell me which programming language works better with MarkLogic, .NET or Java? Is there any link or article available online that I can check?
  • On the other hand consider div[contains(concat(' ', @class, ' '), ' widget ')]/..., the idiom common for processing XHTML, a moderately popular XML-based dialect...
    • XHTML? Never heard of it... ;-) Seriously though, you make a good point. You won't always have the option of using this design pattern (hence "whenever possible"). The predicate in the expression you wrote won't be fully resolved from MarkLogic's indexes. All that means is that some "filtering" will be required (checking inside the documents to ensure the constraint is met rather than knowing from the indexes alone). So it will work; it just won't be as automatically fast. What you do from there depends on various factors, including how many documents need to be searched. If it's a relatively small number, then it may not be an issue at all; MarkLogic's caching will make this query much faster than if, say, you were reading and parsing XML documents off the file system. But if you're dealing with millions of documents, you'll probably want to do some content processing to ensure the relevant data is indexed.
      • Probably we should have added an htmlclass() function to avoid the need for the spaces and to make this probably very common case easier both for people to write and for optimisers. The pattern is fine otherwise for other reasons, of course.
        • I think this also works (slightly longer but a bit simpler): div[tokenize(normalize-space(@class),' ') = 'widget']