Document formats, part 3: Changing formats
I ended the last installment with this question: What happens if you load a bunch of HTML documents as "text" and then realize you want to query them as XML (using the full power of XPath and XQuery)? In this article, I'll show you exactly how I solved this problem (and how you can too).
Before I go into the solution I used, it's worth mentioning the simplest approach. Simply re-load all the documents, but this time, use the format option you want (<format>xml</format>). xdmp:document-load() will replace the documents you loaded the first time, exchanging the old text documents for the new, desirable XML documents. That's the approach I'd generally recommend.
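As a rough sketch of that re-load, assuming a hypothetical file path and document URI (neither comes from my actual setup), the call might look like this:

```xquery
xquery version "1.0-ml";

(: Hypothetical filesystem path and database URI; substitute your own. :)
xdmp:document-load(
  "/space/html/index.html",
  <options xmlns="xdmp:document-load">
    <uri>/my-docs/index.html</uri>
    <!-- Force the document to be stored as XML rather than text -->
    <format>xml</format>
  </options>
)
```

If a file is not well-formed XML, a load like this can fail. The <repair> option of xdmp:document-load() is the place to look in that case.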
But what happens if you don't have the original documents around? Or you don't know where they came from? Or they came from disparate sources and it's been a long time since you loaded them? Wouldn't it be nice to just flip a switch and tell MarkLogic Server to turn these text documents into XML documents? Well, it's not quite as simple as that (because not all text can be trivially converted to XML), but there is a way.
The key thing to remember is that a document's format is not some property that's external to the document itself. The document's format is an emergent property of its content, if you will. In other words, when we talk about a document's format, all we're really talking about is what kind of node the root document node contains: an element node, a text node, or a binary node. If we want to convert a text document into an XML document, we need to parse the string-value of the text node as XML and replace the existing document with the newly parsed element tree.
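To make that concrete, here is a minimal sketch that reports a document's format by inspecting the child of its document node. The URI is a placeholder, and the labels are just illustrative strings:

```xquery
xquery version "1.0-ml";

(: Hypothetical URI; substitute one from your own database. :)
let $doc := fn:doc("/my-docs/index.html")
return
  if ($doc/element()) then "XML"          (: root is an element node :)
  else if ($doc/text()) then "Text"       (: root is a single text node :)
  else if ($doc/binary()) then "Binary"   (: root is a binary node :)
  else "Other"                            (: missing document, or another node kind :)
```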
In my case, all the docs I had loaded were in the same directory, but not all of those files were HTML docs. The first thing I did was list all the file extensions that appear in that directory, using this query:
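A query of this kind might look like the sketch below. The directory URI "/my-docs/" is a placeholder, and the regular expression is just one way to pull out the extension:

```xquery
xquery version "1.0-ml";

(: Hypothetical directory URI; "infinity" includes all subdirectories. :)
let $docs := xdmp:directory("/my-docs/", "infinity")
let $uris := for $doc in $docs return xdmp:node-uri($doc)
return
  fn:distinct-values(
    for $uri in $uris
    (: Skip URIs whose last path step has no extension :)
    where fn:matches($uri, "\.[^./]+$")
    (: Keep only the trailing ".ext" part of the URI :)
    return fn:replace($uri, "^.*(\.[^./]+)$", "$1")
  )
```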
This yielded the following list:
From a quick look here, I could see that the only files I was interested in were the ".html" files, so I could constrain my relevant $docs down further:
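One way to express that constraint, again assuming the placeholder directory "/my-docs/", is a predicate on each document's URI:

```xquery
xquery version "1.0-ml";

(: Keep only documents whose URI ends in ".html" (hypothetical directory URI). :)
let $docs := xdmp:directory("/my-docs/", "infinity")
               [fn:ends-with(xdmp:node-uri(.), ".html")]
return fn:count($docs)
```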
The next step was to figure out how many of these were already XML and how many were still just text documents:
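A sketch of such a count, under the same assumptions as before (placeholder directory URI, ".html" filter), tests what kind of node each document node contains:

```xquery
xquery version "1.0-ml";

let $docs := xdmp:directory("/my-docs/", "infinity")
               [fn:ends-with(xdmp:node-uri(.), ".html")]
return (
  (: Document nodes with an element child are XML documents :)
  fn:concat("XML: ", fn:count($docs[element()])),
  (: Document nodes with a text child are text documents :)
  fn:concat("Text: ", fn:count($docs[text()]))
)
```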
This showed that all of them were still text documents:
Thus, my final goal was to get "XML" to say 651 and "Text" to say 0.
If the text of all of these documents happened to contain well-formed XML, then all I needed to do was parse them using xdmp:unquote(), and replace the existing document with the newly parsed XML document. But before I did that, I wanted to make sure this was going to work:
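A dry run along these lines parses every document without writing anything back. This is a sketch with the placeholder directory used above, not necessarily the exact test I ran:

```xquery
xquery version "1.0-ml";

let $docs := xdmp:directory("/my-docs/", "infinity")
               [fn:ends-with(xdmp:node-uri(.), ".html")]
return
  (: Parse each text document as XML; the query raises an error
     at the first document that is not well-formed. :)
  fn:count(
    for $doc in $docs
    return xdmp:unquote(fn:string($doc))
  )
```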
This yielded an error, complaining about the presence of an undefined entity reference (&nbsp;) in one of my HTML docs. So I knew I was going to need some clean-up or repair. Luckily, MarkLogic Server provides a lot of tools for doing just that. First of all, xdmp:unquote() takes a list of options as its third argument (the second argument is a default namespace to apply to the result):
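For illustration, a call with all three arguments might look like the following sketch. The URI is a placeholder, and the empty string means no default namespace is applied:

```xquery
xquery version "1.0-ml";

(: Hypothetical URI of one of the text documents. :)
let $text := fn:string(fn:doc("/my-docs/index.html"))
return
  (: args: string to parse, default namespace, parse options :)
  xdmp:unquote($text, "", "repair-full")
```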
The "repair-full" option doesn't repair all kinds of potential well-formedness errors, but it does a good job with things like detecting missing end tags and inserting them as necessary. Here's the final script I used:
Basically, it attempts to parse the text as is. If that fails, it retries the parse with the "repair-full" option. It then stores the parsed result in the database at the same URI, replacing the original.
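The logic just described can be sketched as follows. This is a reconstruction under assumptions (placeholder directory URI, ".html" filter, and a guard I added against inserting empty documents), not a verbatim copy of the script:

```xquery
xquery version "1.0-ml";

(: Select the ".html" documents that are still stored as text. :)
let $docs := xdmp:directory("/my-docs/", "infinity")
               [fn:ends-with(xdmp:node-uri(.), ".html")]
               [text()]
for $doc in $docs
let $uri := xdmp:node-uri($doc)
let $text := fn:string($doc)
let $parsed :=
  (: First try a strict parse; fall back to repair if it fails.
     If the repair also fails, the error aborts the whole query. :)
  try { xdmp:unquote($text) }
  catch ($e) { xdmp:unquote($text, "", "repair-full") }
return
  (: Only replace the document if parsing produced an element tree. :)
  if ($parsed[1]/element())
  then xdmp:document-insert($uri, $parsed[1])
  else ()
```

The guard on the last step is a deliberate choice: documents that parse to nothing are skipped rather than overwritten.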
WARNING: Be sure to back up your documents before trying anything like this (programmatically replacing documents). In practice, this involved a lot of trial and error before I no longer had any issues, and on more than one occasion I ended up inadvertently inserting an empty document.
The above script was sufficient for some of the directories I needed to update. In other directories I ran into further problems and had to clean up the text with calls to fn:replace() before calling xdmp:unquote(). Another option is xdmp:tidy(), which provides the power of the popular HTML Tidy tool for converting HTML to well-formed XML.
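Both options can be sketched briefly. The substitution shown here (swapping the HTML-only entity &nbsp; for a numeric character reference) is only an illustrative example of a pre-parse clean-up, and the URI is a placeholder:

```xquery
xquery version "1.0-ml";

(: Hypothetical URI of a text document containing HTML. :)
let $text := fn:string(fn:doc("/my-docs/index.html"))

(: Option 1: patch the text with fn:replace(), then parse it.
   "&amp;nbsp;" is the XQuery string literal for the characters &nbsp; :)
let $cleaned := fn:replace($text, "&amp;nbsp;", "&amp;#160;")
let $via-unquote := xdmp:unquote($cleaned, "", "repair-full")

(: Option 2: let HTML Tidy do the work. xdmp:tidy() returns a status
   node followed by the cleaned-up document, so take the second item. :)
let $via-tidy := xdmp:tidy($text)[2]

return ($via-unquote, $via-tidy)
```

Note that xdmp:tidy() produces XHTML, so the resulting elements are in the XHTML namespace, which matters for any XPath written against them later.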
After running the final version of my script, I tested the results to make sure they were all XML now, and no text documents were left:
That's exactly the output I wanted to see! All 651 docs were converted from text to XML.