The Pros and Cons of XML For Documents
In May 2006, ei Magazine published The Pros and Cons of XML For Documents (vol 2, iss. 10). The text of this article has been reproduced below.
Introduction
The XML data format encodes documents in a way that makes them amenable to efficient and flexible exploitation through automated processes (such as extraction and re-use of titles to build a contents list). But not all documents are suitable for representation in XML. Some of the potential benefits introduce complexities and costs that need to be understood in order to assess the pros and cons of possible XML adoption. In particular, some content management systems offer advanced capabilities for XML documents, but what are these features, and does your business case justify their cost?
Why XML For Documents?
The original purpose of XML, and SGML before it, was to divide the content of documents into meaningful sub-components, arranged in sequential and hierarchical structures that could be easily explored, extracted, re-ordered and formatted for delivery to a variety of publishing media and audiences. Unlike formatting-oriented languages, such as HTML, there are no pre-defined tags. Instead, a set of appropriate tags is devised for each document type, and these tags focus on meaning instead of formatting, thus facilitating intelligent querying of the content. Figure 1 shows how a simple memo might be encoded using XML tags, where <from> and <to> tags unambiguously identify the names of the sender and the recipient.
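As a minimal sketch of the kind of memo described above, the following Python fragment parses a small XML memo and pulls out the sender and recipient. The memo text, the <memo>, <subject> and <para> tags and the function name are assumptions made for illustration; only <from>, <to> and <emph> are taken from the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo: everything except the <from>, <to> and <emph> tag
# names is invented for this example.
MEMO = """<memo>
  <from>J. Smith</from>
  <to>A. Jones</to>
  <subject>Budget review</subject>
  <para>Please send the figures by <emph>Friday</emph>.</para>
</memo>"""


def sender_and_recipient(xml_text):
    """Return the (sender, recipient) pair named in a memo document."""
    root = ET.fromstring(xml_text)
    return root.findtext("from"), root.findtext("to")


print(sender_and_recipient(MEMO))
```

Because the tags describe meaning rather than appearance, the query does not need to guess which line of the memo holds which name.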
While not all documents are suitable for XML encoding, those that can conform to tightly controlled narrative structure templates are certainly good candidates. For example, XML is eminently suitable for such document types as reports, journal articles and reference books, user guides and training manuals, and less obviously even for poetry, concert programmes, long forms, simple catalogues, and many other types of 'document'.
Unfortunately, the enthusiasm with which XML has been adopted as a powerful data format for carrying highly structured data between databases and software applications has tended to overshadow these document-focused origins. Only with the recent release of word-processors that include 'save as XML' options has its original purpose begun to reassert itself. Yet even that step has promoted a misleading account of the purpose of XML. Microsoft Word, for example, uses XML as a replacement for its RTF format, and it therefore includes XML tags for each of the styles that Word can produce. In contrast, Figure 1 showed the use of <emph> tags, rather than explicit formatting tags such as <bold> and <underline>, because the decision on how to render the content can (and often should) be made later. More significant would be the need to distinguish between, say, foreign words, proper names and emphasised text. XML can do this easily, as shown in Figure 2, regardless of whether or not the content of the different XML tags will ultimately be displayed in the same italic style.
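The distinction between meaning and rendering can be sketched in a few lines of Python. The tag names <foreign> and <name>, the sample sentence and the style table are assumptions for illustration; the point is only that three different meanings can share one italic rendering today while remaining separately searchable.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed house style: all three semantic tags happen to render as italic.
# Changing one entry later restyles that kind of content everywhere.
STYLE = {"emph": "i", "foreign": "i", "name": "i"}


def render_inline(elem):
    """Render an element's mixed content as an HTML fragment."""
    parts = [escape(elem.text or "")]
    for child in elem:
        html_tag = STYLE.get(child.tag)
        inner = render_inline(child)
        parts.append(f"<{html_tag}>{inner}</{html_tag}>" if html_tag else inner)
        parts.append(escape(child.tail or ""))
    return "".join(parts)


def find_by_meaning(elem, tag):
    """Return the text of every descendant carrying the given semantic tag."""
    return ["".join(e.itertext()) for e in elem.iter(tag)]


para = ET.fromstring(
    "<para>The <name>Titanic</name> had a certain "
    "<foreign>je ne sais quoi</foreign>, but it was <emph>not</emph> "
    "unsinkable.</para>"
)
print(render_inline(para))
print(find_by_meaning(para, "foreign"))
```

The rendered output no longer distinguishes the three cases, but the XML source still does, which is what makes a query such as 'list all foreign phrases' possible.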
Briefly, some of the other benefits of XML include:
  • open international standard - this is often a requirement, especially in government organisations, and the choice of tools from different vendors avoids dangerous vendor 'lock-in'
  • maturity - XML was born in 1998 to its parent, SGML (1986), which emerged from theories going back to the 1960s
  • popularity - XML has replaced SGML, and there are no other competing data formats with the same goals, so there are lots of books, tools and trained software developers to choose from
  • supporting standards - XML is supported by other standards, such as XSLT for converting XML to other formats for manipulation and presentation
When to Ignore XML
Even the most dedicated XML evangelist will admit that XML is not always the most appropriate solution to every problem. Indeed, there are times when it should not even be considered.
As indicated above, XML is most suited to structured narrative content, consisting of unpredictable sequences of such text structures as paragraphs, lists and tables. More highly structured data has a natural home in database systems. At the other extreme, highly designed material, such as a children's picture book or a brochure (typically created using a design-oriented DTP package like InDesign or QuarkXPress), may have no obvious structure at all. But note that, in both scenarios, XML may still play secondary roles in data capture and publishing processes.
Even when the documents are structurally suitable, if the value of the content is very low, and quality is not a goal, then the cost of XML implementation is unlikely to be worthwhile. In particular, short-lived content that will never need to be re-worked for secondary media or alternative audiences rarely warrants the extra costs associated with XML.
XML Document Models
Automated processing of XML documents can only work reliably if all of the documents are tagged in a consistent and predictable way. This requires a document model to be created, using a DTD (Document Type Definition) or a schema (the W3C standard or one of its competitors), and for each document to then be tested against that model. The big decision is whether to simply adopt a suitable industry standard model (if available), and perhaps tailor it to your needs, or to start from scratch with a new model crafted specifically for your document characteristics and for your functional requirements.
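The idea of testing a document against a model can be illustrated with a deliberately simplified stand-in. The sketch below is not a DTD or schema validator: it uses a hypothetical Python dictionary that lists which child elements each element may contain, and ignores order, repetition and attributes, all of which a real model would also constrain.

```python
import xml.etree.ElementTree as ET

# Toy document model (an assumption for illustration): for each element,
# the set of child elements it may contain.
MODEL = {
    "memo": {"from", "to", "subject", "para"},
    "from": set(),
    "to": set(),
    "subject": set(),
    "para": {"emph"},
    "emph": set(),
}


def validate(xml_text, model=MODEL):
    """Return a list of error messages; an empty list means 'conforms'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    errors = []
    for parent in root.iter():
        if parent.tag not in model:
            errors.append(f"unknown element <{parent.tag}>")
            continue
        for child in parent:
            if child.tag not in model[parent.tag]:
                errors.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
    return errors


good = "<memo><from>A</from><to>B</to><para>Hi <emph>there</emph></para></memo>"
bad = "<memo><from>A</from><bold>x</bold></memo>"
print(validate(good))
print(validate(bad))
```

In practice the check would be delegated to a validating parser working from the real DTD or schema, but the outcome is the same in kind: a document either conforms or produces a list of specific violations.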
It is also very important to get the model right first time. Changes made to the model after starting to create conformant XML documents can be expensive because the changes often affect the document tagging too.
A happy and important side effect of having a document model is that, when using an XML-sensitive word-processor, the model acts as an advanced template that authors are forced to comply with (though they may need to be placated by emphasising that the model merely 'guides' them). This enforced consistency automatically raises the perceived quality of the product.
Converting Content to XML Format
The decision to adopt XML often comes with an immediate headache: how to convert masses of existing material into XML format. Different strategies are available, depending upon the nature of the source content, the scale of the task, and any security and schedule requirements.
The problem with most other data formats is that they are not as tightly structured as XML, and therefore cannot be reliably converted to XML using fully automated processes. Costly manual checking, correction and enhancement steps are almost inevitable. It can be cheaper to hand over all the existing data to an offshore data conversion bureau (assuming there are no data security issues), but then it is important to ensure that quality standards are maintained through a water-tight service agreement that takes into account tagging and content quality rates, along with implementation of an effective sample checking process to ensure conformance with those quality standards.
New documents may be created in one of the popular word-processors, then converted to XML, either using a 'save as XML' option, if available, or by use of a specialised batch-processing conversion tool, which may be better at image handling (extracting them and creating references to them in the XML). An XSLT engine is then typically used to convert from the output document model to the required document model. Depending upon the complexity of the document model, it may then be necessary to use an XML-sensitive word-processor (see below) to correct the occasional mistake, or to add structures that could not be represented within the original word-processor.
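The conversion from the word-processor's output model to the required model is normally written in XSLT. Purely to show the shape of the mapping, here is a rough Python equivalent. The flat <p style="..."> input format, the style names and the target tag names are all assumptions, and real 'save as XML' output is considerably more complex.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from word-processor style names to target tags.
STYLE_TO_TAG = {"Heading1": "title", "Normal": "para", "ListBullet": "item"}


def convert(flat_xml):
    """Map style-tagged paragraphs to semantic tags.

    Returns the new tree plus the styles that had no mapping, so that an
    editor can review those paragraphs by hand.
    """
    source = ET.fromstring(flat_xml)
    target = ET.Element("section")
    unmapped = []
    for p in source.iter("p"):
        style = p.get("style")
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            unmapped.append(style)
            tag = "para"  # fall back to a plain paragraph
        ET.SubElement(target, tag).text = "".join(p.itertext())
    return target, unmapped


flat = (
    '<doc><p style="Heading1">Intro</p>'
    '<p style="Normal">Some text.</p>'
    '<p style="Quote">A quotation.</p></doc>'
)
section, needs_review = convert(flat)
print([child.tag for child in section], needs_review)
```

The list of unmapped styles corresponds to the 'occasional mistake' that still has to be corrected afterwards in an XML-sensitive word-processor.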
Creating Content in XML Format
It is usually much easier to originate content in XML format, rather than create it in another format and convert to XML later.
There are many XML-sensitive authoring tools to choose from, though only the professional XML word-processors should be considered for non-technical authors, as cheaper tools that are aimed at XML developers are not sufficiently intuitive, and would certainly meet justifiable author resistance. Some of these word-processors hide the XML from authors in an attempt to simplify the authoring experience, but (in my opinion) cause more user interface problems than they solve. Others take the sensible approach of bringing XML to the fore, as in the example screenshot in Figure 3. Of course, author training is required, but this generally takes no more than one day of instruction and practice.
Most of the high-end document management systems have integrated one or more of the best of these word-processors (such as Epic editor or XMetaL), though these word-processors also work as stand-alone tools, so are also ideal for remote, offline authoring.
However, XML word-processors are also relatively expensive. If the cost of purchasing them (along with author training costs) would be too high, then the alternative strategy would be to split authoring into two steps. Authoring can be done in any popular word-processor, then the content converted to XML (as discussed above), and completed by a small team of specially trained editors using an XML word-processor.
Content Management Issues
Unsurprisingly, a CMS does not have to be aware that it is handling an XML document for it to be able to offer the basic features of secure storage, workflow control, search and other retrieval options. But some content management systems are able to detect XML documents and offer advanced XML-specific functionality. So, what are the features to consider when shopping for a new CMS? It depends on the detailed business requirements, of course, but the following are some of the factors to consider when creating and archiving XML documents within a CMS.
Editorial System Factors
An XML document that claims to conform to a specific document model might have this claim tested as it is added to the content management system, and thereafter each time it is checked-in after amendment. If the document fails the validity check, it might be rejected or highlighted for further attention. To do this, the CMS must be able to read the XML file to find the information it needs to identify the appropriate document model, then find and use that model in order to validate the document.
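One common place for a document to declare its model is the DOCTYPE declaration. As a sketch of the first step a CMS might take at check-in, the following Python function reads that declaration using the standard library's expat binding. The sample documents and the file name memo.dtd are assumptions; documents that identify a schema by namespace or attribute would need a different lookup.

```python
import xml.parsers.expat


def find_doctype(xml_text):
    """Return (root name, public id, system id) from the DOCTYPE, or None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return found[0] if found else None


with_model = '<!DOCTYPE memo SYSTEM "memo.dtd"><memo><to>A. Jones</to></memo>'
without_model = "<memo><to>A. Jones</to></memo>"
print(find_doctype(with_model))
print(find_doctype(without_model))
```

The system identifier returned here is what the CMS would then resolve to a managed copy of the model before running the validity check.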
A similar issue arises if the CMS is configured to launch an XML word-processor whenever an XML document is checked-out. The CMS will pass the document to the word-processor, which will want to validate the document before displaying it to the author. If the required DTD or schema is also managed by the CMS, then the CMS will need to know this, and be able to copy out the latest version of the model to the location where the word-processor expects to find it.
There are also features of some XML documents that challenge content management systems, in particular the fact that XML-based documents are not always single data files. For example, image data is usually held in external data files that conform to a non-XML-based image data format (such as EPS, TIFF or GIF). Similarly, an XML document may contain references to other XML documents that are to be merged into the main text. A complete document might therefore be composed of a combination of files, and for the sake of operational simplicity this bundle may need to be managed as a single object. This might include automatic check-out or copy-out of images or sub-document files referenced from an XML document that is being checked-out.
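To manage such a bundle, the CMS first has to discover which files a document refers to. The sketch below assumes a hypothetical convention in which <image file="..."/> points at a picture and <include href="..."/> points at a sub-document; real document models use a variety of mechanisms (including entity references), so the lookup table would differ in practice.

```python
import xml.etree.ElementTree as ET

# Assumed conventions for this example only: which attribute on which
# element holds a reference to an external file.
REFERENCE_ATTRS = {"image": "file", "include": "href"}


def referenced_files(xml_text):
    """List the external files a document refers to, in document order."""
    root = ET.fromstring(xml_text)
    refs = []
    for elem in root.iter():
        attr = REFERENCE_ATTRS.get(elem.tag)
        target = elem.get(attr) if attr else None
        if target and target not in refs:
            refs.append(target)
    return refs


manual = """<manual>
  <include href="safety.xml"/>
  <section><image file="fig1.tif"/><image file="fig2.eps"/></section>
  <section><image file="fig1.tif"/></section>
</manual>"""
print(referenced_files(manual))
```

A CMS could use such a list to check out, or copy out, every dependent file along with the main document.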
Some content management systems allow an XML document to be 'shredded'. In its simplest form, this is a single-level process. For example, an XML document representing a book might be split into chunks at the chapter level. This is very useful if parts of a large document are regularly updated and authors want to work simultaneously on different chunks of the document. There are a few document management systems on the market that focus on XML (and SGML) content and provide all or most of the features mentioned here. In addition, they provide a more sophisticated version of the shredding feature, which is extended to all levels of the document structure hierarchy so that components at all levels can be versioned, locked, checked-out, and even shared with other documents (an edit made to a shared component automatically updates all of the documents that reference it).
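Single-level shredding can be sketched as follows. The <chunk-ref> placeholder element and the chunk identifiers are inventions for this example; a real CMS would use its own internal linking mechanism and would store each chunk as a separately versioned object.

```python
import xml.etree.ElementTree as ET


def shred(book_xml, chunk_tag="chapter"):
    """Split a document into a skeleton plus one standalone chunk per chapter.

    Returns (skeleton XML, {chunk id: chunk XML}). Each chapter in the
    skeleton is replaced by a hypothetical <chunk-ref id="..."/> element.
    """
    root = ET.fromstring(book_xml)
    chunks = {}
    for index, child in enumerate(list(root)):
        if child.tag != chunk_tag:
            continue
        chunk_id = f"{chunk_tag}-{len(chunks) + 1}"
        placeholder = ET.Element("chunk-ref", {"id": chunk_id})
        placeholder.tail = child.tail  # keep surrounding whitespace in place
        child.tail = None
        chunks[chunk_id] = ET.tostring(child, encoding="unicode")
        root[index] = placeholder
    return ET.tostring(root, encoding="unicode"), chunks


book = (
    "<book><title>Guide</title>"
    "<chapter><title>One</title></chapter>"
    "<chapter><title>Two</title></chapter></book>"
)
skeleton, chunks = shred(book)
print(skeleton)
print(chunks["chapter-2"])
```

Two authors can then lock and edit chapter-1 and chapter-2 independently, and the skeleton is used to reassemble the book for publication. Multi-level shredding applies the same idea recursively within each chunk.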
All content management systems maintain meta-data about their content items. This typically includes the document title, author name, date and time of creation, and any classification keywords. A CMS may allow mapping of meta-data to XML structures as an XML document is checked-out and checked-in. The benefits are avoidance of data duplication, and the simplicity of single-interface authoring of both data and meta-data.
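A mapping of this kind might look something like the sketch below, where a small table ties each CMS meta-data field to a location in the document. The field names, the paths and the <report> structure are assumptions for illustration, not the behaviour of any particular CMS.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from CMS meta-data fields to places in the document.
FIELD_PATHS = {"title": "front/title", "author": "front/author"}


def extract_metadata(xml_text):
    """On check-in: read meta-data values out of the document."""
    root = ET.fromstring(xml_text)
    return {field: root.findtext(path) for field, path in FIELD_PATHS.items()}


def inject_metadata(xml_text, metadata):
    """On check-out: write CMS-held values back into the document."""
    root = ET.fromstring(xml_text)
    for field, value in metadata.items():
        target = root.find(FIELD_PATHS[field])
        if target is not None:
            target.text = value
    return ET.tostring(root, encoding="unicode")


report = (
    "<report><front><title>Draft</title>"
    "<author>A. Writer</author></front><body/></report>"
)
print(extract_metadata(report))
print(extract_metadata(inject_metadata(report, {"title": "Final Report"})))
```

With the mapping in place, the author edits the title once, inside the document, and the CMS record follows.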
Although most content management systems allow specialised content editing tools to be either integrated or launched externally, they also typically have a built-in content editor. This would be a cheaper option. A CMS may offer a web-based form for creating new content, and may even stress the fact that the content will be stored in XML format. However, the XML generated will usually conform to a standard, very restricted model. It may not, for example, allow for a narrative structure that includes mixed sequences of paragraphs and lists, or inline tagging beyond simple bold/italic/underline styling. It is very rare for a built-in authoring system to rival the flexibility of true XML word-processors. Any built-in editor should therefore be tested for suitability against the document model.
Archive Retrieval Issues
XML data can be searched, like text-based documents, or queried, like database records, using languages similar to SQL. This dichotomy of access techniques reflects XML's nature as an intermediary between uncontrolled text and highly structured data.
Search technologies are invaluable tools for finding specific documents within a large archive. Of course, being text-based an XML document can be indexed by any search engine. However, the structural nature of XML documents allows for more refined searches to be performed. Some search technologies allow for 'zoning' of XML document content, where each zone represents the content of a specific pair of tags. The context within which a word or phrase is found can then be taken into consideration. For example, it becomes possible to find documents that contain an important word or phrase, but only when it appears in a summary or within titles.
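Zoned searching can be illustrated with a small in-memory archive. This is only a sketch of the principle: the document names, tag names and content are invented, and a real search engine would build an index per zone rather than re-parse each document for every query.

```python
import xml.etree.ElementTree as ET


def zone_text(root, zone):
    """Return the normalised text of every element with the given tag."""
    return [" ".join("".join(e.itertext()).split()) for e in root.iter(zone)]


def search(archive, phrase, zones):
    """Return ids of documents where the phrase occurs in one of the zones."""
    phrase = phrase.lower()
    hits = []
    for doc_id, xml_text in archive.items():
        root = ET.fromstring(xml_text)
        if any(
            phrase in text.lower()
            for zone in zones
            for text in zone_text(root, zone)
        ):
            hits.append(doc_id)
    return hits


archive = {
    "a.xml": "<report><title>Adopting XML</title><para>Costs matter.</para></report>",
    "b.xml": "<report><title>Budget</title><para>We might use XML later.</para></report>",
}
print(search(archive, "xml", ["title", "summary"]))
print(search(archive, "xml", ["para"]))
```

Both documents contain the word, but only the first mentions it in a title, so restricting the search to the title and summary zones filters out the passing reference.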
It can be useful to build new documents automatically from components of other documents. For example, a catalogue containing titles and summaries of the document archive might be needed. There are simple ways to achieve this when working with small collections of documents, using nothing more than a batch process with an XSLT stylesheet at its heart. But this approach would be inadequate for large collections of documents. This is where an XML database comes into its own, with its advanced retrieval features (typically based on the XPath standard, or on the more advanced XQuery language).
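For a small collection, the batch approach can be sketched in Python rather than XSLT. The <catalogue> and <entry> element names and the sample archive are assumptions; the path expressions use the limited XPath subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET


def build_catalogue(archive):
    """Build a catalogue of titles and summaries from a set of documents."""
    catalogue = ET.Element("catalogue")
    for doc_id in sorted(archive):
        root = ET.fromstring(archive[doc_id])
        entry = ET.SubElement(catalogue, "entry", {"source": doc_id})
        for tag in ("title", "summary"):
            # './/tag' finds the first matching element at any depth.
            ET.SubElement(entry, tag).text = root.findtext(f".//{tag}", default="")
    return catalogue


archive = {
    "b.xml": "<report><title>Beta</title><summary>Second</summary></report>",
    "a.xml": "<article><front><title>Alpha</title></front>"
             "<summary>First</summary></article>",
}
catalogue = build_catalogue(archive)
print(ET.tostring(catalogue, encoding="unicode"))
```

Every document is parsed in full on every run, which is why this technique does not scale to a large archive and an indexed XML database becomes attractive.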
Finally, regardless of how it is found, the content of an XML document is not suitable for direct display to anyone but an XML geek. The XML tags therefore need to be replaced by suitable formatting of their contents. A CMS may provide the means to generate a preview version of the document, often in HTML or PDF format (typically using a basic XSL-FO engine, for which a suitable XSLT stylesheet would need to be developed).
Conclusion
Hopefully, it has been shown that if the cons of XML are complexity and cost of setup, then the pros are quality, and the low cost and high speed of information reuse. Finally, while adoption of XML can bring economic and quality benefits, this can only be achieved if care is taken to implement XML with due consideration to the potential pitfalls. In particular, does the business case support the adoption of XML, is the potential document model appropriate, and have the most appropriate supporting tools been chosen?