The following opinion takes the form of a press release made during my time at Rubus:
XML DOCUMENT EDITING WITH WORD 2003
Microsoft Word and XML (the eXtensible Markup Language) are about to meet, despite some fundamental differences in their nature. Publishers and organisations that have document production departments should be interested in the fact that XML support in Word 2003 will be both elegant and useful, though expectations must not be set too high - this release will not quite make Word an efficient platform for XML document authoring.
The idea that XML is about documents at all will be a surprise to many. Isn't XML all about Web Services and other behind-the-scenes data exchange applications? In fact, the success of XML as a technology for exchanging data between software programs overlooks its history, in particular its derivation from older markup languages that have traditionally been targeted at the formatting or structuring of narrative texts (such as reports, manuals, books and catalogues). This older application of XML has not gone away, but has continued to see steady, unspectacular growth as publishers, along with other organisations that have serious document production departments, have seen the benefits of XML-based automation.
XML documents must, like other documents, be written by an author (unlike XML data, which is either entered into forms or automatically collected from instruments of various kinds), and long or complex documents can only be written efficiently with the assistance of word-processors. Specialised XML word-processors have existed since the late 1980s, but the big players are now starting to take notice. In particular, Microsoft, the vendor of the world's most popular word-processor, recently announced that it is going to support XML in Word 2003, which is expected to be released before the end of the year.
To those with a stake (or even just an interest) in promoting XML-based document production processes, this news was very exciting, because an XML-based approach has often been dismissed on the grounds that it necessitates a change of tools, procedures and skills. Objections such as "we have standardised on Word and cannot use anything else" or "we can't afford to replace all our copies of Word with this other, more expensive product", or even "the authors will just not use an unfamiliar word-processor" have unarguable force. Word 2003, it seemed, was going to overcome all of these obstacles.
So, is Word 2003 all that it seemed it might be with respect to lowering the barrier to adoption of XML in document production? Not quite. It is more of a promising first step. It does not compete head-to-head with specialised XML word-processors. Put briefly, Word 2003 (if the Beta release accurately reflects the final design) is a very adept tool for adding XML tags to existing documents, but is not suitable for authoring new XML-based documents or for adding new material (along with XML tagging) to an existing document.
At this point, it is necessary to explain an important distinction between two ways in which Word 2003 can work with XML, in order to refute an argument that Word can in fact be used as an effective XML authoring package. While it is true that a document can easily be authored in Word 2003 and then saved in XML format, the XML produced by this means simply conforms to the Word XML document model, which only describes the formatting that was applied to the document as it was created.
Those familiar with the RTF (Rich Text Format) will understand that this "default XML" is essentially RTF in XML clothes. XML that is this easy to create is inevitably less useful than "rich" XML structures that are intelligently hand-crafted for a particular purpose. In particular, it is not amenable to intelligent searching for meaningful fragments of a document (a significant letter, word, phrase, paragraph or larger unit), not already styled or otherwise marked in a unique manner, that may need to be emphasised, moved or extracted.
Of course, this default XML format is very useful. Like RTF, the saved file can be imported back into Word without loss of formatting information. In addition, Word can even be configured to save in XML format automatically, so the document author does not even have to be aware that the saved document is in XML format. Meanwhile, the saved document exists in a format that is amenable to software processing (perhaps using tools that support the XSLT standard).
Nevertheless, the distinction between XML used to describe the format of a document and XML used to describe the content of a document is very important. The real point of XML is that domain-specific document vocabularies can be created that define document types. A memo is very different to an invoice, and both are unlike a report. It is of course already possible, in Word as in most other word-processors, to develop specific stylesheets for such document types. XML simply takes the concept a significant step further, using a "schema" instead of a stylesheet to allow "custom tags" (rather than "custom styles") to be used in the document.
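The difference can be illustrated with a minimal Python sketch. Both XML fragments below are invented for illustration: the formatting-oriented one uses simplified, hypothetical element names (it is not actual Word output), and the content-oriented one uses a hypothetical invoice vocabulary. The function `invoice_total` is likewise an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Formatting-oriented markup: hypothetical, simplified names (not real Word output).
# It records how the text looks, not what the text means.
FORMAT_XML = """
<document>
  <para style="Normal"><run bold="true">Invoice 1042</run></para>
  <para style="Normal"><run>Total due: 250.00</run></para>
</document>
"""

# Content-oriented markup: a hypothetical invoice vocabulary with custom tags.
CONTENT_XML = """
<invoice>
  <number>1042</number>
  <total currency="GBP">250.00</total>
</invoice>
"""


def invoice_total(xml_text):
    """Return the invoice total as a float, or None if no <total> element exists."""
    root = ET.fromstring(xml_text)
    total = root.find(".//total")
    return float(total.text) if total is not None else None


print(invoice_total(CONTENT_XML))  # 250.0
print(invoice_total(FORMAT_XML))   # None
```

With the custom tags, software can ask for "the total" directly. With the formatting-oriented version, the same figure is only reachable by guessing from the wording or styling of a paragraph.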
An XML-sensitive word-processor should be able to read any schema, then guide the author so that a document that conforms to the selected schema will be accurately produced. Such documents can be intelligently processed by similarly targeted software, so that these processes can be fully automated to improve the efficiency of production. Professional XML-based word-processors focus on the use of XML to create self-describing document fragments that are amenable to such automated processing (they might format these fragments to assist the author, but all such formatting is discarded when the file is saved).
How does Word 2003 measure up to this more demanding use of XML? Well, it can certainly read a schema and show a list of allowed XML tags to the author. The author can select a tag from the list, applying it to a range of text (strictly speaking, this text is enclosed by a "start-tag" and an "end-tag", and the text and tags together comprise an XML "element"). Furthermore, Word 2003 can be configured to show only the tags that are allowed at a given location in the document structure, which is a significant level of control that stylesheets cannot match. The custom tags are saved along with the tags that Word uses to describe the document formatting, and therefore survive as the document is saved, closed, opened and edited over time (provided that only professional editions of Word are used throughout).
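The idea of offering only the tags that are valid at the current location can be sketched roughly as follows. This is not how Word implements it: the content model, the element names and the `allowed_tags` function are all hypothetical, and a real schema would also express ordering and repetition rules that this sketch ignores.

```python
# Hypothetical content model: each element name maps to the child
# elements it may contain. A real schema carries far more detail.
CONTENT_MODEL = {
    "report": ["title", "section"],
    "section": ["heading", "para"],
    "para": ["emphasis"],
}


def allowed_tags(path):
    """Return the tags that may be inserted inside the innermost element of `path`.

    `path` lists the open elements from the root down to the cursor position.
    """
    if not path:
        return ["report"]  # only the root element is allowed at the top level
    return CONTENT_MODEL.get(path[-1], [])


print(allowed_tags(["report", "section"]))  # ['heading', 'para']
print(allowed_tags(["report", "title"]))    # []
```

A stylesheet, by contrast, offers every style everywhere, which is why context-sensitive tag lists represent a level of control that styles alone cannot provide.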
This capability will be very interesting to organisations, such as publishers, who need to enhance documents produced elsewhere (perhaps by commissioned external authors). In this scenario, document authoring is distinct from document preparation for publication. One person writes the document using standard Word features, but another person edits the document and prepares it for later processes. Where editing currently includes the application of an in-house stylesheet to the document, in future it might consist of adding XML tags to the document instead. Word 2003 arguably matches or even surpasses professional XML-based word-processors when it comes to wrapping existing text in custom XML tags.
Yet the fact remains that, due to a single omission, Word 2003 cannot be considered an effective tool for the creation of new XML documents, or for adding to existing documents. Although tags can be selected for insertion into the document, and new text can then be entered within these tags, this new text is not styled. Of course, such formatting would be irrelevant to the XML document itself, once it is saved to a file (because it is the tags that matter for later processing), but authors do benefit from the immediate visual feedback that a formatted document provides. There is also nothing to prevent styling being manually applied afterwards, but this is inherently inefficient. It should not be necessary to both tag and style the text. Professional XML word-processors automatically apply an appropriate style to the content of each new element as soon as it is inserted into the document.
This weakness in Word 2003's ability is odd. How hard could it have been to map a newly inserted tag to a style in a stylesheet, so that text entered into the tag is formatted according to that style? Performance overhead has been cited as a possible reason, but it would not be strictly necessary for Word to continue to check that the content conforms to the given style afterwards, so performance should not be an issue. Perhaps third parties will add this functionality in the short term, and Microsoft may address the issue in the longer term, at which point Word will become a true XML word-processor.
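To show how small the missing piece is in principle, here is a minimal sketch of a tag-to-style mapping applied once, at insertion time, with no later checking. The style names, the mapping and the `insert_element` function are assumptions made for illustration and do not correspond to any Word API.

```python
# Hypothetical mapping from custom tag names to word-processor style names.
STYLE_FOR_TAG = {
    "title": "Heading 1",
    "heading": "Heading 2",
    "para": "Body Text",
}
DEFAULT_STYLE = "Normal"


def insert_element(document, tag, text=""):
    """Append a new element to `document`, choosing its style once, on insertion.

    The style is looked up a single time; nothing re-checks it afterwards,
    so there is no continuing cost as the author keeps typing.
    """
    element = {
        "tag": tag,
        "text": text,
        "style": STYLE_FOR_TAG.get(tag, DEFAULT_STYLE),
    }
    document.append(element)
    return element


doc = []
print(insert_element(doc, "heading", "Introduction")["style"])  # Heading 2
print(insert_element(doc, "note")["style"])                     # Normal
```

Because the lookup happens only when the element is created, the author remains free to restyle the text afterwards, and the tags (not the styles) still carry the meaning when the file is saved.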
Again, to avoid a common misunderstanding that often leads to the contrary conclusion: Word 2003 can in fact style the content of XML elements automatically, but only as an existing XML file is opened in Word. This feature does not help when creating new XML structures after the document has been opened, and it also requires a developer with in-depth knowledge of both XSLT (an XML transformation technology) and WordML (the default XML format discussed earlier).
Even if Word 2003 is not improved to become an effective XML authoring tool (by Microsoft or by a capable third party), it will usually have an important role to play in a sequence of document enhancement steps, provided that the emphasis is on enhancing existing documents. In a typical scenario, hundreds of external authors write documents in Word and send them to a publisher, a handful of in-house editors clean up the text and add XML tags using Word 2003, then one or two specialists use a professional XML authoring package to add new content and finalise the XML tagging. This approach will certainly be cheaper, and less disruptive, than choosing a professional XML word-processor for all in-house editors. More significantly, it also weakens the argument for avoiding or putting off adoption of XML.