InCopy XML Converter
The InCopy XML Converter is a Java class that converts InCopy documents into clean, simple, accessible XML documents (see below for a link to download a trial version). It is designed to work in an automated, batch-processing environment and works on any InCopy document, but it provides the most value with documents that have had stylesheets applied to them.
This tool also works for InDesign documents, but requires each story to be saved as an InCopy file (though InCopy itself is not needed). Even if it were not necessary, this would often be a reasonable step to perform, because it isolates each article within a complex, multi-article page design and gives it its own file name.
Why is it Needed?
In principle, this tool should not be needed, because InCopy documents are already XML files. Also, both InCopy and InDesign include the ability to manually add custom XML tags to documents and then export a pure, content-oriented XML output file. However, there are problems with both of these solutions:
- Although native InCopy documents (".incx" files) are already XML documents, the XML tags in those documents do not capture the document structure. While converting InCopy documents to other XML formats is theoretically possible using XSLT, in practice XSLT is just not up to the task of interrogating the source data at the character level required. A genuine programming language is required (hence the need for this Java tool).
- The XML export feature takes time and effort to use and has limitations (some of them severe).
The limitations of the built-in XML options and the strengths of this new approach are discussed in more detail below.
Limitations of InCopy XML Format
An InCopy file is already an XML document, as is immediately clear if you open one in a text editor. With other products that have a native XML format, it is usually a simple matter to convert the files to another XML format using nothing more than an XSLT stylesheet. In an InCopy file, however, the XML elements are not used to identify document structures, but only to mark the start and end of each range of styled text. Paragraph breaks are identified by special characters within the text rather than by elements. This makes it very difficult to create a sensible XML output structure. For example, consider the following InCopy XML fragment:
<txsr prst="o_uc5" crst="o_u5c"><pcnt>c_This is a normal but non-indented paragraph, as usual immediately after a heading. There follows a random (bulleted) list.?</pcnt> </txsr>
<txsr prst="o_uca" crst="o_u5c"> <pcnt>c_first bullet item?second bullet item?third bullet item?</pcnt> </txsr>
<txsr prst="o_uc3" crst="o_u5c"> <pcnt>c_Normal paragraph.?The following list has two paragraphs in the middle item:?</pcnt> </txsr>
<txsr prst="o_uca" crst="o_u5c"> <pcnt>c_list item consisting of a single paragraph?list item consisting of two paragraphs?</pcnt> </txsr>
<txsr prst="o_uc4" crst="o_u5c"> <pcnt>c_Second paragraph in bullet list item?</pcnt> </txsr>
<txsr prst="o_uca" crst="o_u5c"> <pcnt>c_final item in list?</pcnt> </txsr>
<txsr prst="o_uc3" crst="o_u5c"> <pcnt>c_This is a normal paragraph.?</pcnt> </txsr>
Try to see how the fragment above maps to the following output of the conversion tool (it is far from obvious):
<Para Class='Paragraph Not Indented'>This is a normal but non-indented paragraph, as usual immediately after a heading. There follows a random (bulleted) list.</Para>
<Para Class='List Bullet'>first bullet item</Para>
<Para Class='List Bullet'>second bullet item</Para>
<Para Class='List Bullet'>third bullet item</Para>
<Para Class='Paragraph'>Normal paragraph.</Para>
<Para Class='Paragraph'>The following list has two paragraphs in the middle item:</Para>
<Para Class='List Bullet'>list item consisting of a single paragraph</Para>
<Para Class='List Bullet'>list item consisting of two paragraphs</Para>
<Para Class='Paragraph In List'>Second paragraph in bullet list item</Para>
<Para Class='List Bullet'>final item in list</Para>
<Para Class='Paragraph'>This is a normal paragraph.</Para>
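The mapping between the two listings can be illustrated with a minimal Java sketch. This is not the converter's actual code, only an approximation of the idea: each styled run is split at its paragraph-break characters, and every piece becomes its own Para element. The class name StyledRunSplitter, the hard-coded style lookup (inferred from the two listings above) and the use of "?" as the paragraph-break character (as it appears in the fragment) are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StyledRunSplitter {

    // Hypothetical lookup from style IDs to style names, inferred from the sample listings.
    static final Map<String, String> STYLE_NAMES = Map.of(
        "o_uc5", "Paragraph Not Indented",
        "o_uca", "List Bullet",
        "o_uc3", "Paragraph",
        "o_uc4", "Paragraph In List");

    // Turns a sequence of <txsr> runs into one <Para> line per paragraph.
    static List<String> toParas(String fragment, String paraBreak) {
        try {
            // The fragment has no single root element, so wrap it before parsing.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            NodeList runs = doc.getElementsByTagName("txsr");
            List<String> paras = new ArrayList<>();
            for (int i = 0; i < runs.getLength(); i++) {
                Element run = (Element) runs.item(i);
                String styleId = run.getAttribute("prst");
                // Fall back to the raw style ID when no name is known.
                String styleName = STYLE_NAMES.getOrDefault(styleId, styleId);
                NodeList contents = run.getElementsByTagName("pcnt");
                for (int j = 0; j < contents.getLength(); j++) {
                    String text = contents.item(j).getTextContent();
                    if (text.startsWith("c_")) {
                        text = text.substring(2); // drop the "c_" prefix seen in the sample
                    }
                    // One run may hold several paragraphs; split at each break character.
                    for (String piece : text.split(Pattern.quote(paraBreak))) {
                        if (!piece.isEmpty()) {
                            paras.add("<Para Class='" + styleName + "'>" + escape(piece) + "</Para>");
                        }
                    }
                }
            }
            return paras;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    // Minimal escaping so the text stays valid inside an element.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String fragment =
            "<txsr prst=\"o_uca\" crst=\"o_u5c\"><pcnt>c_first bullet item?second bullet item?</pcnt></txsr>"
            + "<txsr prst=\"o_uc3\" crst=\"o_u5c\"><pcnt>c_Normal paragraph.?</pcnt></txsr>";
        for (String para : toParas(fragment, "?")) {
            System.out.println(para);
        }
    }
}
```

The sketch assumes that every run ends at a paragraph break. It does not handle a paragraph that continues across several runs, nor tables, images or footnotes.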
Limitations of InCopy XML Export Feature
Due to the limitations of the native InCopy XML format discussed above, InCopy (and InDesign) can also export content as clean and simple XML, but the InCopy/InDesign author has to follow a tedious procedure to achieve this. The author has to:
- import the tags (from file) if not already in the template
- select the text to be tagged
- either select "Map Styles To Tags" (but only if the names of styles and elements match, which is convenient but creates unfriendly styles, such as "ParaIndented" instead of the preferable "Paragraph Indented"), or individually select each paragraph and then the tag to be applied to it (which is very slow)
- select the root element
- select "Export XML"
- name the XML file to be created
There are also tagging limitations, including:
- no footnote tagging (in fact, due to a bug, exporting fails if footnotes are even present)
- tables are not in XHTML format (they are in a format that is very hard to decipher, especially if cell spanning is used within a table)
- image references are not in XHTML format
- lists are not identified and wrapped in list-encompassing tags
- list items cannot contain more than a single text block (secondary and embedded lists are not recognised)
Benefits of InCopy Converter
The InCopy converter program was created to provide an approach that:
- is fast and efficient to use (hands-off batch conversions)
- reliably handles tables, complex lists, image references and footnotes
- includes a customisation feature that removes the need for post-processing of the XML output
InCopy Converter is a single Java class file that performs the task of reading an InCopy XML file and outputting a "clean" new XML file.
Of course, any competent programmer can reproduce the task performed by this tool. But note that parsing of InCopy XML is not a trivial exercise. In particular, due to the way that InCopy separately stores image references and tables from the main text, it is far from trivial to handle content that includes such things as images within table cells, or entire tables within table cells. The bizarre way that inline styles are marked also makes the formatting of table cell contents hard to interpret. Lack of source documentation also makes reverse engineering of these complexities necessary.
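To illustrate what a hands-off batch conversion might look like, here is a minimal Java sketch of a driver loop. It does not call the real converter, whose API is described only in its own documentation. The convert parameter is a hypothetical stand-in for that call, and the class name BatchRunner, the ".xml" output naming and the UTF-8 encoding are likewise assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BatchRunner {

    // Converts every ".incx" file in inDir and writes a matching ".xml" file to outDir.
    // 'convert' is a hypothetical stand-in for the real converter invocation.
    static List<Path> convertAll(Path inDir, Path outDir, Function<String, String> convert) {
        List<Path> written = new ArrayList<>();
        try (Stream<Path> files = Files.list(inDir)) {
            Files.createDirectories(outDir);
            List<Path> sources = files
                .filter(p -> p.getFileName().toString().endsWith(".incx"))
                .sorted()
                .collect(Collectors.toList());
            for (Path source : sources) {
                String name = source.getFileName().toString();
                String baseName = name.substring(0, name.length() - ".incx".length());
                Path target = outDir.resolve(baseName + ".xml");
                // Assumption: both input and output are UTF-8 text.
                String input = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
                Files.write(target, convert.apply(input).getBytes(StandardCharsets.UTF_8));
                written.add(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

Because each story is a separate file, a loop of this kind yields one output document per article with no manual steps in between.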
Generic XML Option
The simplest way to use the InCopy Converter tool is to convert documents using the default "generic XML" format, and this approach is very appropriate if the XML output is expected to require further extensive processing (using XSLT or other technology). Documentation for the tool describes each of the output elements produced when using this mode.
Simple text blocks (titles, headings and paragraphs) are converted to <Para> elements, with Class attributes that reflect the original style name from the InCopy stylesheet.
Tables are converted to XHTML table tagging.
Lists are identified, and properly wrapped in list and list item elements, by identification of paragraph styles that represent list items and paragraphs within list items.
Image references are also preserved.
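The list handling described above can be sketched in Java as a single pass over the paragraphs. This is only an illustration of the idea, not the tool's implementation. The wrapper element names List and ListItem are hypothetical (the tool's documentation defines the real ones), and the two style names are taken from the earlier sample. The text is assumed to be XML-escaped already.

```java
import java.util.ArrayList;
import java.util.List;

class ListWrapper {

    // Style names taken from the sample output; a real stylesheet may use others.
    static final String ITEM_STYLE = "List Bullet";
    static final String CONTINUATION_STYLE = "Paragraph In List";

    // Each input entry is {styleName, text}. Returns one output line per element or tag.
    static List<String> wrap(List<String[]> paras) {
        List<String> out = new ArrayList<>();
        boolean inItem = false;
        for (String[] p : paras) {
            String style = p[0];
            String para = "<Para Class='" + style + "'>" + p[1] + "</Para>";
            if (style.equals(ITEM_STYLE)) {
                // A list-item style closes the previous item, or opens a new list.
                out.add(inItem ? "</ListItem>" : "<List>");
                out.add("<ListItem>");
                out.add(para);
                inItem = true;
            } else if (style.equals(CONTINUATION_STYLE) && inItem) {
                // A further paragraph inside the current list item.
                out.add(para);
            } else {
                // Any other style ends the list, if one is open.
                if (inItem) {
                    out.add("</ListItem>");
                    out.add("</List>");
                    inItem = false;
                }
                out.add(para);
            }
        }
        if (inItem) {
            out.add("</ListItem>");
            out.add("</List>");
        }
        return out;
    }
}
```

Nested lists would need a stack of open lists rather than a single flag; that is left out here to keep the sketch short.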
Custom XML Option
If a post-processing step is envisaged only because of the need for custom XML output, then an alternative approach can be used to eliminate the need for that further step.
A reference to an XML configuration file can be passed to the tool. The configuration file specifies the XML tagging to be used for each paragraph style, named character style, basic character style (such as bold or italic), and list structures, table structures and footnotes.
With this approach, there is often no need for a subsequent processing step.
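The effect of such a configuration can be sketched in Java as a simple lookup from style name to output element. The real configuration is an XML file governed by the supplied DTD, and its format is not reproduced here. The class CustomTagMapper, its methods and the example tag names are all hypothetical, and the text is assumed to be XML-escaped already.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CustomTagMapper {

    // Style name -> output element name, standing in for the XML configuration file.
    private final Map<String, String> tagForStyle = new LinkedHashMap<>();
    private final String fallbackTag;

    CustomTagMapper(String fallbackTag) {
        this.fallbackTag = fallbackTag;
    }

    // Registers the element to use for one paragraph or character style.
    CustomTagMapper map(String styleName, String tag) {
        tagForStyle.put(styleName, tag);
        return this;
    }

    // Wraps the text in the configured element, or in the fallback for unknown styles.
    String render(String styleName, String escapedText) {
        String tag = tagForStyle.getOrDefault(styleName, fallbackTag);
        return "<" + tag + ">" + escapedText + "</" + tag + ">";
    }

    public static void main(String[] args) {
        CustomTagMapper mapper = new CustomTagMapper("p")
            .map("Heading 1", "title")
            .map("Paragraph Indented", "para");
        System.out.println(mapper.render("Heading 1", "Introduction"));
        System.out.println(mapper.render("Unknown Style", "Some text"));
    }
}
```

When the mapping is applied during conversion, the output already uses the target vocabulary, which is why a separate transformation step is often unnecessary.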
The InCopy Converter ZIP file (only 62k) can be downloaded:
The ZIP file includes the InCopyConverter class file, along with Word documentation. Assuming that configurable output might be of interest, it also includes the configuration DTD and a sample configuration file. XMetaL authoring configuration files can be supplied on request.
Warning: the class file is a trial version of the software that is fully functional but scrambles the document text. The text should look OK from a distance, but individual words will be obviously corrupted on closer inspection. All of the output tagging will be correct, except for any metadata tagging embedded in original paragraphs, which only becomes 'true' XML once it has been processed, so is treated as normal text during the scrambling process.