text-box consulting ltd
Neil Bradley (neil@bradley.co.uk)
Up-Converting to XML - No Magic Bullet!
For many years now, roughly every three months, I have been encouraged to look at yet another new data conversion tool that just might be capable of producing high-quality XML data from source documents that contain little or no structure.
For structured and highly-structured material, such as SGML documents and relational database records, all of the required information will be present, and fully-automated conversions will therefore be possible. But for unstructured documents, such as word-processor and DTP files, there has never been a magic bullet (or data conversion pixie).
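To illustrate why the structured case is the easy one, here is a minimal sketch of a fully-automated conversion from relational records to XML. The table name `articles`, its columns and the element names are all hypothetical, chosen purely for illustration; the point is only that every element name and value can be read directly from the source, with no guesswork.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_to_xml(conn, table, root_tag, row_tag):
    """Map every row of a table to an XML element, one child per column.

    The table name is interpolated directly, so it must come from a
    trusted source; column names are assumed to be valid XML names.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        item = ET.SubElement(root, row_tag)
        for name, value in zip(columns, row):
            ET.SubElement(item, name).text = "" if value is None else str(value)
    return root

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT, author TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'Up-Converting to XML', 'Neil Bradley')")

xml_text = ET.tostring(
    rows_to_xml(conn, "articles", "articles", "article"), encoding="unicode"
)
print(xml_text)
```

Nothing comparable can be written for an uncontrolled word-processor file, because there is no equivalent of the column names to say what each piece of text actually is.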
And if one conversion tool cannot do the job, then chaining together several such tools can only make things worse. For example, when told that conversion from uncontrolled QuarkXPress documents to XML will be far from perfect, someone may suggest introducing a lauded PDF-to-XML product into the process, with conversion of the source Quark documents to PDF format as the first step. This particular scenario will only make the situation worse, because the intended flow of the narrative may be lost when the source documents use multiple columns, text boxes and other complex page layouts.
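The loss of narrative flow can be shown with a deliberately simplified model. The sketch below assumes a page reduced to positioned text fragments and a naive extractor that reads strictly top-to-bottom, then left-to-right; the coordinates, the sample text and both function names are invented for illustration and do not describe how any particular product works.

```python
# Each fragment is (x, y, text); y grows down the page.
# Hypothetical two-column layout with the column gap around x = 200.
fragments = [
    (50, 100, "Column one, line one."),
    (50, 120, "Column one, line two."),
    (300, 100, "Column two, line one."),
    (300, 120, "Column two, line two."),
]

def naive_reading_order(frags):
    """Read top-to-bottom, then left-to-right, ignoring the column layout."""
    return [text for x, y, text in sorted(frags, key=lambda f: (f[1], f[0]))]

def column_aware_order(frags, column_boundary):
    """Read the left column in full, then the right column.

    The boundary is supplied by a person who has looked at the page,
    which is exactly the knowledge the positioned fragments do not carry.
    """
    return [
        text
        for x, y, text in sorted(
            frags, key=lambda f: (f[0] >= column_boundary, f[1], f[0])
        )
    ]

print(naive_reading_order(fragments))
print(column_aware_order(fragments, column_boundary=200))
```

The naive order interleaves the two columns line by line, while the correct order depends on a boundary that somebody had to provide. Real layouts with text boxes and irregular columns offer no such single number.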
We just have to accept that the benefits of high-quality XML data come at a cost. That means semi-automated processes, at best. This is truly a case of "no pain, no gain".