SGML, HTML and XML - Confused?
In June 1998, Document World published SGML, HTML and XML - Confused? (vol. 3, iss. 3). The text of this article has been reproduced below. Looking back more than six years later, there is little in the article to be embarrassed about today, except for the expectation that SGML would continue to play some role, rather than be demolished by XML, as it quickly was.
Article Content
SGML, HTML and XML all end with the same two letters, 'M' and 'L', which stand for Markup Language. Text documents are marked up with tags to describe the format or meaning of the content. Traditionally, this markup has been focussed on the required appearance of the published text. The language referred to in such cases comprises the syntax of the tags and a list of the keywords and parameters needed to specify the format of a document.
In 1986, the International Organization for Standardization (ISO) released the Standard Generalized Markup Language (SGML) - a vendor- and platform-neutral format. The language part of SGML is a set of commands for defining application-specific tag-sets, creating a Document Type Definition (DTD). The DTD states, amongst other things, which tags are allowed, where they may be used and which are optional or repeatable. Industry-wide and organisation-specific DTD templates have been defined for technical manuals, reference books, journal articles, patents, medical records and many others.
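As a rough illustration of what a DTD expresses, the sketch below shows a small hypothetical DTD fragment in a comment (the element names article, title, author and para are invented for this example) and mirrors its content model as a regular expression in Python. It is not an SGML parser or validator, and the sample documents use XML syntax only because the Python standard library has no SGML parser. It simply shows how 'allowed', 'optional' and 'repeatable' become a rule over the sequence of child tags.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD fragment, shown for reference only (it is not parsed here):
#   <!ELEMENT article (title, author+, para*)>
#   <!ELEMENT title   (#PCDATA)>
#   <!ELEMENT author  (#PCDATA)>
#   <!ELEMENT para    (#PCDATA)>
# "," means "followed by", "+" means one or more, "*" means zero or more.

# The same content model written as a regex over space-terminated child names.
ARTICLE_MODEL = re.compile(r"(title )(author )+(para )*")


def matches_article_model(xml_text: str) -> bool:
    """Return True if the children of <article> follow the content model."""
    root = ET.fromstring(xml_text)
    if root.tag != "article":
        return False
    child_names = "".join(child.tag + " " for child in root)
    return ARTICLE_MODEL.fullmatch(child_names) is not None


valid = "<article><title>T</title><author>A</author><para>P</para></article>"
no_author = "<article><title>T</title><para>P</para></article>"

print(matches_article_model(valid))      # True
print(matches_article_model(no_author))  # False
```

The second document is rejected because the model requires at least one author, whereas a document with no para elements at all would still be accepted.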
In 1991 a document linking service, the World Wide Web, was added to the Internet. Every Web document is coded in the HyperText Markup Language (HTML). This markup language includes one tag to support the hypertext linking functionality, and a number of others to format headings, paragraphs, lists and tables.
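To make the distinction between the linking tag and the formatting tags concrete, here is a minimal sketch using Python's standard html.parser module. The page content and the example.com addresses are made up for illustration. The heading, paragraph and list tags only describe layout, while the a tag with its href attribute carries the hypertext link.

```python
from html.parser import HTMLParser

# A small, made-up HTML page: <h1>, <p>, <ul> and <li> describe formatting,
# while <a href="..."> is the tag that provides hypertext linking.
PAGE = """
<h1>Markup Languages</h1>
<p>Read the <a href="http://example.com/sgml.html">SGML overview</a> first.</p>
<ul>
  <li><a href="http://example.com/xml.html">XML notes</a></li>
  <li>No link here</li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
# ['http://example.com/sgml.html', 'http://example.com/xml.html']
```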
The advent of multiple publishing media, including CD-ROM and the Internet, coupled with the increasing need to re-package information for niche markets, has put a huge strain on the traditional approach. Publishers have found that there are substantial costs associated with re-formatting material for each medium and target audience. In order to extract, manipulate and format material automatically, a different approach is required. The document must be 'self-describing'. This means that every identifiable fragment of the information is tagged by name, not by its intended appearance in one output medium. With this approach, stylesheets are needed to specify output formats, and new stylesheets can easily be developed for each audience and publishing medium.
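The idea of one self-describing source and several stylesheets can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the recipe document, its element names, and the two dictionaries standing in for stylesheets. A real system would use a proper stylesheet language such as CSS or XSL rather than string templates.

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing fragment: tags name the content, not its look.
RECIPE = """
<recipe>
  <name>Pancakes</name>
  <ingredient>Flour</ingredient>
  <ingredient>Milk</ingredient>
  <warning>Pan will be hot</warning>
</recipe>
""".strip()

# Two toy "stylesheets": each maps an element name to an output template.
WEB_STYLE = {
    "name": "<h1>{}</h1>",
    "ingredient": "<li>{}</li>",
    "warning": "<p><b>{}</b></p>",
}
PLAIN_STYLE = {
    "name": "{}\n========",
    "ingredient": "* {}",
    "warning": "WARNING: {}",
}


def render(xml_text: str, style: dict) -> str:
    """Format each child element using the template for its tag name."""
    root = ET.fromstring(xml_text)
    lines = [style[child.tag].format(child.text) for child in root]
    return "\n".join(lines)


print(render(RECIPE, WEB_STYLE))
print(render(RECIPE, PLAIN_STYLE))
```

The source document is never touched when a new output is needed. Supporting another medium or audience means writing another mapping, which is the cost saving described above.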
Two problems have emerged. First, SGML is a large, complex standard, with many optional features, some rarely if ever used. Second, HTML has no capability to add self-describing, customised tags. The solution was a combination of simplicity and power - the eXtensible Markup Language (XML), which was launched on February 10th 1998. XML superficially resembles HTML, but incorporates the core features of SGML. Beyond the publishing sector, it is seen as an ideal format for the exchange of rich data between systems and databases, and has already been adopted by Microsoft in its Web publishing technology.
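The data exchange use can be illustrated with a short round trip, again only as a sketch. The order record and its field names are hypothetical, and Python's standard xml.etree.ElementTree module stands in for whatever XML tools the two systems would really use. One side writes the record with custom, self-describing tags and the other side reads it back.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are invented for this example.
order = {"customer": "A. Smith", "item": "Widget", "quantity": "3"}


def to_xml(record: dict) -> str:
    """Serialise a flat record as XML, one custom element per field."""
    root = ET.Element("order")
    for field, value in record.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text: str) -> dict:
    """Rebuild the record on the receiving system."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


message = to_xml(order)
print(message)
print(from_xml(message) == order)  # True
```

The message looks much like HTML, but the tag names come from the application rather than from a fixed list, which is the extensibility that HTML lacks.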
So, is there room for all three languages? Some organisations will continue to use SGML for core data storage and management, converting to XML or HTML for output. Small, marketing-oriented Web sites have no reason to drop HTML. Others will adopt XML, perhaps converting to HTML for Internet publishing until Web browsers become fully XML-aware. In the longer term, I believe that the distinctions will disappear as XML broadens its scope and harmonises with a future version of SGML, and HTML becomes the default application of this new language.