XML Tutorial
Lesson One: The History of XML

Before there was electronic word processing, manuscripts awaiting publication would generally be commented with formatting instructions for the typographer. This was called "markup".

These days, however, markup is generally done electronically 'on the fly' as you create the document on your computer using a word processing application. This is called "procedural markup" because each code tells the software what procedure to carry out on the text (switch to bold, indent, change font) rather than what kind of text it is.

Any time you use your word processor's formatting tools to bold a word or indent a paragraph, you are performing procedural markup. That is, you are adding formatting instructions to the plain text. These instructions, stored as special codes alongside the text they affect, tell the output device (monitor or printer) whether to display the text in Arial or Times Roman, whether to bold words, whether to center headings, and so on.

Before you decide what formatting to apply to your text, you first analyze the structure of your document. Is it a formal business letter? Is it a presentation, a user training manual, or simply a memo? Each of these documents has certain generally accepted 'configurations', and within those configurations or design structures are "style elements".

For example, a user training manual generally has a Title Page, a Table of Contents, Chapters, Glossary, etc. Within each larger element are subelements. A Chapter element may contain the following subelements: heading levels 1-5, body text, indented body text, etc.

The problem with procedural markup is that it is usually "system dependent". If you move the document from one platform to another, importing it into a different word processor, for instance, generally all the formatting is lost. It is also highly dependent on the ability of the author to understand the structure of the document and apply markup consistently.


Here is a great example of Style Tags (or "procedural markup" codes) for book design. On the top right of their page is a link to a downloadable PDF listing their style tags in a nice columnar format. To download the file, right click on the link and choose "save to file".

As a natural progression, the limitations of procedural markup were addressed by 'generic code' such as that used by TeX. TeX is a language used in a variety of typesetting environments. It uses embedded codes or 'tags' within the text of the document to initiate changes in layout, and it can describe elaborate scientific formulas. Formatting rules are then associated with the tags.

Quark Xpress, a popular page layout program, uses Quark tags, which are another example of a generic code language. Because Quark Xpress is such a popular electronic publishing application, a number of 'after-market' conversion tools have been developed for it. Currently there are several Quark Tags to HTML converters, as well as tools for converting between Quark Tags and databases. It's almost as though Quark were itself a precursor to XML. About five years ago, Aunty herself wrote a Foxpro to Quark Tag conversion program for Yellowpage ads and telephone listings, at a tidy profit, thank you!

Generic code is more flexible than procedural markup because to change a document's appearance all you have to do is change the macro containing the code. The code changes are then automatically reflected in the document. Generic code languages are also generally more portable from one platform to another. For example, you can create a document and format it with Quark Xpress tags in any word processing application and then export the file as ASCII text into the Quark Xpress application on Windows or Macintosh platforms.
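The 'change it in one place' idea can be sketched in a few lines of Python. This is only an illustration: the `@tag:` line syntax, the rule names, and the `render` function are all invented for this example and are not real TeX or Quark tag syntax.

```python
import re

# Hypothetical formatting rules, defined once and kept apart from the text.
RULES = {
    "title": {"font": "Arial", "bold": True},
    "body": {"font": "Times Roman", "bold": False},
}

# Text marked up with made-up generic tags: each line starts with @tagname:
DOCUMENT = """@title:The History of XML
@body:Before there was electronic word processing...
@body:These days markup is done electronically."""

TAG_LINE = re.compile(r"^@(\w+):(.*)$")


def render(document, rules):
    """Return (text, formatting) pairs by looking up each line's tag."""
    output = []
    for line in document.splitlines():
        match = TAG_LINE.match(line)
        if match is None:
            continue  # skip lines that carry no tag
        tag, text = match.groups()
        output.append((text, rules[tag]))
    return output


for text, formatting in render(DOCUMENT, RULES):
    print(formatting, text)
```

Changing the `body` entry in `RULES` changes every body line at once, and the text itself is never touched. That is the whole appeal of generic code over procedural markup.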

SGML, or Standard Generalized Markup Language, was the natural successor to 'generic code' languages. SGML is a standard (ISO 8879) published in 1986 by the International Organization for Standardization (ISO), based on concepts developed at IBM. The difference between 'generic coding' and SGML is that generic coding depicts a document's appearance, whereas SGML describes a document's structure (a memo, a presentation, a user manual, etc.) and leaves the formatting to be defined separately. SGML also adheres to a model which is similar to a database schema. This means that an SGML document can be stored in a database or processed by software designed to interpret the model.

SGML is a comprehensive language that can even define hypertext links. (If you don't know what hypertext links are, you need to take a web beginners' class.)

To interpret the markup in an SGML document, SGML relies on definitions in a separately created Document Type Definition (DTD), often called an "SGML application". Or to put it another way, the document structure is written in a DTD. A DTD specifies a set of elements, their relationships, and the tag set used to mark up the document. Thus the markup within the document follows a 'model' described in the DTD. As a result, SGML is often called a metalanguage. A metalanguage 'describes another language'; in SGML's case, that language is the set of tags embedded in the text.
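To make the 'model' idea concrete, here is a rough Python sketch. It is a simplified stand-in, not a real DTD validator: Python's standard `xml.etree.ElementTree` module parses markup but does not validate it against a DTD, so the `MODEL` dictionary and the `follows_model` function are hypothetical helpers invented for this example. The sample documents use XML-style syntax, and the memo element names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: which child elements each element may contain.
# A real DTD would declare this as, e.g., <!ELEMENT memo (to, from, body)>.
MODEL = {
    "memo": {"to", "from", "body"},
    "to": set(),
    "from": set(),
    "body": set(),
}


def follows_model(element, model):
    """Return True if every element is declared and holds only allowed children."""
    if element.tag not in model:
        return False
    for child in element:
        if child.tag not in model[element.tag]:
            return False
        if not follows_model(child, model):
            return False
    return True


good = ET.fromstring(
    "<memo><to>Staff</to><from>Aunty</from><body>Hello</body></memo>"
)
bad = ET.fromstring("<memo><to>Staff</to><chapter>Oops</chapter></memo>")

print(follows_model(good, MODEL))
print(follows_model(bad, MODEL))
```

The first document uses only elements the model allows inside a memo. The second sneaks in a `chapter`, which the model never declared, so it fails the check. A real DTD goes further and can also pin down the order and number of child elements.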

SGML itself doesn't impose a structure on documents, but some document structures, such as HTML, are maintained by standards bodies as public standards in the form of SGML DTDs.

HTML (HyperText Markup Language) was originally developed to be a standard for defining hypertext links between documents. It is frequently described as "a subset of SGML", but it was originally designed as a separate and distinct markup language, in the style of SGML, for electronic document management. Its designer was Tim Berners-Lee, who was working at the time at CERN, the European particle physics laboratory near Geneva, Switzerland.

About five years later, HTML was defined as an SGML DTD as a markup language for Web documents. HTML has changed over the years. It's no longer used just to define the structure of a document the way SGML does. These days it's also used as a graphic design tool to display the document. Further, with the addition of style sheets and class attributes, it's grown into a true procedural coding language, but it's still too inflexible and inefficient to handle the formatting changes required of even a midsize business website, for example.

That's why XML (eXtensible Markup Language) was developed by the W3C (World Wide Web Consortium). XML is a scaled-down form of SGML. Extensible means "capable of being expanded or customized". It's a document processing standard specifically designed to overcome the limitations in HTML. Unlike HTML, XML has no predefined tags. Instead, it lets designers define their own document markup tags. HTML tags (for example, <OL> and <BODY>) are static, with a fixed definition in the HTML standard. XML tags are configurable according to their creator's preference. For example, <Main-Heading> and <BillboardFont> are XML tags that are defined through their creator's own DTD and stylesheets and then applied to as many XML documents as desired.
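Here is a small Python sketch of what that looks like in practice, using the standard `xml.etree.ElementTree` module. The `<Page>` wrapper, the sample text, and the `STYLES` dictionary are assumptions made up for illustration; a real project would use a proper stylesheet language rather than a Python dictionary.

```python
import xml.etree.ElementTree as ET

# A tiny XML document using the custom tags from the text.
# The <Page> wrapper element and the sample wording are invented.
XML_TEXT = """<Page>
  <Main-Heading>Grand Opening</Main-Heading>
  <BillboardFont>Everything must go!</BillboardFont>
</Page>"""

# Hypothetical 'stylesheet': the tag's creator decides what each tag means.
STYLES = {
    "Main-Heading": "24pt bold",
    "BillboardFont": "72pt Impact",
}


def styled_lines(xml_text, styles):
    """Pair each child element's tag and text with the style defined for it."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, child.text, styles.get(child.tag, "default"))
        for child in root
    ]


for tag, text, style in styled_lines(XML_TEXT, STYLES):
    print(f"{tag}: {text!r} shown as {style}")
```

Notice that the parser has no idea what a `BillboardFont` is, and it doesn't need to. The tags only carry meaning once their creator supplies the definitions, and the same definitions can then be reused across any number of documents.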

XML USES

There are two classifications of XML applications:

  1. Documents that contain information for users
  2. Documents that contain information for database manipulation

The uses of XML in the publishing industry are fairly obvious. However, XML will also make the very complicated world of EDI much simpler because of its 'generic semantics', which can easily be changed to fit the circumstances. EDI stands for "Electronic Data Interchange" and is used today to automate business-to-business document flow (purchase orders, invoices, inventory tracking, etc.). We'll cover both uses as we continue with this tutorial.
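As a taste of the second kind of application, here is a hedged Python sketch of a program reading a purchase order written in XML. The element and attribute names (`PurchaseOrder`, `Item`, `sku`, `quantity`, `unitPrice`) are invented for this example and do not come from any real EDI standard.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# A made-up purchase order; real EDI vocabularies define their own tags.
PURCHASE_ORDER = """<PurchaseOrder number="1001">
  <Item sku="A-17" quantity="3" unitPrice="2.50"/>
  <Item sku="B-42" quantity="1" unitPrice="10.00"/>
</PurchaseOrder>"""


def order_total(xml_text):
    """Add up quantity times unit price for every Item in the order."""
    root = ET.fromstring(xml_text)
    return sum(
        int(item.get("quantity")) * Decimal(item.get("unitPrice"))
        for item in root.findall("Item")
    )


print(order_total(PURCHASE_ORDER))
```

Because the data is labeled by what it is rather than by where it sits in the file, the receiving program can pull out exactly the fields it needs, and either side can add new elements without breaking the other.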




Copyright © 1998 – 2000 Violet Weed, Inc. – Microsoft / Novell / Oracle / Sun Certified