Structured Authoring and XML: Part 1 of 3
Published
2003, Q3 (February 21, 2007)
By Sarah O'Keefe

Read Part Two and Part Three

Implementing structured authoring with XML allows organizations to create better content. The addition of hierarchy and metadata to content improves reuse and content management. These benefits, however, must be weighed against the time and money required to implement a structured authoring approach. The business case is compelling for larger writing organizations; they will be the first to adopt structured authoring. Over time, improvements in available tools will reduce the cost of implementing structured authoring and make it affordable for smaller organizations.

This article provides an introduction to structured authoring and XML. Future articles will cover topics such as the impact of structured authoring on a publishing workflow, workflow options, developing a business case for structured authoring and XML, and determining whether your organization needs structure.

What is structured authoring?

Structured authoring is a publishing workflow that lets you define and enforce consistent organization of information in documents, whether printed or online. In traditional publishing, content rules are captured in a style guide and enforced by (human) editors, who read the information and verify that it conforms to the approved style.

A few simple examples of content rules are as follows:
  • A heading must be followed by an introductory paragraph.
  • A bulleted list must contain at least two items.
  • A graphic must have a caption.

In structured authoring, these rules are captured in a structure definition document. Authors work in software that validates their documents; the software verifies that the documents they create conform to the rules in the structure definition document.

Consider, for example, a simple structured document: a recipe. A typical recipe requires several components: a name, a list of ingredients, and instructions. The style guide for a particular cookbook states that the list of ingredients should always precede the instructions. In an unstructured authoring environment, the cookbook editor must review the recipes to ensure that the author has complied with the style guideline. In a structured environment, the recipe structure itself requires that organization, and the authoring software flags any recipe that violates it.

Elements and hierarchy

Structured authoring is based on elements. An element is a unit of content; it can contain text or other elements. You can view the hierarchy of elements inside other elements as a set of nodes and branches.

Elements can be organized in hierarchical trees. In a recipe, the ingredient list can be broken down into ingredients, which in turn contain items, quantities, and preparation methods, as shown in Figure 1.
Figure 1: Recipe hierarchy


The element hierarchy allows you to associate related information explicitly. The structure specifies that the IngredientList element is a child of the Recipe element. The IngredientList element contains Ingredient elements, and each Ingredient element contains two or three child elements (Item, Quantity, and optionally Preparation). In an unstructured, formatted document, these relationships are implied by the typography, but unstructured publishing software (word processors or desktop publishing tools) does not capture the actual relationship.

In structured documents, the following terms denote hierarchical relationships:
  • Tree: The hierarchical order of elements.
  • Branch: A section of the hierarchical tree.
  • Leaf: An element with no descendant elements. Name, for example, is a leaf element in Figure 1.
  • Parent/child: A child element is one level lower in the hierarchy than its parent. In Figure 1, Name, IngredientList, and Instructions are all children of Recipe. Conversely, Recipe is the parent of Name, IngredientList, and Instructions.
  • Sibling: Elements are siblings when they are at the same level in the hierarchy and have the same parent element. Item, Quantity, and Preparation are siblings.
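
The relationships in this list can be explored in code. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (the article does not prescribe a tool). The element names come from Figure 1; the recipe text is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Build the tree from Figure 1. The text values are hypothetical.
recipe = ET.Element("Recipe")
ET.SubElement(recipe, "Name").text = "Greek Salad"
ingredient_list = ET.SubElement(recipe, "IngredientList")
ingredient = ET.SubElement(ingredient_list, "Ingredient")
ET.SubElement(ingredient, "Item").text = "cucumber"
ET.SubElement(ingredient, "Quantity").text = "1"
ET.SubElement(ingredient, "Preparation").text = "sliced"
ET.SubElement(recipe, "Instructions").text = "Combine and serve."


def is_leaf(element):
    """A leaf is an element with no child elements."""
    return len(element) == 0


# Parent/child: iterating over an element yields its children in order.
children_of_recipe = [child.tag for child in recipe]

# ElementTree keeps no parent pointer, so build a child-to-parent map.
parent_of = {child: parent for parent in recipe.iter() for child in parent}

# Siblings: the other children of the same parent.
item = ingredient.find("Item")
siblings_of_item = [e.tag for e in parent_of[item] if e is not item]
```

Here children_of_recipe lists Name, IngredientList, and Instructions, and siblings_of_item lists Quantity and Preparation, matching the definitions above.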

Element attributes

You can store additional information about the elements in attributes. An attribute is a name-value pair that is associated with a particular element. In the recipe example, attributes might be used in the top-level Recipe element to provide additional information about the recipe, such as the author and cuisine type (Figure 2).
Figure 2: Attributes capture additional information about an element


Attributes provide a way of further classifying information. If each recipe has a cuisine assigned, you could easily locate all Greek recipes by searching for the attribute. Without attributes, this information would not be available in the document. To sort recipes by cuisine in an unstructured document, a culinary expert would need to read each recipe.
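
As a rough illustration of searching by attribute, the sketch below uses Python's standard xml.etree.ElementTree module. The Cookbook wrapper element, the attribute names author and cuisine, and the recipe data are all assumptions made for this example; they are not taken from Figure 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical cookbook; element and attribute names are illustrative.
COOKBOOK = """
<Cookbook>
  <Recipe author="A. Cook" cuisine="Greek"><Name>Greek Salad</Name></Recipe>
  <Recipe author="B. Baker" cuisine="French"><Name>Quiche</Name></Recipe>
  <Recipe author="A. Cook" cuisine="Greek"><Name>Moussaka</Name></Recipe>
</Cookbook>
"""

root = ET.fromstring(COOKBOOK)

# Select Recipe children whose cuisine attribute equals "Greek".
greek_recipes = root.findall("Recipe[@cuisine='Greek']")
greek_names = [r.findtext("Name") for r in greek_recipes]
```

No one has to read the recipes to classify them; the query relies only on the attribute value.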

Formatting structured documents

To format structured documents, you associate formatting with particular elements or element sequences. Such formatting is usually highly automated; once an author assigns elements to content, the formatting is implemented automatically to create the final output files.
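
One way to picture element-based formatting is a lookup table from element names to formatting rules. The sketch below is a toy plain-text formatter in Python, not how any particular publishing tool works; the FORMATS table, the render function, and the sample recipe are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical formatting rules, keyed by element name.
FORMATS = {
    "Name": lambda text: text.upper(),       # title in capitals
    "Item": lambda text: "  * " + text,      # bulleted ingredient
    "Instructions": lambda text: "\n" + text,  # blank line before the body
}


def render(root):
    """Apply the rule for each element, in document order."""
    lines = []
    for element in root.iter():
        formatter = FORMATS.get(element.tag)
        if formatter and element.text:
            lines.append(formatter(element.text.strip()))
    return "\n".join(lines)


sample = ET.fromstring(
    "<Recipe><Name>Greek Salad</Name>"
    "<IngredientList><Ingredient><Item>cucumber</Item></Ingredient></IngredientList>"
    "<Instructions>Combine.</Instructions></Recipe>"
)
output = render(sample)
```

The author only assigns elements; changing a rule in the table changes the output for every document that uses that element.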

What is XML?

Extensible Markup Language (XML) defines a standard for storing structured content in text files. The standard is maintained by the World Wide Web Consortium (W3C). See Note 1.

XML is closely related to other markup languages, such as Standard Generalized Markup Language (SGML). Implementing SGML is an enormous undertaking. Because of this, SGML's acceptance has been limited to industries producing large volumes of highly structured information (for example, aerospace, telecommunications, and government).

XML is a simplified form of SGML that's designed to be easier to implement. See Note 2. It's possible to build very complex authoring systems based on XML, but early implementations of XML have been mostly lightweight data-interchange applications.

These relatively small applications have given XML technology critical mass and momentum. Now, XML-based authoring is beginning to move into publishing. In data exchange, structure definitions were needed to describe invoices, inventory, and the like. Publishing applications require much more complex frameworks to represent the structure of technical documents, including parts catalogs, training manuals, reports, and user guides.

XML syntax

XML is a markup language, which means that content is enclosed by tags. In XML, element tags are enclosed in angle brackets:

<element>This is element text.</element>

A closing tag is indicated by a forward slash in front of the element name.

Attributes are stored inside the element tags:

<element my_attribute="my_value">This is element text.</element>
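
To see how software reads this syntax, here is a minimal sketch that parses the example above with Python's standard xml.etree.ElementTree module. The choice of Python and of this parser is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    '<element my_attribute="my_value">This is element text.</element>'
)

tag_name = element.tag        # the name from the opening tag
attributes = element.attrib   # attributes as a name-to-value mapping
text = element.text           # content between the opening and closing tags
```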

XML does not provide a set of predefined tags. Instead, you define your own tags and the relationships among the tags. This makes it possible to define and implement a content structure that matches the requirements of your information. Figure 3 shows an XML file that contains a recipe.
Figure 3: A recipe in XML


XML is said to be well-formed when basic tagging rules are followed. For example:
  • Every opening tag has a corresponding closing tag, and empty elements use a terminating slash:
    <element>This element has content</element>
    <empty_element />
  • Attribute values are enclosed in quotation marks (single or double):
    <element attribute="name">This is a legal attribute</element>
    <element attribute=name>This is not well-formed.</element>
  • Tags are nested and do not "cross over" each other:
    <element>This is <strong> correct. </strong></element>
    <element>This is <strong> not correct. </element></strong>
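
Any XML parser enforces these rules. As a small sketch, assuming Python's standard xml.etree.ElementTree module, the hypothetical helper is_well_formed below reports whether a string parses; the parser raises ParseError when a tagging rule is broken.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


nested_ok = is_well_formed("<element>This is <strong> correct. </strong></element>")
crossed = is_well_formed("<element>This is <strong> not correct. </element></strong>")
unquoted = is_well_formed("<element attribute=name>Not well-formed.</element>")
empty_ok = is_well_formed("<empty_element />")
```

Well-formedness says nothing about whether the elements are the right ones in the right order; that is the job of validation.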

XML is said to be valid when the structure of the XML matches the structure specified in the structure definition. When the structure does not match, the XML file is invalid (Figure 4).
Figure 4: Invalid structure


Entities

An XML entity is a placeholder. Entities allow you to reuse information; for example, you could define an entity for a copyright statement:

<!ENTITY copyright "Copyright 2002 Scriptorium Publishing Services, Inc. All rights reserved.">

To reference the entity, you refer to the entity name:

&copyright;

The entity text is displayed instead of the entity name:

Copyright 2002 Scriptorium Publishing Services, Inc. All rights reserved.

Storing common information in entities lets you make a change in one location (the entity definition) and have the change show up everywhere that references the entity.
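
The following sketch shows the substitution happening. It assumes Python's standard xml.etree.ElementTree module, which expands entities declared in a document's internal DTD subset. The Book, TitlePage, and Footer elements are hypothetical names used only for this example.

```python
import xml.etree.ElementTree as ET

# The entity is declared once, in the DOCTYPE, and referenced twice.
DOCUMENT = """<!DOCTYPE Book [
  <!ENTITY copyright "Copyright 2002 Scriptorium Publishing Services, Inc. All rights reserved.">
]>
<Book>
  <TitlePage>&copyright;</TitlePage>
  <Footer>&copyright;</Footer>
</Book>"""

root = ET.fromstring(DOCUMENT)

# The parser has already replaced each reference with the entity text.
title_page_text = root.findtext("TitlePage")
footer_text = root.findtext("Footer")
```

Editing the single ENTITY declaration changes the text in both places the next time the document is parsed.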

Entities are also used to include information that can't be easily rendered as text. Graphics, for example, are usually referenced as entities. In the following example, the entity definition contains the entity name, graphic file name, and file type:

<!ENTITY my_image SYSTEM "image.gif" NDATA gif>

In the XML file, a Graphic element references this entity:

<Graphic entity = "my_image" />

Structured authoring is a concept. XML is a technology (or, more precisely, a specification) that lets you implement structured authoring using plain text files.

Most structured authoring implementations that are already in place use SGML; most of those now under development appear to be XML-based. The terms XML and structured authoring are often used almost interchangeably.

Unlike SGML, XML has found wide acceptance outside the technical publishing world. XML is quickly becoming the standard for data interchange and web services applications.

Defining structure in XML

In XML, you define your structure using either a document type definition (DTD) or a schema. In either case, you specify elements and how they are related to each other. For example, a Recipe element definition might read as follows in a DTD:

<!ELEMENT Recipe (Name, History?, IngredientList, Instructions)>

In an XML schema, the definition is itself an XML document. For the Recipe element, a simplified Recipe definition would read as follows:

<xsd:complexType name="Recipe">
  <xsd:sequence>
    <xsd:element name="Name" type="xsd:string"/>
    <xsd:element name="History" type="xsd:string" minOccurs="0" maxOccurs="1"/>
    <xsd:element name="IngredientList" type="xsd:string"/>
    <xsd:element name="Instructions" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>

Once you define the structure in either a DTD or a schema, authors create documents that comply with the structure. At a bare minimum, this allows you to specify, for instance, that the list of ingredients in a recipe must occur before the instructions.
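
To make the idea of validation concrete, here is a hand-rolled sketch in Python that checks only the Recipe content model from the DTD example, (Name, History?, IngredientList, Instructions). It is not a DTD or schema validator, and the function name recipe_is_valid is hypothetical; real projects would rely on a validating parser or authoring tool.

```python
import re
import xml.etree.ElementTree as ET

# Content model (Name, History?, IngredientList, Instructions),
# written as a pattern over a comma-terminated list of child names.
RECIPE_MODEL = re.compile(r"Name,(History,)?IngredientList,Instructions,")


def recipe_is_valid(xml_text):
    """Check that the Recipe element's children match the content model."""
    root = ET.fromstring(xml_text)
    if root.tag != "Recipe":
        return False
    child_sequence = "".join(child.tag + "," for child in root)
    return RECIPE_MODEL.fullmatch(child_sequence) is not None


in_order = recipe_is_valid(
    "<Recipe><Name/><IngredientList/><Instructions/></Recipe>"
)
out_of_order = recipe_is_valid(
    "<Recipe><Name/><Instructions/><IngredientList/></Recipe>"
)
```

Both sample documents are well-formed, but only the first is valid, because the second places Instructions before IngredientList.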

DTDs were first created for SGML and have been available for quite some time. Schemas are newer: XML Schema became a W3C Recommendation in May 2001, and the number of schema-aware applications is still limited. The main advantage of a schema is that you can control the content allowed inside an element. You could, for example, define a zip code element that requires a number:

<xsd:element name="zip" type="xsd:decimal"/>

DTDs do not provide you with control over the information in an element. An equivalent DTD statement would read as follows:

<!ELEMENT zip (#PCDATA)>

The #PCDATA rule allows any text. Schemas are especially useful in XML-based programming applications, where they allow you to validate and restrict data inside the structure. DTDs are more common in publishing applications, partly because of their SGML legacy. In long technical documents that consist mostly of paragraphs, the data validation provided by schemas would not add significant value.
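
The practical difference between the two zip definitions can be sketched in a few lines of Python. The functions below only mimic the two rules; the pattern is a simplified reading of the xsd:decimal lexical form (optional sign, digits, optional fraction), and both function names are hypothetical.

```python
import re

# Simplified lexical form of xsd:decimal.
XSD_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def accepted_by_schema(zip_text):
    """Mimics type="xsd:decimal": the content must be a decimal number."""
    return XSD_DECIMAL.fullmatch(zip_text.strip()) is not None


def accepted_by_dtd(zip_text):
    """Mimics (#PCDATA): any text is allowed."""
    return True


good_zip = "27709"
typo_zip = "277O9"  # letter O typed instead of zero

schema_catches_typo = not accepted_by_schema(typo_zip)
dtd_catches_typo = not accepted_by_dtd(typo_zip)
```

The schema-style check rejects the mistyped value, while the DTD-style check accepts it, which is the kind of data validation that matters more for data interchange than for running prose.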

Notes
  1. Detailed information: http://www.w3.org/XML/
  2. SGML vs. XML details: http://www.w3.org/TR/NOTE-sgml-xml-971215



Sarah O'Keefe is President of Scriptorium. She can be reached at okeefe at scriptorium dot com.