First Draft
10/16/2005; 1:23:22 AM
The question of whether certain information should be encoded in XML using elements or using attributes predates XML: it was already an issue in SGML. Robin Cover has created an excellent page containing opinions and practical issues from various people.
In this article I want to introduce the reader to the issues concerning data modelling. What happens to the data model when we switch from attributes to elements? Can you simply do that, and is there even a choice to be made? To answer these questions, we must first know what an attribute means and what an element means.
Robin Cover writes: "Experienced markup-language experts offer different opinions as to whether general principles can be given for choosing attributes over elements, and if so, what principles are most useful. Most agree that it's an 'implementation decision,' which reveals (arguably) that SGML/XML is not an ideal language for data modelling." So maybe there isn't a proper data model. But we find more clues in the commentary by Michael Sperberg-McQueen. He writes:
- use an embedded element when the information you are recording is a constituent part of the parent element
- use an attribute when the information is inherent to the parent but not a constituent part (one's head and one's height are both inherent to a human being, i.e. you can't be a conventionally structured human being without having a head, and having a height, but one's head is a constituent part and one's height isn't -- you can cut off my head, but not my height)
What Michael describes here is the 'attributed tree' data model. To avoid confusion, let's call the attributes in the data model properties. An attributed tree is then a tree of nodes, each node having a set of properties and a list of child nodes. The mapping seems simple: elements are nodes, and attributes are properties. The problem is that attribute values can only be strings, and strings have no structure. In practice this has proven too restrictive: property values often need some structure, so they have to be stored in elements.
When we try to parse elements and attributes into this data model, it isn't clear what to do with the element names. Attribute names map to property names, attribute values to property values, and each element to a child node. It turns out that this lack of a fixed meaning for the element name is what makes it hard to map XML to a data model like RDF. Users are free to label elements according to their own rules, which may seem logical to one user but are often incompatible with the rules of another.
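The mapping described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the Node class, the to_attributed_tree function and the sample document are hypothetical, and text content is ignored. Note that the element name has no slot of its own in the model, so the sketch can only carry it along as an uninterpreted label.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Node:
    """One node of an attributed tree: properties plus ordered child nodes."""
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    # The model has no place for the element name, so it is kept
    # here as a label without any defined meaning.
    label: str = ""


def to_attributed_tree(element):
    """Map attributes to properties and child elements to child nodes."""
    return Node(
        properties=dict(element.attrib),
        children=[to_attributed_tree(child) for child in element],
        label=element.tag,
    )


# Hypothetical sample: height is a property, head is a constituent part.
doc = ET.fromstring('<person height="180"><head size="58"/></person>')
tree = to_attributed_tree(doc)
print(tree.properties)
print(tree.children[0].label, tree.children[0].properties)
```

Everything except the label field follows directly from the model. The label is where the ambiguity discussed in the rest of this article lives.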
Let's view an example for each of the possible meanings in the context of an attributed tree:
Property names:
<firstname>...</firstname> <lastname>...</lastname> <address>...</address>
Property values:
<Washington>...</Washington> <Adams>...</Adams> <Jefferson>...</Jefferson> <Madison>...</Madison>
Here the element names are actually the value of the lastname property. The element name functions as an index, and is practical for XPath expressions.
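To see why this form is convenient for path expressions, here is a small comparison using the limited XPath support in Python's xml.etree.ElementTree. The wrapper elements (presidents, person) and the firstname content are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# Element names used as the value of the lastname property.
by_name = ET.fromstring(
    "<presidents>"
    "<Washington><firstname>George</firstname></Washington>"
    "<Adams><firstname>John</firstname></Adams>"
    "</presidents>"
)

# The same data with lastname stored as an attribute instead.
by_attribute = ET.fromstring(
    "<presidents>"
    "<person lastname='Washington'><firstname>George</firstname></person>"
    "<person lastname='Adams'><firstname>John</firstname></person>"
    "</presidents>"
)

# With names as values, the path reads like an index lookup.
short_path = by_name.find("Adams/firstname")

# With an attribute, the same lookup needs a predicate.
long_path = by_attribute.find("person[@lastname='Adams']/firstname")

print(short_path.text, long_path.text)
```

Both lookups find the same value. The first is shorter, but it only works because a property value has been moved into the element name.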
List indexes:
<_1>...</_1> <_2>...</_2> <_3>...</_3>
This is pretty useless because counting is better left to the computer. But it could be used to ensure that the element order is preserved.
List values:
<x/><plus/><y/><times/><pi/>
This is a possible representation of a list of tokens.
No meaning at all:
<item>...</item> <item>...</item> <item>...</item>
To make things even worse (from a data modelling perspective), in practice you'll find that several forms are mixed.
The second kind (element names as property values) is used most often, especially for the type property. RDF specifically allows this, through the third abbreviated form.
The XML specification says "The Name in the start- and end-tags gives the element's type." To be clear: this refers to the type of the element, which does not have to be the type of the content of the element. Later, in XML Schema, the element name and its type are completely decoupled. In RDF the element name can either be the value of the rdf:type property or a property name, depending on context.
We have seen that there are two problems when trying to match the attributed tree with the XML syntax:
- Structural requirements prevent the use of attributes for all properties.
- The lack of proper meaning of the element name leads to storing various kinds of information from the attributed tree into the element name.
One possible solution is to do whatever you want, and to add information (possibly through a schema) about what each element name means. This can be different for each possible combination of parent and child element. It's not an elegant solution, but it would at least make all current XML files more meaningful to a computer, and maybe even parsable into an RDF database.
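As a rough sketch of what such added information could look like, the table below records the meaning of a child's element name for each parent and child combination. The table format, the "*" wildcard, the role names and the sample document are all invented for this illustration and do not come from any existing schema language.

```python
import xml.etree.ElementTree as ET

# Hypothetical side table: (parent name, child name) -> meaning of the
# child's element name. "*" stands for any name.
ROLES = {
    ("people", "*"): "property value (lastname)",
    ("*", "firstname"): "property name",
}


def role_of(parent_name, child_name):
    """Look up the most specific entry, falling back to the wildcards."""
    candidates = (
        (parent_name, child_name),
        (parent_name, "*"),
        ("*", child_name),
    )
    for key in candidates:
        if key in ROLES:
            return ROLES[key]
    return "unknown"


def annotate(element):
    """Yield (parent name, child name, role) for every parent/child pair."""
    for child in element:
        yield element.tag, child.tag, role_of(element.tag, child.tag)
        yield from annotate(child)


doc = ET.fromstring(
    "<people><Adams><firstname>John</firstname></Adams></people>"
)
annotations = list(annotate(doc))
for parent, child, role in annotations:
    print(parent, child, role)
```

With a table like this, a program can tell that Adams is a value and firstname is a name, although nothing in the XML syntax itself says so.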
Another solution is given by Eliot Kimber: "Many DTD designers provide a metadata container that distinguishes metadata for an element from its content". This design can be recognized in the <head> and <body> elements in HTML. This solution is usually not applied throughout the whole document.
A third possible solution emerged in the SML-DEV discussion group. A large portion of the group had agreed on the minimalistic syntax of Minimal XML, a stripped-down version of XML that leaves only elements. Trying to work with this format made the overloaded use of the element name very clear, and the group could not agree on a simple data model for Minimal XML. Both the list model (normal indexed arrays) and the map model (associative arrays) are required, and there are only elements to express them with. The solution was as follows:
- For lists: Reserve one element name. This name is then used for the elements that build up the list of child nodes. In the SML-DEV group this name was "_" (the underscore), for 'historic' reasons, but the underscore also represents the unnamed nature of lists.
- For maps: All other element names are used to represent the properties. Between sibling elements, these names must be unique.
This leaves a very small subset of all possible XML documents, one that is easy to parse, has a clear and simple data model, and is therefore easy to handle.
Although it will often not be possible to create XML documents that have meaningful element names and make wise use of attributes, a notion of how elements and attributes should fit into a data model helps. It also helps to explain why some of the W3C standards are so complicated or so lengthy: it's impossible to make any assumptions about the meaning of each part of the XML file, so every possibility has to be covered.
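A parser for this subset might look like the sketch below. It reflects my reading of the two rules above, not code from the SML-DEV group: the parse function, the output shape (a dict with "properties" and "children") and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

LIST_ITEM = "_"  # the reserved element name for list members


def parse(element):
    """Turn an elements-only document into strings, maps and lists."""
    children = list(element)
    if not children:
        # A leaf carries a plain string value.
        return element.text or ""
    properties, items = {}, []
    for child in children:
        if child.tag == LIST_ITEM:
            # Reserved name: append to the list of child nodes.
            items.append(parse(child))
        elif child.tag in properties:
            # Property names must be unique between siblings.
            raise ValueError(
                f"duplicate property {child.tag!r} under {element.tag!r}"
            )
        else:
            # Any other name is a property name.
            properties[child.tag] = parse(child)
    return {"properties": properties, "children": items}


doc = ET.fromstring("<sum><operator>plus</operator><_>x</_><_>y</_></sum>")
result = parse(doc)
print(result)
```

Because every element name is either the reserved list name or a property name, the parser never has to guess what a name means.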