Table of Contents
·The XML Data Model Is Broken ·Attributes Are Useless ·Mixed Content Is No Big Deal ·Regular Expression Types ·Types vs. Fields ·Conclusion ·Stuff That Didn’t Fit Anywhere Else

The XML Data Model Is Broken

I think XML sucks very, very hard. Almost every aspect of XML serves as an example of how to do things incorrectly. XML is hard to read, hard to write, and inefficient. But the main problem is that the data model is broken.

When I say “XML data model”, I’m not talking about the low-level format. I’m talking about the model people actually use when they deal with XML, which is essentially some form of the DTD model (XML Schema is basically the same thing).

(I’m not an XML expert, so let me know when I get something wrong. I will fix/remove it.)

The essence of the problem is that XML was designed backwards. Instead of surveying the data type requirements and designing a data model and syntax, they just used HTML’s syntax and tried to shoehorn everything to fit. HTML’s problems aren’t as visible because it only had to represent a single type of data. XML was intended to be a general-purpose format for representing arbitrary data. As a result, many flaws that were hidden in HTML are much more visible in XML.

Attributes Are Useless

How would you represent the following structure in XML?

def Person = {
  name: String
  age:  Integer
}

One option would be to allow something like:

<Person
   name="John Doe"
   age="23"
/>

Another option:

<Person>
  <name>John Doe</name>
  <age>23</age>
</Person>

The fact that there are two fundamentally different ways to represent such a primitive piece of data is an indicator of how much of a mess XML is. Let’s make the structure a little more complicated:

def Person = {
  name: FullName
  age:  Integer
}
def FullName = {
  first: String
  last:  String
}

Now the attribute-based representation won’t work because attributes can only be simple strings. Something like this is not legal XML:

<Person
  name=<FullName first="Mad" last="Goose" />
  age="34"
/>

So you no longer have a choice of representations. You have to use sub-elements when your structure’s fields aren’t all simple strings.
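
A quick way to check this is to hand the candidate forms to a parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which is just my pick for illustration; the helper name parses is made up for this example.

```python
import xml.etree.ElementTree as ET

def parses(source: str) -> bool:
    """Return True if `source` is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

flat = '<Person name="John Doe" age="23"/>'
nested_attr = '<Person name=<FullName first="Mad" last="Goose"/> age="34"/>'
nested_elem = (
    '<Person><name><FullName first="Mad" last="Goose"/></name>'
    '<age>34</age></Person>'
)

flat_ok = parses(flat)                # simple strings fit in attributes
nested_attr_ok = parses(nested_attr)  # a structure inside an attribute does not parse
nested_elem_ok = parses(nested_elem)  # the same structure as sub-elements does

# Whatever you put in an attribute comes back as a plain string.
attr_types = {type(value) for value in ET.fromstring(flat).attrib.values()}
```

Even the age comes back as the string "23"; any further structure has to be layered on top by the application.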

There are two problems here. The first is that XML has two ways of representing subsidiary data elements. Document-style data (like HTML) happens to have a natural separation between text and out-of-band metadata, so the syntactic sugar is helpful. But, in general, this is not the case.

A better design would have exposed the fact that the presence of two different forms is just syntactic sugar. Sometimes it’s just more convenient to read/write attributes, but the two forms could have been made equivalent:

<!-- canonical form -->

<para>
  <font>Monospace</font>
  <content>Hello</content>
</para>

<!-- short forms -->

<para font="Monospace">Hello</para>

<para font="Monospace" content="Hello"/>

<para content="Hello">
  <font>Monospace</font>
</para>
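
To make the "just sugar" idea concrete, here is a rough sketch of a reader that treats attributes, sub-elements, and bare text as interchangeable spellings of the same fields. It assumes Python's xml.etree.ElementTree; the function name canonical_fields and the rule that bare text means a "content" field are my own conventions for this example, not anything XML defines. It only handles flat string fields.

```python
import xml.etree.ElementTree as ET

def canonical_fields(source: str) -> dict:
    """Map each field name to its value, whichever form it was written in."""
    elem = ET.fromstring(source)
    fields = dict(elem.attrib)                 # attribute form
    for child in elem:                         # sub-element form
        fields[child.tag] = (child.text or "").strip()
    text = (elem.text or "").strip()
    if text:                                   # bare text is shorthand for "content"
        fields["content"] = text
    return fields

forms = [
    "<para><font>Monospace</font><content>Hello</content></para>",
    '<para font="Monospace">Hello</para>',
    '<para font="Monospace" content="Hello"/>',
    '<para content="Hello"><font>Monospace</font></para>',
]
results = [canonical_fields(form) for form in forms]
```

All four spellings come out as the same dictionary, which is the equivalence the format could have guaranteed by itself instead of leaving it to each application.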

The second problem is that attributes end up being totally useless. If they can’t hold nested structures, then you can’t depend on them in general to be able to deal with your data. On a case-by-case basis, you might be able to decide that certain pieces of data can be put in attributes, but that decision has nothing to do with what the data means and everything to do with whether it can be represented as a simple string.

This is the single biggest flaw with XML and it causes many problems.

The people designing the Water programming language realized this shortcoming and were forced to add the capability, calling the new format "ConciseXML". (I think the main idea behind Water, a programming language written in XML, is misguided. If you want to make it easy to use XML, you only need to handle the data model well. You don’t have to use XML’s syntax, which is utter crap for most things and even worse for a programming language. Surprisingly, someone else came up with the same idea but an even crappier implementation, Superx++. Look at the code samples to see what I mean.)

Mixed Content Is No Big Deal

“Mixed content” is the term used to describe the idea of intermixing text and XML tags. Some people see this as XML’s killer feature – something that validates its confusing design. Take the following fragment, for example:

<p>You are <em>not</em> getting away with this!</p>

That’s basically shorthand for:

<p>
  <text value="You are "/>
  <em>
    <text value="not"/>
  </em>
  <text value=" getting away with this!"/>
</p>

“Mixed content” is just the result of two pieces of syntactic shorthand. If you’ve programmed with XML libraries, you may have noticed these.

Free text is implicitly put in a “text” node.
Sub-nodes are implicitly put in a list.

That’s all. Mixed content isn’t some powerful beast that requires XML’s convoluted data model.
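
You can see both pieces of shorthand by undoing them in code. This is a small sketch assuming Python's xml.etree.ElementTree, which stores free text in the text and tail attributes of elements; the function name explicit_children and the ("text", value) tuples are made up for illustration.

```python
import xml.etree.ElementTree as ET

def explicit_children(elem):
    """Expand implicit text into ("text", value) entries, in document order."""
    nodes = []
    if elem.text:                              # text before the first child
        nodes.append(("text", elem.text))
    for child in elem:
        nodes.append((child.tag, explicit_children(child)))
        if child.tail:                         # text that follows this child
            nodes.append(("text", child.tail))
    return nodes

p = ET.fromstring("<p>You are <em>not</em> getting away with this!</p>")
nodes = explicit_children(p)
```

The result is an ordinary list holding a text node, an em node (which itself holds one text node), and another text node. Nothing about it needs a special data model.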

An artifact of XML’s treatment of free text is that adjacent nodes can’t be kept separate. If you had:

<hobbies>
  Sewing
  Cup Stacking
  Parking
</hobbies>

You might intend to describe something like:

<hobbies>
  <text value="Sewing"/>
  <text value="Cup Stacking"/>
  <text value="Parking"/>
</hobbies>

Since text nodes are implicit, you can’t directly control where they split. The result is that you’ll end up with something like:

<hobbies>
  <text value="Sewing\nCup Stacking\nParking"/>
</hobbies>
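
Here is what that looks like from the programming side, again as a sketch assuming Python's xml.etree.ElementTree. The line-splitting step at the end is a convention I made up; nothing in the document itself says the hobbies are separated by newlines.

```python
import xml.etree.ElementTree as ET

hobbies = ET.fromstring(
    "<hobbies>\n  Sewing\n  Cup Stacking\n  Parking\n</hobbies>"
)

raw = hobbies.text                 # one string, whitespace and all
child_count = len(list(hobbies))   # no child nodes to iterate over

# Recovering the items means inventing a splitting rule outside the data model.
items = [line.strip() for line in raw.splitlines() if line.strip()]
```

The parser hands back a single blob of text, and the three separate values only reappear because the application guessed how to split it.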

Regular Expression Types

There were many attempts to create a type system for XML: DTD, XML Schema, Relax NG, and probably others. Though XML is fundamentally broken, a good type system could have made it usable. Unfortunately, XML’s different type systems are all broken as well.

The original attempt at typing XML was DTD. One of its interesting characteristics was its use of regular expressions to describe structure. This is a neat trick and I definitely don’t blame the person who first came up with this idea; it’s quite creative. However, I think that it is an experiment that failed and should have been replaced by now.

Even newer type systems (XML Schema, Relax NG) still use regular expressions. For some reason, it has become accepted that regular expressions are the correct way to type XML data. This is bizarre considering the fact that most current programming languages do much better with much simpler type systems.

Record Fields Are Not Ordered

Take this structure, for example:

type Boat = {
  Name:  String
  Color: String
  Class: String
}

To represent this structure with DTD, you might write:

<!ELEMENT Boat (Name,Color,Class)>
<!ELEMENT Name  (#PCDATA)>
<!ELEMENT Color (#PCDATA)>
<!ELEMENT Class (#PCDATA)>

The problem with this DTD is that it requires that the fields appear in a specific order. However, a record type’s fields are not inherently ordered. The DTD is adding additional, unnecessary constraints on your data type.

There is a way around this (sort of):

<!ELEMENT Boat ((Name,Color,Class)|(Name,Class,Color)
               |(Color,Name,Class)|(Color,Class,Name)
               |(Class,Name,Color)|(Class,Color,Name))>

I can’t see any way to justify this verbosity. This isn’t like XML’s redundant end tags where, at worst, the document is twice the size it needs to be. Here, the size increase is worse than exponential: a record with n fields needs n! alternatives (6 for the 3 fields above). WTF!
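
To get a feel for the growth, here is a small sketch that generates the "any order" declaration mechanically. It is plain Python with no XML library involved; the function name unordered_content_model is made up for this example.

```python
from itertools import permutations
from math import factorial

def unordered_content_model(name, fields):
    """Build a DTD declaration that accepts `fields` in any order."""
    groups = ["(" + ",".join(order) + ")" for order in permutations(fields)]
    return "<!ELEMENT %s (%s)>" % (name, "|".join(groups))

boat = unordered_content_model("Boat", ["Name", "Color", "Class"])

# Number of alternatives needed for records with 3, 5 and 8 fields.
group_counts = {n: factorial(n) for n in (3, 5, 8)}
```

Three fields need 6 alternatives, five need 120, and eight need 40320. As far as I know, strict validators may also insist on a deterministic content model, which would mean factoring out common prefixes; that changes the shape of the declaration but not the blow-up.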

The XML Schema and Relax NG schema formats do have special cases to handle this situation better than DTD does. Unfortunately, they’re both still rooted in regular expression-based types, which is still a bad idea. They’ve only placed a bandage over the most visible problem.

I believe DTD allows attributes to appear in any order. However, due to previously stated reasons, attributes are not useful in general.

Programming Language Support

Part of the reason XML is such a pain to program with is that regular expression types don’t map well to current programming language types and control structures. There have been interesting efforts to create programming languages with direct support for “regular types” (XDuce, Xtatic (same guys), C-omega (from Microsoft)). In fact, direct support for XML seems to have come into vogue recently.

Now, the people working on this stuff are much smarter than I am. They have come up with new and interesting techniques for dealing with data that is typed by regular expressions. However, I think the original goal (making it easier to program with XML) is misguided. Existing programming languages have developed much better ways to deal with data. Why bend over backwards to accommodate a broken data model when you already have a better one?

Types vs. Fields

In most programming languages, it’s easy to figure out when you’re talking about a type and when you’re talking about a field in a record. XML blurs that distinction:

<Person>
  <name>Douglas</name>
  <age>42</age>
</Person>

<PetShop>
  <Cat>Garfield</Cat>
  <Dog>Odie</Dog>
</PetShop>

On the surface, the two fragments are structurally identical. Semantically, however, they aren’t. The first is a “Person” object with two fields. The second is a “PetShop” object whose only field is a list of other objects. This ambiguity causes various problems.
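
A sketch of the ambiguity, assuming Python's xml.etree.ElementTree and dataclasses. The class definitions are my own reading of the two fragments, and the helper shape is invented for the example. Once the tag names are erased the two documents look the same, yet loading them takes two completely different rules.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

def shape(elem):
    """Nesting structure of an element with every tag name erased."""
    return [shape(child) for child in elem]

person_xml = ET.fromstring("<Person><name>Douglas</name><age>42</age></Person>")
shop_xml = ET.fromstring("<PetShop><Cat>Garfield</Cat><Dog>Odie</Dog></PetShop>")
same_shape = shape(person_xml) == shape(shop_xml)

@dataclass
class Person:
    name: str
    age: int

@dataclass
class Cat:
    name: str

@dataclass
class Dog:
    name: str

@dataclass
class PetShop:
    pets: list

# In <Person>, each child tag is a field name.
douglas = Person(name=person_xml.findtext("name"),
                 age=int(person_xml.findtext("age")))

# In <PetShop>, each child tag is the type of a list item.
constructors = {"Cat": Cat, "Dog": Dog}
pet_shop = PetShop(pets=[constructors[c.tag](c.text) for c in shop_xml])
```

Nothing in the XML says which rule applies to which element; that knowledge lives only in the loading code.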

Redundant Container Tags

XML sucks at something as simple as reusing a common type in two places. Take an address book data type, for example:

def PhoneNumber = {
  AreaCode: String
  Number: String
}
def Contact = {
  Name: String
  HomePhone: PhoneNumber
  WorkPhone: PhoneNumber
}

<Contact>
  <Name>Wolverine</Name>
  <HomePhone>
    <PhoneNumber>
      <AreaCode>123</AreaCode>
      <Number>555-1234</Number>
    </PhoneNumber>
  </HomePhone>
  <WorkPhone>
    <PhoneNumber>
      <AreaCode>123</AreaCode>
      <Number>555-4321</Number>
    </PhoneNumber>
  </WorkPhone>
</Contact>

In the data above, the extra <PhoneNumber> tags are totally unnecessary. In most programming languages, you don’t have to repeat yourself like this. The problem is that even though HomePhone is a field and PhoneNumber is a data type, XML can’t tell the difference.

A decent XML typing system could make this work correctly, but DTD and XML Schema do not.
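
As a rough idea of what "working correctly" could look like, here is a sketch of a loader that uses each field's declared type to decide how to read it, so the wrapper tags aren't needed. It assumes Python's dataclasses and xml.etree.ElementTree; the load function is hypothetical and only handles strings and nested records.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields, is_dataclass

@dataclass
class PhoneNumber:
    AreaCode: str
    Number: str

@dataclass
class Contact:
    Name: str
    HomePhone: PhoneNumber
    WorkPhone: PhoneNumber

def load(cls, elem):
    """Build `cls` from `elem`, reading each child according to its field type."""
    values = {}
    for f in fields(cls):
        child = elem.find(f.name)
        if is_dataclass(f.type):
            values[f.name] = load(f.type, child)   # nested record
        else:
            values[f.name] = f.type(child.text)    # simple value
    return cls(**values)

doc = ET.fromstring(
    "<Contact>"
    "<Name>Wolverine</Name>"
    "<HomePhone><AreaCode>123</AreaCode><Number>555-1234</Number></HomePhone>"
    "<WorkPhone><AreaCode>123</AreaCode><Number>555-4321</Number></WorkPhone>"
    "</Contact>"
)
contact = load(Contact, doc)
```

The tag says which field it is, the schema says which type that field has, and nobody has to write PhoneNumber in the data.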

This problem appears in HTML as well. The <body> and <head> tags are fields within <html> while <b> and <i> are objects within <p>.

Field Names Should Be Context-Sensitive

Though you can get by with context-insensitive type names, field names must be context-sensitive. A side-effect of XML’s confusion between field names and record names is that neither is context sensitive.

Let’s say you had the following structure:

def Person = {
  Name: {
    First: String
    Last:  String
  }
  Pets: Pet[]
}
def Pet = {
  Name:    String
  Species: String
}

<Person>
  <Name>
    <First>Goose</First>
    <Last>Man</Last>
  </Name>
  <Pets>
    <Pet>
      <Name>Scrappy Doo</Name>
      <Species>Dog</Species>
    </Pet>
  </Pets>
</Person>

The problem with this fragment is that <Name> means two different things. Most programming languages are perfectly fine with this because “name” is a field in a record. To get this working with DTD, we’d have to do something like rename the fields to “Person-Name” and “Pet-Name”.

This wouldn’t be a problem if XML attributes could hold nested structures; even DTD treats attributes in a context-sensitive way. It’s really too bad that attributes are so useless.
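
For comparison, here is the same structure as a sketch in Python dataclasses, next to a toy model of DTD's single global namespace for element declarations. The declare helper and the dtd dictionary are made up for illustration; the one real rule they imitate is that a DTD may declare each element type only once.

```python
from dataclasses import dataclass, fields
from typing import List

@dataclass
class FullName:
    First: str
    Last: str

@dataclass
class Pet:
    Name: str            # a plain string in this record
    Species: str

@dataclass
class Person:
    Name: FullName       # a nested record in this one
    Pets: List[Pet]

def field_type(cls, name):
    """Look up a field's type within one record type."""
    return {f.name: f.type for f in fields(cls)}[name]

person_name_type = field_type(Person, "Name")
pet_name_type = field_type(Pet, "Name")

def declare(dtd, element, content_model):
    """Toy DTD: one declaration per element name, shared by every parent."""
    if element in dtd:
        raise ValueError("element %r is already declared" % element)
    dtd[element] = content_model

dtd = {}
declare(dtd, "Name", "(First,Last)")       # what Person needs
try:
    declare(dtd, "Name", "(#PCDATA)")      # what Pet needs
    conflict = False
except ValueError:
    conflict = True
```

The two record types each get their own Name without any fuss, while the global table has room for only one.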

Conclusion

XML is terrible at handling many types of data. The worst part is that most existing programming languages use type systems that do a much better job at handling the same data. In terms of data representation, XML is a step backwards.

Though some of the problems can be mitigated by a decent type system…
* Some of the problems are inherent in the XML serialization format itself.
* XML naturally lends itself towards a bad data model. Any type system that tries to keep with the spirit of XML (as opposed to beating it into submission) is doomed to fail.

I think that a format founded on algebraic data types would be a better idea. Here’s why.

Stuff That Didn’t Fit Anywhere Else

Relax NG

Relax NG is the latest “big” XML type definition effort.

Its “interleave” tag solves DTD’s problem with the factorial size increase when you don’t care about the ordering of child elements (see “Record Fields Are Not Ordered”, above).

Though Relax NG doesn’t distinguish between fields and types, it does, according to the documentation, treat elements in a context-sensitive way, fixing one of the problems (see “Field Names Should Be Context-Sensitive”, above).

And, believe it or not, they realized that writing data types is common enough to deserve special syntax. It’s a sign that XML people are actually making some rational decisions. As long as the data model matches up you can load it and use XSLT (or whatever) to mangle it to your heart’s content. That’s an extra parser that needs to be maintained but you’ll be saving a disproportionately large amount of time for schema writers.

Though Relax NG improves over DTD and XML Schema in many ways, it has a fatal flaw. In XML Schema, you basically define a data type. In Relax NG, you define a pattern to match XML documents against. This is a step backwards; instead of describing the high-level model of the data, you’re dealing with parsing details.

I can see how the Relax NG approach may seem attractive; not having to worry about building an actual data structure out of an XML document lets you do weird things (ambiguous grammars, interleaving ordered groups, etc.). At first, these may seem to indicate greater “power”, “flexibility”, or “expressiveness”. And people familiar with the reference implementation end up gawking so much at the elegance of the underlying theory that they don’t realize what they’ve lost in exchange.

Lisp S-Expressions

We XML-haters fall into different categories. One category consists of people who claim that XML is just a poor copy of Lisp S-Expressions. I do not fall into that category.

First of all, I don’t think Lisp S-Expressions are any more appropriate as a general-purpose data type. The structure is too loose, missing out on the opportunity to classify data more accurately. However, the extreme simplicity of S-Expressions makes this deficiency somewhat forgivable.

Though XML provides some extra features to more accurately describe your data, it is much more complicated while not being more powerful than S-Expressions in general.

Apple Property Lists (PList)

This format is like Lisp S-Expressions with enough additional structure to make them usable. One disadvantage is that there aren’t any typing mechanisms to enforce user-defined structure (like DTD for XML). At least this guarantees that they don’t have a broken type system.

Apple has decent binary and human-readable representation formats for PLists. More recently, they added an XML serialization format called “PList XML”. PList XML is pointless.

YAML

YAML is a lot like Apple PLists. I really like the indentation-sensitive syntax. I don’t think YAML has any “official” typing mechanism, but one particular mechanism is called Kwalify and it looks decent. I still think YAML has some fundamental problems which end up tainting Kwalify as well, but both seem to be much more sensible than XML.

One potential technical deficiency YAML has when compared to XML is that it can’t represent document-style data well. YAML is way better at representing non-document-style data, but that’s not enough to dethrone XML. YAML needs to augment its current syntax with constructs that can handle document-style data.