Recent changes
Table of Contents

Data Representation

This page describes a way to represent data that I think is particularly good. It’s not perfect, but it’s a lot better than many of the other ones I’ve seen. Since it’s easier to talk about something if it has a name, let’s call the scheme “CKS”.

The goal is for CKS to be used as a model for easily defining serializable data types. I think this is what XML’s creators set out to do. Unfortunately, XML is a bad model and there are lots of data types for which XML is a horrible format.

There’s nothing really new here. The CKS type model is built on algebraic data types, which been around for a long time and are the basis for data types in functional programming languages like Haskell and ML.

Primitives

Any data model needs primitive types. The basic primitive type in CKS is called “void” (and written “()”). There is only one value that is of type void and that value, also called “void” (and also written “()”).

The void type contains/provides no information at all because it must always have the void value. It should become clear later on why this type is useful.

Fundamentally, the void type is the only primitive type you ever need. All other values can be defined by creating compound types. But, for practical reasons, it’s useful to have types like “String” and “Integer”. Values for these types are written in the usual way:

Aliases

For convenience, you can create type aliases by writing: “def AliasName = Type”. For example:

def Nothing = ()
def Text = String

The first line creates a new type “Nothing” that is equivalent to the void type.

Compounds

To do anything useful, we need more complex types. A compound type is type that is composed of members. Each member itself has a type and a tag. The tag is just unique a name for the member within the compound type.

Name: String
Age: Integer
X: ()

Listed above are three members. The tag is on the left of the colon and the type is on the right. The first member’s tag is “Name” and its type is “String”.

In an compound type, the order of the members doesn’t matter. The only requirement is that the tags be unique within the group.

Entries can be combined in different ways. When you want a compound value that consists of all of the members, you create a record type (a.k.a. a product type). When you want a compound value that consists of exactly one of the members (i.e. a “sum”) you create a variant (a.k.a. a sum type).

Records

A record is a set of members in which all the members are active. Each member of a record is a field. This is basically a “struct” in C. Most languages support this in some form.

For CKS, we’ll write record types by surrounding members with braces:

def Person = {
  Name: String
  Age: Integer
}

Record values are also surrounded by braces. Each field value is written in the form: “Tag = Value”.

{
  Name = "Scooby Doo"
  Age = 42
}

Some languages allow data aggregation through “tuples”. Tuples are records where the fields’ tags are implicit (based on the field’s position in the tuple). For example, the tuple “(String,Integer,Integer)” is basically equivalent to the record:

{
  _1: String
  _2: Integer
  _3: Integer
}

Variants

A variant is a set of members in which exactly one of the members is active. Each member of a variant is an option. This is sometimes called a “disjoint union” or a “tagged union”.

For CKS, we’ll write variant types by surrounding the options with angle brackets:

def PackageTracker = <
  BikeMessengerName: String
  UpsNumber: Integer
  NoPackage: ()
>

Variant values consist of the active option’s tag followed by the active option’s value. Here are three separate variant values:

BikeMessengerName "Bob"
UpsNumber 12345
NoPackage ()

The void type (“()”) is useful here because, even though it contains no information, it’s exactly what we want. The NoPackage option doesn’t need to carry any additional information.

For convenience, the “()” type can be omitted. If no type is specified, void is assumed. The “()” value can be omitted as well. If the type of a value is void, then the “()” is implied (after all, what else could it be?).

def PackageTracker = <
  BikeMessengerName: String
  UpsNumber: Integer
  NoPackage
>

BikeMessengerName "Bob"
UpsNumber 12345
NoPackage

The C language (and many of its derivatives) have a feature called “enumerations”, which are just a special case of variants; the type of each option is always “()”. For example, the following C enum:

enum Direction {
  Up,
  Down,
  Left,
  Right,
};

is similar to the following CKS variant:

def Direction = <
  Up: ()
  Down: ()
  Left: ()
  Right: ()
>

Example

To give you a feel for what you can do by combining these two compound types, here is an example address book data type.

def AddressBookEntry = {
  Category: < Personal, Business >

  Name: {
    First: String
    Last: String
  }

  DaytimeContact: Contact
  EveningContact: Contact
}

def Contact = <
  Phone: String
  InstantMessenger: <
    AOL: String
    Yahoo: String
    ICQ: Integer
  >
>

Example value:

{
  Category = Personal
  Name = {
    First = "Scooby"
    Last = "Doo"
  }
  DaytimeContact = Phone "(123) 456-7890"
  EveningContact = InstantMessenger Yahoo "SwtK9PrincessLOLZ"
}

Parametrized Types

A parameterized type is a type definition that has placeholders in it. For example:

def Boxer(N) = {
  Name: N
  Weight: Integer
}

The type alias “Boxer” requires a type be passed in as a parameter before it becomes a concrete type.

def NormalName = {
  First: String
  Last: String
}
def RobotName = {
  Model: String
  SerialNumber: Integer
}
def FloosieName = String

The following two type definitions are roughly equivalent:

def NormalBoxer = Boxer(NormalName)

def NormalBoxer = {
  Name: {
    First: String
    Last: String
  }
  Weight: Integer
}

Deriving The Essentials

That’s all there is to core CKS. The rest can be derived in terms of records, variants, and “()”.

Booleans

def Boolean = <
  True: ()
  False: ()
>

Since the “()” can be omitted, this can also be written:

def Boolean = <
  True
  False
>

Optional Values

It is common for field values to be optional. For example, let’s say you’re modelling a web form where entering your birthdate is optional:

def FormData = {
  Name: String
  Birthdate: <
		Birthdate: String
		NotGiven
	>
}

Example values:

{
  Name = "Scooby Doo"
  Birthdate = Birthdate "13 Sep 1969"
}
{
  Name = "God"
  Birthdate = NotGiven
}

The notion of an optional value is common enough that we should probably just create a generic type that can turn any type into an optional type.

def Optional(T) = <
  Present: T
  Missing
>

Now the “FormData” type could be defined:

def FormData = {
  Name: String
  Birthdate: Optional(String)
}

The example value would now be:

{
  Name = "Scooby Doo"
  Birthdate = Present "13 Sep 1969"
}
{
  Name = "God"
  Birthdate = Missing
}

Lists

Lists are very common and should be easy to use. However, it’s important to notice that we don’t absolutely need special support for lists because they can be built with the existing type system. Most functional programming languages do this by chaining together linked list nodes. For example, the datatype for a list of Strings would be:

def StringList = <
  Link: {
    Current: String
    Next: StringList
  }
  End: ()
>

StringList” is the type of the list. An empty list would simply be:

End

A list with one element would be:

{
  Current = "Scooby Doo"
  Next = End
}

A list with three elements would be:

{
  Current = "First"
  Next = Link {
    Current = "Second"
    Next = Link {
      Current = "Last"
      Next = End
    }
  }
}

The generic version:

def List(T) = <
  Link: {
    Current: T
    Next: List(T)
  }
  End
>

def StringList = List(String)

No, it’s not pretty. But it can be done.

Syntatic Shorthand

Though a miniamilistic core is nice to have, some shorthand syntax for common patterns would be helpful.

Optional Values

Our optional value generic type isn’t too inconvient, but some syntactic sugar can help. The question mark would make for an intuitive indicator of optionality.

def FormData = {
  Name: String
  Birthdate: String?
}

{
  Name = "Scooby Doo"
  Birthdate = "13 Sep 1969"
}
{
  Name = "Scooby Doo"
  Birthdate = ?      # Can explicitly say "no value" by using "?"
}
{
  Name = "Scooby Doo"
  # Can implicitly say "no value" by omitting the field
}

Lists

Creating list values is prohibitively inconvienient with our existing definition. A more convenient syntax is absolutely necessary. Haskell and ML use the following syntax for list values:

[]     == End ()
[1]    == { Current = 1, Next = End () }
[1, 2] == { Current = 1, Next = Link { Current = 2, Next = End () } }

List types aren’t difficult to write (the type for a list of “String” values is just “List(String)”), but Haskell uses a type syntax that is similar to the list value syntax:

[String]   == List(String)
[[String]] == List(List(String))

XML-ish Syntax

XML’s general syntax (taken from HTML) is convenient for document-style data. For example:

<p>What are <em>you</em> doing? <a href="whatiamdoing.html">Answer</a>.</p>

Translated into CKS’s standard syntax, it would look something like:

p {
  content = [
    text "What are ",
    em { content = [text "you"] },
    text " doing? ",
    a {
      href = "whatamidoing.html"
      content = [text "Answer"]
    },
    text "."
  ]
}

If you’ve done any XML programming, you’ll notice how the CKS version looks like the data structure you’d get back from a DOM library. The convenience of XML’s syntax comes from two simple optimizations:

With those two optimizations in mind, we can easily create a parser that knows how to parse XML-ish syntax and construct a CKS data value.

Omit Unnecessary Tag Names

def Person = {
  Names: [String]
  Age: Integer
}

The “.” means that the tag name can be inferred from the context.

<Person>
  <Names>
    <.>Scooby</>
    <.>Dooby</>
    <.>Doo</>
  </Names>
  <Age>45</Age>
</Person>

{
  Names = ["Scooby", "Dooby", "Doo"]
  Age = 45
}

Attributes And Sub-Elements Are Interchangeable

def Person = {
  Name: FullName
  Age: Integer
}
def FullName = {
  First: String
  Last: String
}

The following fragments are all equivalent once parsed into a CKS data type:

<Person>
  <Name>
    <First>Scooby</First>
    <Last>Doo</Last>
  </Name>
  <Age>45</Age>
</Person>

<Person>
  <Name First="Scooby" Last="Doo"/>
  <Age>45</Age>
</Person>

<Person Age=45 Name=<FullName First="Scooby" Last="Doo"/>>

<Person Age=45 Name=<. First="Scooby" Last="Doo"/>>

{
  Name = {
    First = "Scooby"
    Last = "Doo"
  }
  Age = 45
}

Implementation

The core is done. You can get it from http://cakoose.com/tools/cks/.

Currently, you tell it to parse a file and you get back and AST (kind of like existing XML DOM libraries). It’s missing the ability to automatically bind to objects in the host language (like JAXB, Castor, etc.).

data/data_representation.txt Last modified: 01.10.2008 00:37
Driven by DokuWiki