If you find a mistake, or something is unclear, please email per@bothner.com so I can fix the text.

The XQuery Data Model and Types

by Per Bothner

We have earlier looked at the values (nodes, primitives, and sequences) that XQuery works with. In this article we will look more deeply into the XQuery/XPath data model and type system. On the way we will touch on a fair bit of background material, including XML Schemas and XML infosets.

XML Infosets

The XQuery data model is based on the XML Information Set standard (W3C Recommendation 24 October 2001, http://www.w3.org/TR/xml-infoset). It rather abstractly defines the information content of an XML document as a document item that contains nested element items, which in turn contain namespace, attribute, character and other items. This is a conceptual standard: It does not define any file formats or programming interfaces, but rather it defines the interpretation of an XML file. It is intended to be useful for defining other XML-related standards, including the XQuery/XPath data model.

XML files that are different at the character level but that have the same information set or infoset are for most practical purposes equivalent. For example:

<a   b='upcase'  ><![CDATA[Hello!]]></a>

and

<a b="upcase" >Hello&#33;</a>

have the same information sets.

The "Canonical XML" recommendation is a related standard in that it specifies a unique ("canonical") way to convert an XML infoset (back) into an XML document. Two XML documents that are "logically equivalent" (i.e. have the same infoset) translate to the same canonical XML representation. The Canonical XML for the above example is:

<a b="upcase">Hello!</a>
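
As a rough illustration in Python (not XQuery), the two documents above can be compared by canonicalizing each one. This sketch assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize function implements the newer Canonical XML 2.0 rather than the original recommendation; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The two character-level variants from the text.
doc1 = "<a   b='upcase'  ><![CDATA[Hello!]]></a>"
doc2 = '<a b="upcase" >Hello&#33;</a>'

# canonicalize() parses each document and re-serializes it in canonical form.
canon1 = ET.canonicalize(xml_data=doc1)
canon2 = ET.canonicalize(xml_data=doc2)

# Equal canonical forms indicate the documents carry the same information.
same_infoset = (canon1 == canon2)
print(canon1)
print(same_infoset)
```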

A parsed XML file results in an infoset, but there can also be synthetic infosets that are constructed from other sources, such as a database, or created by a program that manipulates a DOM (Document Object Model). A DOM is a popular data structure API used to encode and manipulate XML data - i.e. infosets.

The XQuery language also allows you to create infoset items, using element constructor expressions or pre-defined functions.

Node Values and Types

Node values represent the parts of an XML document, or more generally an XML infoset. Nodes are also used to represent document fragments - i.e. stand-alone nodes that are not part of a document (for example, those that might be generated by an element constructor expression).

These are the kinds of nodes, most of which are as you would expect:

Document nodes represent a complete XML document.
Element nodes represent XML elements.
Attribute nodes represent the attributes of an element. Note that namespace definitions are represented by namespace nodes instead.
Namespace nodes represent the in-scope namespaces of an element. (You cannot actually "get at" a namespace node in XQuery, though you can get at them in XPath using the namespace:: axis, which has been deprecated in XPath 2.0.)
Processing instruction nodes represent embedded XML processing instructions.
Comment nodes represent XML comments.
Text nodes represent character data. Note that an infoset consists of single-character items, but in the node representation multiple contiguous character items are represented in a single text node.

The XQuery 1.0 and XPath 2.0 Data Model specification (http://www.w3.org/TR/query-datamodel/) goes into details about the different kinds of nodes. It also defines a number of functions on nodes, using the prefix dm. For example, the function dm:node-kind takes a Node and returns a string value that represents the node's kind, one of "document", "element", "attribute", "text", "namespace", "processing-instruction", or "comment". Note that these functions are only to explain the data model: You cannot call them from an XQuery program, and the prefix dm isn't actually bound to any namespace. However, in some cases there may be a function in the Functions and Operators document with the same name and behavior. Those are available to your XQuery programs. For example there is a fn:node-kind function which you can use, and which is defined to return the same result as dm:node-kind.
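
The idea behind dm:node-kind can be sketched in Python over a DOM tree. This is only an analogy: node_kind is a hypothetical helper, not part of any XQuery implementation, and the DOM has no namespace nodes (it represents namespace declarations as attributes), so that kind is omitted.

```python
from xml.dom import minidom, Node

# Map DOM node-type constants to the kind names listed in the text.
KIND_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
    Node.COMMENT_NODE: "comment",
}

def node_kind(node):
    """Hypothetical stand-in for dm:node-kind over a DOM node."""
    return KIND_NAMES[node.nodeType]

doc = minidom.parseString('<r id="x"><!--note--><?target data?>text</r>')
root = doc.documentElement
kinds = [node_kind(doc), node_kind(root), node_kind(root.getAttributeNode("id"))]
kinds += [node_kind(child) for child in root.childNodes]
print(kinds)
```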

Node identity and hierarchy

Nodes in XQuery are immutable, which means you cannot change any part of a node once it has been created. This makes sense, since XQuery is a pure side-effect-free expression language. However, nodes do have identity: two nodes that were created using different expressions are distinct nodes, even if they contain the same data. You can compare the contents of two nodes using the standard function fn:deep-equal. The following examples are all true:

fn:deep-equal(<a>test</a>, <a>test</a>)
<a>test</a> isnot <a>test</a>
let $x := <a><b></b></a> return $x/b is $x/*

Because nodes have identity, you can talk about making a copy of a node, because you can distinguish the original from the copy. However, atomic values do not have identity: There is no way to copy the string "xyzzy" because there is no way to distinguish the copy from the original. That is the difference between a string atomic value and a text node, which you can copy.
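
The distinction between identity and content can be mimicked in Python, where the is operator plays a role similar to XQuery's node identity comparison. In this sketch deep_equal is a simplified, hypothetical helper, and ElementTree elements (unlike XQuery nodes) are mutable, so the analogy is limited to identity.

```python
import copy
import xml.etree.ElementTree as ET

def deep_equal(x, y):
    """Simplified comparison of two elements by content, not identity."""
    return (x.tag == y.tag and x.attrib == y.attrib
            and (x.text or "") == (y.text or "")
            and (x.tail or "") == (y.tail or "")
            and len(x) == len(y)
            and all(deep_equal(cx, cy) for cx, cy in zip(x, y)))

first = ET.fromstring("<a>test</a>")
second = ET.fromstring("<a>test</a>")
duplicate = copy.deepcopy(first)

same_content = deep_equal(first, second)      # contents match
same_node = first is second                   # but they are distinct nodes
copy_is_distinct = duplicate is not first     # a copy is distinguishable

# Strings are compared by value only; there is nothing else to distinguish.
strings_equal = "xyzzy" == "".join(["xy", "zzy"])
```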

A node may have children, which are the nodes below it in the tree hierarchy, but not including attribute or namespace nodes. For example, the children of an element node are the text nodes and nested elements (and occasionally other nodes) that are its contents. The function dm:children takes a node and returns the sequence of its children. (This function only exists in the data model; to get the children of a node N in XQuery use the expression N/node().) Only document and element nodes can have children, so dm:children returns the empty sequence for other node types.

For each node X that is a child of Y, the node X has a parent property that is the node Y. Using the data model, you can get at Y using the dm:parent function; in an XQuery program you have to use the expression X/parent::node(). These properties have some surprising consequences. Because nodes are immutable, you have to specify the children of an element or document when you create it. However, those children have to have their parent property set to the new node - but you can't modify them, as they are immutable. This chicken-and-egg problem is solved by creating new copies of the children, with the parent property of the new nodes set to the new parent. Any children of the children also have to be copied.
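
The copy-on-construction rule can be sketched with a toy Python class. Elem and its _copy_under method are hypothetical names invented for this illustration; the class treats its fields as fixed after construction by convention only, since Python does not enforce immutability here.

```python
class Elem:
    """Toy element node: children and parent are fixed at construction."""

    def __init__(self, name, children=(), _parent=None):
        self.name = name
        self.parent = _parent
        # Copy each child so the copy's parent can point at this new node;
        # the originals are left untouched.
        self.children = tuple(child._copy_under(self) for child in children)

    def _copy_under(self, new_parent):
        # Recursively copies grandchildren too, via the constructor.
        return Elem(self.name, self.children, _parent=new_parent)

b = Elem("b")
c = Elem("c")
a = Elem("a", [b, c])
outer = Elem("outer", [a])
```

Here a.children[0] is a fresh copy of b whose parent is a, while b itself still has no parent.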

Note, however, that this copying of nodes is part of the specification, but an implementation is free to optimize away the copying if it doesn't change the result. For example, consider the following expression:

<a>Some <b/>{<c>text</c>}</a>

The specification says that <b> and <c> nodes are created, and then copied when <a> is created. But since there is no way to access the old <b> and <c> nodes, an implementation is free to just re-use the old nodes without copying them, or it can create them in-place at the same time it creates <a>. This is an example of the important difference between specification and (valid) implementation. The lack of side effects in XQuery gives the implementation extra flexibility in choosing how to implement things. A possible disadvantage is that it makes it hard to estimate how much work is done for an XQuery program, unless you are very familiar with your implementation. On the other hand, you usually don't need to know.

More generally, an implementation is free to represent nodes in any way compatible with the specification. An obvious choice is to use the standard Node type specified in the W3C's Document Object Model (DOM) (http://www.w3.org/DOM). However, though DOM is a flexible and convenient API, it is quite space-inefficient. As an example of an alternative representation, the Qexo implementation (http://www.gnu.org/software/qexo/) uses a single TreeList for an entire document. The TreeList contains an internal array, and node objects are identified by indexes into that array. (The Apache Xalan XSLT processor uses a similar Document Array Model representation.) In fact, an implementation may in some cases not create actual node objects at all. Consider that the ultimate result of evaluating an XQuery expression is often written out to a file as a new XML document. In that case the XQuery processor can write out the nodes on-the-fly directly to the output file, without ever creating any nodes. More generally, the XQuery processor can "write" the output to a SAX DocumentHandler or a similar event-driven interface.
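
Writing a result as a stream of events, without building any node objects, might look like the following Python sketch. It uses the standard library's XMLGenerator, which implements the SAX ContentHandler interface (the successor to DocumentHandler); emit_result is a hypothetical function standing in for what a query processor would do internally.

```python
import io
from xml.sax.saxutils import XMLGenerator

def emit_result(handler):
    """Send <a>Some <b/><c>text</c></a> as events; no node objects are built."""
    handler.startDocument()
    handler.startElement("a", {})
    handler.characters("Some ")
    handler.startElement("b", {})
    handler.endElement("b")
    handler.startElement("c", {})
    handler.characters("text")
    handler.endElement("c")
    handler.endElement("a")
    handler.endDocument()

out = io.StringIO()
emit_result(XMLGenerator(out, encoding="utf-8"))
serialized = out.getvalue()
print(serialized)
```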

String value and typed value of nodes

Sometimes it is useful to take a node, and convert it to a string value. The function fn:string does that. The string value of a text node is the characters in the node. The string value of an element or document node is the concatenation of the text node descendants of the node in document order. The string value of an attribute node is the attribute value.

There is also the typed value of an element, attribute, or text node, which you can extract using the fn:data function. This is the value of a node as a sequence of atomic values, as the result of Schema validation. If an element node has a complex type, then the typed value is undefined.
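
As an informal Python analogy, the string value of an element corresponds to joining all of its descendant text in document order. The typed value shown here is only a guess at what validation would produce, assuming the element had been declared as a list of integers; string_value is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def string_value(element):
    """Concatenate all descendant text in document order."""
    return "".join(element.itertext())

para = ET.fromstring("<p>Hello <b>big</b> world</p>")
throws = ET.fromstring("<throws>5 2 2 2 1</throws>")

text = string_value(para)

# Hypothetical typed value, assuming validation declared a list of integers.
typed = [int(tok) for tok in string_value(throws).split()]
```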

Types and Type Systems

The XQuery and XPath languages are typed expression (functional) languages. This means that programs are made from expressions (which may in turn contain sub-expressions), and that evaluating an expression results in a value, which has a type.

Informally, a type is a set of values: those values that are instances of or belong to the type. The type system of a programming language is the collection (vocabulary) of types that the language definition distinguishes, including the rules for determining whether a value is an instance of a type, and for how to create complex types from simple types.

A type error occurs when the operands of an operation have types that are not allowed for that operation. For example, in XQuery you can add two numbers using the + operator, but you can't add two nodes, even if the nodes contain integer values. If your program tries to add two nodes, the XQuery processor should give you an error message instead.

It is useful to distinguish between dynamic types and static types:

The type of a value is a dynamic type. Dynamic types exist during evaluation (at run-time). Dynamic types are sets of values that are instances of the type. A type specifies the meaning or interpretation of a value.
The type of an expression (a program fragment) is a static type. Static types are the types of declarations and program fragments as specified by the programmer or inferred by a compiler. If an expression has a (static) type and you evaluate the expression without a run-time error, then the result is guaranteed to be an instance of the corresponding static type.

A dynamically typed language is one that doesn't have static types. Another way to say the same thing is that there is only a single type, which contains all values. In those languages, all type errors are run-time errors. The goal of a static type system is to detect type errors at compile time, before actual execution. This is a process called type checking, and in some languages (including XQuery) is a fairly complicated process.

Static type checking lets you detect and fix errors earlier. This is especially valuable for infrequently executed parts of a program, since they are less likely to get much testing. As a side benefit, if the compiler can determine the type of an expression, it may be able to generate more efficient code, and so the query may execute faster.

The XQuery and XPath languages specify both dynamic types and static types. The static type checking is optional, both for implementors and users: An XQuery implementation need not implement the static typing feature, and implementations that do implement static typing will have an option to disable it.

We will discuss static typing later, but first we will study dynamic typing, including the kinds of values that XQuery and XPath deal with. The data model is part of dynamic typing.

The Data Model: Items and Sequences

The values worked on by an XQuery program are sequences of items. An item is either an atomic value (for example an integer or a string) or a node (for example an element or an attribute).

A sequence is a collection of zero or more items. The most important idea to note is that not only are all sequences values, but also all values are sequences, because a sequence of just a single value is in all respects the same as the single value. It follows from this that you cannot nest sequences - you cannot have sequences of sequences, only flat single-level sequences.
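
The flattening rule can be sketched in Python with a hypothetical seq helper that splices nested sequences into one level. Python itself still distinguishes the value 5 from the one-element tuple (5,), so the sketch only models the "no nesting" half of the rule.

```python
def seq(*items):
    """Build a flat sequence; nested sequences are spliced in, not nested."""
    flat = []
    for item in items:
        if isinstance(item, tuple):
            flat.extend(seq(*item))
        else:
            flat.append(item)
    return tuple(flat)

nested = seq(1, seq(2, 3), seq(), seq(seq(4)))
single = seq(5)
wrapped = seq(seq(5))
```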

If you have experience with arrays or lists in other programming languages, you might think it is a strange and limiting restriction that you can't nest sequences. Actually, it isn't really a limitation, because you can always use nested elements if you need nested data. For example, to represent a two-dimensional array you can use nested elements like this:

<list>
  <list>11 12</list>
  <list>21 22</list>
</list>
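
To show that the nested elements really do carry two-dimensional data, here is a small Python sketch that reads them back into a list of rows. It assumes each inner list holds space-separated integers, as in the example above.

```python
import xml.etree.ElementTree as ET

source = """
<list>
  <list>11 12</list>
  <list>21 22</list>
</list>
"""

outer = ET.fromstring(source)
# Each inner <list> is one row; its text is a space-separated run of numbers.
grid = [[int(tok) for tok in row.text.split()] for row in outer.findall("list")]
print(grid)
```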

A major difference between XPath 1 and XPath 2 is that the latter has sequences, while the former does not. Instead, XPath 1 has node sets, which are like sequences, but without duplicates, and in unspecified order. XPath 1 path expressions evaluate to node sets, while in XPath 2 (and XQuery) path expressions evaluate to node sequences. However, the latter sequences are defined to be sorted in document order and with duplicates removed. (These are actually equivalent, in that you can map a set into a sequence that is ordered and without duplicates, and back again, without information loss. Furthermore, any valid XPath 1 expression will behave the same under either model.)

Atomic Values and Types

The XQuery/XPath primitive types are the same as in XML Schema, which is a standard for specifying element structure of XML data, and associating types with XML data.

Atomic values include numbers, strings, and booleans. There are two kinds of atomic type:

A primitive type is not defined in terms of some other type.
A derived type is based on some other type, its base type.

A derived atomic type is a restriction of its base type, because its values form a sub-set of the set of atomic values that belong to the base type.

Following is a complete list of the built-in types defined by XML Schema. We will only list them briefly; for more information see the W3C Recommendation (02 May 2001) of XML Schema Part 2: Datatypes (http://www.w3.org/TR/xmlschema-2/). This specifies for each type its value space (the abstract values that belong to the type), its lexical space (the text representation of values using printable characters), and its facets (properties of the type itself).

XML Schema defines the following built-in types:

A boolean is one of the two truth values true and false.
A string is zero or more Unicode characters.
There are various sub-types of string: A normalizedString is a string that does not have any whitespace characters except for space. A token is a normalizedString that has no leading or trailing spaces, and does not have two or more spaces in a row. A language is a token used to specify a natural (human) language. An NMTOKEN is a token consisting of one or more NameChar characters, as defined in the XML standard. A Name is a token used to represent XML names, such as body or html:table. An NCName is a plain Name without a colon, such as body. The types ID, IDREF, and ENTITY are sub-types of NCName used for special kinds of attribute values as specified in the XML standard. The types IDREFS, ENTITIES, and NMTOKENS are used for space-separated sequences of the corresponding tokens.
A NOTATION is used for attributes that specify the notation (encoding) of an element. However, NOTATION is not a sub-type of string.
An anyURI represents a Uniform Resource Identifier Reference, such as http://www.w3.org/.
A QName represents an XML qualified name, which is a pair of a namespace name, and a local part. Note that in the value space a namespace name is an anyURI (such as http://www.w3.org/1999/xhtml), while the lexical representation uses namespace prefixes (as in xhtml:body). Therefore mapping between the two requires a context that contains the needed namespace declaration.
A decimal is an arbitrary-precision real number, in base 10. An integer is a decimal without a fractional part. The types nonPositiveInteger, negativeInteger, nonNegativeInteger, and positiveInteger are the obvious sub-types of integer. The types long, int, short, byte, unsignedLong, unsignedInt, unsignedShort, and unsignedByte are sub-types of integer that can be encoded in binary using respectively 64, 32, 16, or 8 bits.
The types float and double correspond to 32-bit and 64-bit IEEE binary floating-point real numbers. The standard lexical representation uses decimal format, with an optional exponent, such as -58.45 or 1.25e-10, even though these types are not sub-types of decimal.
There are a number of time-related types: A date is a calendar date, such as May 31, 1999 (written 1999-05-31). A time is an instant that occurs every day, like 1:20pm (written 13:20:00). A dateTime is a specific instant of time, like 1:20pm on May 31 1999 (written as 1999-05-31T13:20:00). Any of these may have an optional timezone specified. A gYear is a specific year in the Gregorian calendar, while a gYearMonth is a specific year and month. A gMonthDay is a month and day that recurs every year, a gMonth is a month that recurs every year, and a gDay is a day that recurs every month. A duration is a duration of time, like 2 days and 1 hour (written as P2DT1H). The XQuery/XPath committee has added two sub-types of duration, which may get added to future Schema revisions: xdt:yearMonthDuration (a duration of some number of years and months) and xdt:dayTimeDuration (a duration of some number of days, hours, minutes, and seconds).
The types hexBinary and base64Binary are used to encode arbitrary binary data. The value of either is zero or more octets (8-bit bytes). A hexBinary uses two hexadecimal digits for each octet, so 0FB7 encodes the 16-bit integer 4023. A base64Binary uses the Base64 MIME Content-Transfer-Encoding.
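
The difference between a type's lexical space and its value space can be illustrated with a few Python conversions. This is only an analogy using Python's own types; resolve_qname and the context dictionary are hypothetical, and the helper handles prefixed names only.

```python
import base64
from decimal import Decimal

# hexBinary: two hex digits per octet.
octets = bytes.fromhex("0FB7")
as_integer = int.from_bytes(octets, "big")

# base64Binary: a different lexical form for the same octets.
b64_text = base64.b64encode(octets).decode("ascii")
round_trip = base64.b64decode(b64_text)

# decimal: distinct lexical forms can denote the same value.
same_decimal = Decimal("1.50") == Decimal("1.5")

# double: decimal notation in the text, binary floating point in the value.
small = float("1.25e-10")

def resolve_qname(lexical, namespaces):
    """Map a prefixed name to (namespace name, local part) using a context."""
    prefix, _, local = lexical.rpartition(":")
    return (namespaces[prefix], local)

context = {"xhtml": "http://www.w3.org/1999/xhtml"}  # assumed declarations
qname = resolve_qname("xhtml:body", context)
```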

The union of all primitive types is anySimpleType.

All of these standard type names are in the http://www.w3.org/2001/XMLSchema namespace, conventionally written using the predefined namespace prefix xs, as in xs:string.

The XQuery specification adds four types: the duration types xdt:yearMonthDuration and xdt:dayTimeDuration are mentioned above; xdt:anyAtomicType includes all the atomic values; and xdt:untypedAtomic is a type used for untyped data, such as text that has not been validated. All are subtypes of anySimpleType, and are in the http://www.w3.org/2003/05/xpath-datatypes namespace.

Schemas and complex types

The word schema comes from the database community, and means a description of the structure, types, and relations of a database. In the XML world a schema is a description of the syntax and meaning (types) of a class of XML documents. A schema language is a formalism for specifying the types of documents as schemas.

The earliest XML schema language is DTD (Document Type Definition), which appears in the original XML specification from 1997, and goes back to the SGML roots of XML. DTD is a simple language that lets you express simple structural constraints. For example, the following:

<!ELEMENT tr (td*)>

means that a <tr> element consists of zero or more <td> elements.
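
One way to picture such a structural constraint is as a regular expression over the names of an element's children. The Python sketch below takes that view; matches_content_model and TR_MODEL are hypothetical names, and the check ignores text content and attributes.

```python
import re
import xml.etree.ElementTree as ET

def matches_content_model(element, model):
    """Check child element names against a regex standing in for a content model."""
    names = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, names) is not None

TR_MODEL = r"(td )*"   # corresponds to "zero or more <td> elements"

ok = matches_content_model(ET.fromstring("<tr><td/><td/></tr>"), TR_MODEL)
empty_ok = matches_content_model(ET.fromstring("<tr/>"), TR_MODEL)
bad = matches_content_model(ET.fromstring("<tr><td/><th/></tr>"), TR_MODEL)
```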

DTD does not have any mechanism for specifying semantic or type information, except in a very few cases. Other schema definition languages allow you to define and specify types.

XML Schema (http://www.w3.org/XML/Schema) is a 2001 specification from W3C that can be used to specify structural constraints and associate type information with XML documents. While there are other schema languages in use, this is the one with most usage and visibility, partly because it is a W3C standard. The type semantics of XQuery/XPath2 are defined in terms of XML Schema.

As an example we will use the record of a series of dice throws. Perhaps you want to verify the dice are fair, or you want a source of random numbers, or you want to search for mystical patterns.

<?xml version="1.0"?>
<die-tests>
  <die-test>
    <who>Nathan</who>
    <when>whenever</when>
    <throws>5 2 2 2 1 3 6 6 2 6</throws>
  </die-test>
  <die-test>
    <who>Per</who>
    <when>2002-10-09T09:07:00</when>
    <throws>6 2 5 2 2 3 3 3 4 1</throws>
  </die-test>
</die-tests>

The Schema for this might look like the following:

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="die-tests">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="die-test" type="die-test-type"
          minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:complexType name="die-test-type">
    <xsd:sequence>
      <xsd:element name="who" type="xsd:string"/>
      <xsd:element name="when" type="xsd:dateTime"/>
      <xsd:element name="throws">
        <xsd:simpleType>
          <xsd:list itemType="die6-result"/>
        </xsd:simpleType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:simpleType name="die6-result">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="1"/>
      <xsd:maxInclusive value="6"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>

This is verbose, and may be a bit intimidating, but it is relatively straightforward. It contains a top-level element declaration for the root element <die-tests>, as well as type definitions for types named die-test-type and die6-result. A simple type (such as die6-result) can only be expressed as character data. A complex type (such as die-test-type) can consist of attribute specifications, and either sub-elements (complexContent), character data (simpleContent), or a mixture of these (complexContent, mixed="true"). The type of an attribute can only be a simple type, while the type of an element can be either a simple type (if it only contains text data), or it can be a complex type. All pre-defined types (such as xsd:integer) are simple.

The top-level element declaration for die-tests says that any element with the die-tests tag has the structure and type specified: It is a complex type consisting of a sequence of 0 or more elements that have the tag die-test, and that the content of each such die-test element has the type with the name die-test-type. (It is possible to specify that a given element tag can have different types in different contexts, but we'll ignore that possibility.)

The definition of the complex type die-test-type specifies that any element declared to have that type (in our case die-test) consists of a sequence of a <who> element, a <when> element, and a <throws> element. The latter is a space-separated list of die6-result values. The definition for the simple type die6-result says that a die6-result is an integer in the range 1 through 6.
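
The constraints just described can be approximated by hand in Python, which may help make them concrete. This is a simplified sketch and not a real schema validator: the dateTime pattern checks only the lexical shape, check_die_test is a hypothetical helper, and the sample records are shortened, made-up data.

```python
import re
import xml.etree.ElementTree as ET

# Simplified pattern for the xsd:dateTime lexical form (ranges not checked).
DATETIME = re.compile(
    r"-?\d{4,}-\d\d-\d\dT\d\d:\d\d:\d\d(\.\d+)?(Z|[+-]\d\d:\d\d)?$")

def is_die6_result(token):
    """An integer in the range 1 through 6."""
    return re.fullmatch(r"[+-]?\d+", token) is not None and 1 <= int(token) <= 6

def check_die_test(elem):
    """Return a list of problems found in one <die-test> element."""
    problems = []
    if [child.tag for child in elem] != ["who", "when", "throws"]:
        return ["expected the sequence who, when, throws"]
    when = (elem.findtext("when") or "").strip()
    if not DATETIME.match(when):
        problems.append("when is not a dateTime: " + when)
    for token in (elem.findtext("throws") or "").split():
        if not is_die6_result(token):
            problems.append("throw out of range: " + token)
    return problems

sample = ET.fromstring("""
<die-tests>
  <die-test><who>A</who><when>2002-10-09T09:07:00</when><throws>6 2 5</throws></die-test>
  <die-test><who>B</who><when>whenever</when><throws>1 7 3</throws></die-test>
</die-tests>
""")
report = [check_die_test(test) for test in sample.findall("die-test")]
print(report)
```

The first record passes, while the second is reported for both its when value and its out-of-range throw.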

Validation produces type annotations

To validate an XML document against a schema means to scan the document, verifying that the document satisfies the constraints specified in the schema. The result is a post-schema validation infoset (PSVI), which is an infoset (as defined earlier) with additional type annotations. A type annotation is the QName of a type named in a schema.

An XQuery processor may optionally implement the Schema import feature. If it does, it must be able to import definitions from external schemas and validate node trees.

Each element or attribute in XQuery has a type annotation, which is its dynamic type. If an element has not been validated, or otherwise been given a type annotation, then it has the default type annotation xs:anyType. The corresponding default for an attribute node is the type xdt:untypedAtomic.

Atomic (non-node) values can also have type annotations. The annotation xdt:untypedAtomic indicates that the type is unknown, typically raw text from a schema-less XML file. Operations that take atomic values may cast xdt:untypedAtomic to a more specific type, such as xs:double, but if the atomic value is of the wrong kind (a string where a number is required, as in the operands of +), then a run-time error may be signaled.
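
The behavior of untyped values in arithmetic can be imitated in Python. In this sketch UntypedAtomic, to_number, and add are all hypothetical names; Python's float conversion accepts a slightly different set of strings than xs:double does, so treat it as an approximation.

```python
class UntypedAtomic(str):
    """Raw text whose type is not yet known (a stand-in for untyped data)."""

def to_number(value):
    """Cast untyped text to a double; typed strings are rejected."""
    if isinstance(value, UntypedAtomic):
        return float(value)          # raises ValueError if not numeric text
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    raise TypeError("operand of + must be numeric")

def add(left, right):
    return to_number(left) + to_number(right)

total = add(UntypedAtomic("1.5"), 2)

try:
    add(UntypedAtomic("abc"), 2)
    failed = False
except ValueError:
    failed = True
```

Note that a plain (typed) string such as "1.5" is rejected with a TypeError rather than being converted, which mirrors the distinction between untyped text and a value already known to be a string.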

An XQuery application can use a validate expression:

validate ( EXPR )

This takes a sequence of elements, strips off any existing type annotations, and adds type annotations as specified by the in-context schema definitions. The latter are all the schema element declarations and type definitions that are imported by schema import declarations. (You can optionally specify a SchemaContext that can be used with context-dependent schema types.)

A schema import declaration appears in the Query Prolog of an XQuery program. For example:

import schema "http://www.w3.org/1999/xhtml" at "xhtml.xsd"

This tells the XQuery processor to look at the location specified (in this case by the relative URL "xhtml.xsd") and add any schema components in the specified namespace (http://www.w3.org/1999/xhtml) to the set of visible schema components. These now become available for validate expressions.

Note that Schema validation and type annotation are conceptually dynamic (run-time) type operations. A type annotation is a QName that is associated with a value, not associated with a static (compile-time) expression. Next we will look at static type-checking.

Sequence Types

The XQuery language provides operations to check whether a value belongs to a type, as well as mechanisms to declare that a variable or parameter has a specific type. In an XQuery program (static) types are instances of SequenceType. We won't go into detail about SequenceType, but here are some examples:

text() — Matches any text node.

element() — Matches any element node.

element(xhtml:td,*) — Matches any element node whose tag has the local part td and has the same namespace URI that xhtml is bound to. (It does not have to have xhtml as the actual namespace prefix.)

element(*, die6-result) — Matches any element of any tag that has a type annotation of die6-result.

element(xhtml:title)? — Matches an optional title element - i.e. zero or one items that match element xhtml:title, and whose type annotation matches that declared for xhtml:title in an imported schema definition.

node()* — Matches a sequence of zero or more nodes.

item()+ — Matches any non-empty sequence.

attribute(@ID, *) — Matches any attribute node whose name is ID (in the empty namespace).

xs:integer — Matches any integer type or any type derived from it, such as xs:nonNegativeInteger, assuming this is in scope of a namespace declaration that binds xs to http://www.w3.org/2001/XMLSchema, which is normally the case.
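
The occurrence indicators ?, *, and + used in the examples above can be modeled in Python as a check on sequence length combined with a per-item test. The function names here are hypothetical, and the item tests are plain Python predicates rather than real SequenceType syntax.

```python
def matches(sequence, item_test, occurrence=""):
    """Check a flat sequence against an item test and occurrence indicator."""
    count = len(sequence)
    allowed = {
        "": count == 1,    # exactly one item
        "?": count <= 1,   # zero or one
        "*": True,         # zero or more
        "+": count >= 1,   # one or more
    }[occurrence]
    return allowed and all(item_test(item) for item in sequence)

def is_integer(item):
    return isinstance(item, int) and not isinstance(item, bool)

def is_item(item):
    return True
```

For example, matches((), is_integer, "?") holds because the empty sequence satisfies an optional type, while matches((), is_item, "+") does not.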

These types can be used for XQuery's type-checking and -conversion operators. Here is a very brief summary; see the specification or other chapters for details and examples.

expr instance of type — Returns true if the value of expr matches (is an instance of) type.

cast as type (expr) — Convert the value of expr to a given type, using certain standard conversions.

treat as type (expr) — Treat the expr as having static type type. At run-time, a dynamic error is signaled if the value of expr is not an instance of type.

typeswitch (expr) case type1 return expr1 ... default return exprd — Select the first case whose type matches the value of the expr, and evaluate the corresponding expression.

Static typing

An XQuery implementation may optionally implement the Static Typing Feature. This means that the implementation is required to detect static type errors at analysis (compile) time. At the time of this writing, the specification has a number of unresolved issues, and I don't know of any implementation that actually does implement static typing. (However, some of the precursor languages that inspired XQuery do implement static typing.) For these reasons, plus the fact that the specification of static typing is big and formal, I won't go beyond mentioning a few of the concepts.

The static type system defined in the XQuery formal semantics (http://www.w3.org/TR/query-semantics/) goes far beyond what you can express as a SequenceType. It includes most of the type specification concepts of XML Schema. The formal semantics defines extra declarations define type, define element, and define attribute. These are not in the XQuery source language (i.e. you can't write them directly), but are a formalism used in the formal semantics to express types imported from schemas. The idea is that an XQuery program is translated to core XQuery, which is simpler and more regular (but less convenient) than the actual XQuery program. Part of this translation is that Schema import declarations are translated into define type, define element, and define attribute declarations. These internal declarations, as well as the whole concept of core XQuery, are purely part of the formal specification of XQuery: There is no requirement that any implementation implement the translation to core XQuery, only that it acts as if it does.

Static type checking is done at the level of core XQuery at analysis (or compile) time. There are a whole slew of rules that say things like if the type of expr1 is xsd:boolean, the type of expr2 is type2, and the type of expr3 is type3, then the type of if (expr1) then expr2 else expr3 is (type2|type3). Here (type2|type3) is a type expression in the formal semantics, which you cannot write directly in the actual XQuery language.
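
The flavor of such a rule can be conveyed with a toy type inferencer in Python. This is not the formal semantics: the expression encoding as tuples, the static_type and union functions, and the representation of a union type as a set of type names are all assumptions made for this sketch.

```python
def union(*types):
    """Form a union type as a set of type names."""
    return frozenset().union(*types)

def static_type(expr):
    """Infer a type for a tiny expression tree without evaluating it."""
    kind = expr[0]
    if kind == "literal":
        value = expr[1]
        if isinstance(value, bool):
            return frozenset({"xs:boolean"})
        if isinstance(value, int):
            return frozenset({"xs:integer"})
        return frozenset({"xs:string"})
    if kind == "if":
        _, test, then_branch, else_branch = expr
        if static_type(test) != frozenset({"xs:boolean"}):
            raise TypeError("static type error: condition is not boolean")
        return union(static_type(then_branch), static_type(else_branch))
    raise ValueError("unknown expression kind: " + kind)

expr = ("if", ("literal", True), ("literal", 1), ("literal", "one"))
inferred = static_type(expr)
```

The inferred type of the conditional is the union of the two branch types, and a non-boolean condition is rejected before anything is evaluated.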

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, version 1.1.