If you find a mistake, or something is unclear, please email per@bothner.com so I can fix the text.
We have earlier looked at the values (nodes, primitives, and sequences) that XQuery works with. In this article we will look more deeply into the XQuery/XPath data model and type system. On the way we will touch on a fair bit of background material, including XML Schemas and XML infosets.
The XQuery data model is based on the XML Information Set standard (W3C Recommendation 24 October 2001, http://www.w3.org/TR/xml-infoset). It rather abstractly defines the information content of an XML document as a document item that contains nested element items, which in turn contain namespace, attribute, character and other items. This is a conceptual standard: It does not define any file formats or programming interfaces, but rather it defines the interpretation of an XML file. It is intended to be useful for defining other XML-related standards, including the XQuery/XPath data model.
XML files that are different at the character level but that have the same information set or infoset are for most practical purposes equivalent. For example:
<a b='upcase' ><![CDATA[Hello!]]></a>
and
<a b="upcase" >Hello!</a>
have the same information sets.
The "Canonical XML" recommendation is a related standard in that it specifies a unique ("canonical") way to convert an XML infoset (back) into an XML document. Two XML documents that are "logically equivalent" (i.e. have the same infoset) translate to the same canonical XML representation. The Canonical XML for the above example is:
<a b="upcase">Hello!</a>
A parsed XML file results in an infoset, but there can also be synthetic infosets that are constructed from other sources, such as a database, or created by a program that manipulates a DOM (Document Object Model). A DOM is a popular data structure API used to encode and manipulate XML data - i.e. infosets.
The XQuery language also allows you to create infoset items, using element constructor expressions or pre-defined functions.
Node values represent the parts of a XML document, or more generally an XML infoset. Nodes are also used to represent document fragments - i.e. stand-alone nodes that are not part of a document (for example, those that might be generated by an element constructor expression).
These are the kinds of nodes, most of which are as you would expect:
Document nodes represent a complete XML document.
Element nodes present XML elements.
Attribute nodes represent the attributes of an element. Note that namespace definitions are represented by namespace nodes instead.
Namespace nodes represent the in-scope attributes of an element. (You cannot actually "get at" a namespace node in XQuery,
though you get at them in XPath using the namespace::
axis, which has been deprecated in XPath 2.0.)
Processing instruction nodes represent embedded XML processing instructions.
Comment nodes represent XML comments.
Text nodes represent character data. Note that an infoset consists of single-character items, but in the node representation multiple contiguous character items are represented in a single text node.
The XQuery 1.0 and XPath 2.0 Data Model specification (http://www.w3.org/TR/query-datamodel/) goes into details about the different kinds of nodes. It also defines a
number of functions on nodes, using the prefix dm
. For example, the function dm:node-kind
takes a Node
and returns a string value that represents the node's kind, one of
"document"
, "element"
, "attribute"
, "text"
, "namespace"
, "processing-instruction"
, or "comment"
.
Note that these functions are only to explain the data model:
You cannot call them from an XQuery program, and
the prefix dm
isn't actually bound to any namespace.
However, in some cases there may be a
function in the Functions and Operators document
with the same name and behavior. Those are available to
your XQuery programs.
For example there is a fn:node-kind
function which
you can use, and which is defined to return the same
result as dm:node-kind
.
Nodes in XQuery are immutable, which means you cannot change any part of a node once it has been created. This makes sense, since XQuery is a pure side-effect-free expression
language. However, nodes do have identity: two nodes that
were created using different expressions are distinct nodes,
even if they contain the same data.
You can compare the latter using the standard function
fn:deep-equal
.
The following examples are all true
:
fn:deep-equal(<a>test</a>, </a>test</a>) <a>test</a> isnot </a>test</a> let $x := <a><b></b></a> return $x/b is $x/*
Because nodes have identity, you can talk about making a copy of a node, because you can distinguish the original from the copy. However, atomic values do not have identity: There is no
way to copy the string "xyzzy"
because there is no way to distinguish the copy from the original. That is the difference between a string atomic value and a text node, which you can
copy.
A node may have children, which are the nodes below it in the tree hierarchy, but not including attribute or namespace nodes.
For example, the children of an element node are the text nodes and
nested elements (and occasionally other nodes) that are its contents.
The function dm:children
takes a node and returns the sequence of its children.
(This function only exists in the data model; to get the children of a node
N in XQuery use the expression N/node()
.)
Only document and element nodes can have children, so
dm:children
returns the empty sequence for other node types.
For each node X
that is a child of Y
, the node X
has a parent property that is the node Y
.
Using the data model, you can get at Y
using the
dm:parent
function; in an XQuery program
you have to use the expression X/parent::node()
.
These properties have some surprising consequences. Because nodes are immutable, you have to specify the children of an element or document when you create it. However, those children
have to have their parent property set to the new node - but you can't modify them, as they are immutable. This chicken-and-egg problem is solved by creating new copies of the children, with
the parent property of the new nodes set to the new parent. Any children of the children also have to be copied.
Note, however, that this copying of nodes is part of the specification, but an implementation is free to optimize away the copying if it doesn't change the result. For example, consider the following expression:
<a>Some <b/>{<c>text</c>}</a>
The specification says that <b>
and <c>
nodes are created, and then copied when <a>
is created. But since there is no way to access the old
<b>
and <c>
nodes, an implementation is free to just re-use the old nodes without copying them, or it can create them in-place at the same time it creates <a>
.
This is an example of the important difference between specification and (valid) implementation. The lack of side effects in XQuery gives the implementation extra flexibility in choosing how to
implement things. A possible disadvantage is that it makes it hard to estimate how much work is done for an XQuery program, unless you are very familiar with your implementation. On the other hand,
you usually don't need to know.
More generally, an implementation is free to represent nodes in any way compatible with the specification. An obvious choice is to use the standard Node
type specified in the
W3C's Document Object Model (DOM) (http://www.w3.org/DOM). However, though DOM is a flexible and convenient API, it is quite space-inefficient. As an example of an alternative representation,
the Qexo implementation (http://www.gnu.org/software/qexo/) uses a single TreeList
for an entire document. The TreeList
contains an internal array, and node objects are
identified by indexes into that array. (The Apache Xalan XSLT processor uses a similar Document Array Model representation.) In fact, an implementation may in some cases not create actual node
objects at all. Consider that the ultimate result of evaluating an XQuery expression is often written out to a file as a new XML document. In that case the XQuery processor can write out the nodes
on-the-fly directly to the output file, without ever creating any nodes. More generally, the XQuery processor can "write" the output to a SAX DocumentHandler
or a similar event-driven
interface.
Sometimes it is useful to take a node, and convert it to a string value. The function fn:string
does that. The string value of a text node is the characters in the
node. The string value of an element or document node is the concatenation of the text node descendents of the node in document order. The string value of an attribute node is the attribute
value.
There is also the typed value of an element, attribute, or text node, which you can extract using the fn:data
function. This is the value of a node as a sequence of
atomic values, as the result of Scheme validation. If an element node has a complex type, then the typed value is undefined.
The XQuery and XPath languages are typed expression (functional) languages. This means that programs are made from expressions (which may in turn contain sub-expressions), and that evaluating an expression results in a value, which has a type.
Informally, a type is a set of values: those values that are instances of or belong to the type. The type system of a programming language is the collection (vocabulary) of types that the language definition distinguishes, including the rules for determining whether a value is an instance of a type, and for how to create complex types from simple types.
A type error occurs when the operands of an operation have types that are not allowed for that operation. For example, in XQuery you can add two numbers using the +
operator, but you can't add two nodes, even if the nodes contain integer values. If your program tries to add two nodes, the XQuery processor should give you an error message instead.
It is useful to distinguish between the dynamic types and static types:
The type of a value is a dynamic type. Dynamic types exist during evaluation (at run-time). Dynamic types are sets of values that are instances of the type. A type specifies the meaning or interpretation of a value.
The type of an expression (a program fragment) is a static type. Static types are the types of declarations and program fragments as specified by the programmer or inferred by a compiler. If an expression has a (static) type and you evaluate the expression without a run-time error, then the result is guaranteed to be an instance of the corresponding static type.
A dynamically typed language is one that doesn't have static types. Another way to say the same thing is that there is only a single type, which contains all values. In those languages, all type errors are run-time errors. The goal of a static type system is to detect type errors at compile time, before actual execution. This is a process called type checking, and in some languages (including XQuery) is a fairly complicated process.
Static type checking lets you detect and fix errors earlier. This is especially valuable for infrequently executed parts of a program, since they are less likely to get much testing. As a side benefit, if the compiler can determine the type of an expression, it may be able to generate more efficient code, and so the query may execute faster.
The XQuery and XPath languages specify both dynamic types and static types. The static type checking is optional, both for implementors and users: An XQuery implementation need not implement the static typing feature, and implementations that do implement static typing will have an option to disable it.
We will discuss static typing later, but first we will study dynamic typing, including the kinds of values that XQuery and XPath deal with. The data model is part of dynamic typing.
The values worked on by an XQuery program are sequences of items. An item is either an atomic value (for example an integer or a string) or a node (for example an element or an attribute).
A sequence is a collection of zero or more items. The most important idea to note is that not only are all sequences values, but also all values are sequences, because a sequence of just a single value is in all respects the same as the single value. It follows from this that you cannot nest sequences - you cannot have sequences of sequences, only flat single-level sequences.
If you have experience with arrays or lists in other programming languages, you might think it is a strange and limiting restriction that you can't nest sequences. Actually, it isn't really a limitation, because you can always uses nested elements if you need nested data. For example, to represent a two-dimensional array you can use nested elements like this:
<list> <list>11 12</list> <list>21 22</list> </list>
A major difference between XPath 1 and XPath 2 is that the latter has sequences, while the former does not. Instead, XPath 1 has node sets, which are like sequences, but without duplicates, and in unspecified order. XPath 1 path expressions evaluate to node sets, while in XPath 2 (and XQuery) path expressions evaluate to node sequences. However, the latter sequences are defined to be sorted in document order and with duplicates removed. (These are actually equivalent, in that you can map a set into a sequence that is ordered and without duplicates, and back again, without information loss. Furthermore, any valid XPath 1 expression will behave the same under either model.)
The XQuery/XPath primitive types are the same as in XML Schema, which is a standard for specifying element structure of XML data, and associating types with XML data.
Atomic values include numbers, values, and booleans. There are two kinds of atomic type:
A primitive type is not defined in terms of some other type.
A derived type is based on some other type, its base type.
A derived atomic type is a restriction of its base type, because it is a restriction (sub-set) of the set of atomic values that belong to the base type.
Following is a complete list of the built-in types defined by XML Schema. We will only list them briefly; for more information see the W3C Recommendation (02 May 2001) of XML Schema Part 2: Datatypes (http://www.w3.org/TR/xmlschema-2/). This specifies for each type its value space (the abstract values that belong to the type), its lexical space (the text representation of values using printable characters), and its facets (properties of the type itself).
XML Schema defines the following builtin types:
A boolean
is one of the two truth values true
and false
.
A string
is zero or more Unicode characters.
There a various sub-types of string
: A normalizedString
is a string
that does not have any whitespace characters except for space. A token
is a normalizedString
that has no leading or trailing spaces, and does not have two or more spaces in a row. A language
is a token
used to specify a natural (human) language. An
NMTOKEN
is a token
consisting of one or more NameChar
characters, as defined in the XML standard. A Name
is a token
used to represent XML names, such as body
or html:table
. An NCName
is a plain Name
without a colon, such as body
. The types ID
, IDREF
, and ENTITY
are sub-types of NCName
used for
special kinds of attribute values as specified in the XML standard. The types IDS
, IDREFS
, ENTITIES
, and NMTOKENS
are used for space-separated sequences of the
corresponding tokens.
A NOTATION
is used for attributes that specify the notation (encoding) of an element. However, NOTATION
is not a sub-type of string
.
An anyURI
represents a Uniform Resource Identifier Reference, such as http://www.w3.org/.
A QName
represents an XML qualified name, which is a pair of a namespace name, and a local part. Note that in the value space a namespace name is an
anyURI
(such as http://www.w3.org/1999/xhtml), while the lexical representation uses namespace prefixes (as in xhtml:body
). Therefore mapping between the two requires a context
that contains the needed namespace declaration.
A decimal
is an arbitrary-precision real number, in base 10. An integer
is a decimal
without a fractional part. The types nonPositiveInteger
,
negativeInteger
, nonNegativeInteger
, and positiveInteger
are the obvious sub-types of integer
. The types long
, int
, short
, byte
,
unsignedLong
, unsignedInt
, unsignedShort
, and unsignedByte
are sub-types of integer
that can be encoded in binary using respectively 64, 32, 16, or 8 bits.
The types float
and double
correspond to 32-bit and 64-bit IEEE binary floating-point real numbers. The standard lexical representation uses decimal
format, with an optional exponent, such as -58.45
or 1.25e-10
, even though these types are not sub-types of decimal
.
There are a number of time-related types: A date
is a calendar date, such as May 31, 1999 (written 1999-05-31
). A time
is an instant that
occurs every day, like 1:20pm (written 13:20
). A dateTime
is a specific instant of time, like 1:20pm on May 31 1999 (written as 1999-05-31T13:20
). Any of these may have an
optional timezone specified. A gYear
is a specific year in the Gregorian calendar, while a gMonthYear
is a specific year and month. A gMonthDay
is a month and day that recurs
every year, a gMonth
is a month that recurs every year, and a gDay
is a day that recurs every month. A duration
is a duration of time, like 2 days and 1 hour (written as
P2D1H
). The XQuery/XPath committee has added two sub-types of duration
, which may get added to future Schema revisions: xdt:yearMonthDuration
(a duration of some number of years and
months) and xdt:dayTimeDuration
(a duration of some number of days, hours, minutes, and seconds).
The types hexBinary
and base64Binary
are used to encode arbitrary binary data. The value of either is zero or more octets (8-bit bytes). A
hexBinary
uses two hexadecimal digits for each octet, so 0FB7
encodes the 16-bit integer 4023. A base64Binary
uses the Base64 MIME Content-Transfer-Encoding.
The union of all primitive types is anySimpleType
.
All of these standard types names are in the http://www.w3.org/2001/XMLSchema
namespace, conventionally written using the predefined namespace prefix xs
, as in xs:string
.
The XQuery specication adds four types:
the duration types xdt:yearMonthDuration
and
xdt:dayTimeDuration
are mentioned above;
xdt:anyAtomicType
includes all the atomic values;
and xdt:untypedAtomic
is a type used for untyped data,
such as text that has not been validated.
All are subtypes of anySimpleType
, and are
in the http://www.w3.org/2003/05/xpath-datatypes
namespace.
The word schema comes from the database community, and means a description of the structure, types, and relations of a database. In the XML world a schema is a description of the syntax and meaning (types) of a class of XML documents. A schema language is a formalism for specifying the types of documents as schemas.
The earliest XML schema language is DTD (Document Type Descriptor), which appears in the original XML specification from 1997, and goes back to the SGML roots of XML. DTD is a simple language that lets you express simple structural constraints. For example, the following:
<!ELEMENT tr td*>
means that a <tr>
element consists of zero or more <td>
elements.
DTD does not have any mechanism for specifying semantic or type information, except in a very few cases. Other schema definition languages allow you to define and specify types.
XML Schema (http://www.w3.org/XML/Schema) is a 2001 specification from W3C that can be used to specify structural constrains and associate type information with XML documents. While there are other Scheme language in use, this is the one with most usage and visibility, partly because it is a W3C standard. The type semantics of XQuery/XPath2 are defined in terms of XML Schema.
As an example we will use the record of a series of dice throws. Perhaps you want to verify the dice are fair, or you want a source of random numbers, or you want search for mystical patterns.
<?xml version="1.0"?> <die-tests> <die-test> <who>Nathan</who> <when>whenever</when> <throws>5 2 2 2 1 3 6 6 2 6</throws> </die-test> <die-test> <who>Per</who> <when>2002-10-09T09:07</when> <throws>6 2 5 2 2 3 3 3 4 1</throws> </die-test> </die-tests>
The Schema for this might look like the following:
<?xml version="1.0"?> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:element name="die-tests"> <xsd:complexType> <xsd:sequence> <xsd:element name="die-test" type="die-test-type" minOccurs="0" maxOccurs="unbounded"/> </xsd:sequence> </xsd:complexType> </xsd:element> <xsd:complexType name="die-test-type"> <xsd:sequence> <xsd:element name="who" type="xsd:string"/> <xsd:element name="when" type="xsd:dateTime"/> <xsd:element name="throws"> <xsd:simpleType> <xsd:list itemType="die6-result"/> </xsd:simpleType> </xsd:element> </xsd:sequence> </xsd:complexType> <xsd:simpleType name="die6-result"> <xsd:restriction base="xsd:integer"> <xsd:minInclusive value="1"/> <xsd:maxInclusive value="6"/> </xsd:restriction> </xsd:simpleType> </xsd:schema>
This is verbose, and may be a bit intimidating, but it is relatively straightforward. It contains a top-level element declaration for the root element <die-tests>
,
as well as type definitions for types named die-test-type
and die6-result
.
A simple type (such as die6-result
)
can only be expressed as character data.
A complex type (such as die-test-type
)
can consist of attribute specifications,
and either sub-elements (complexContent
),
character data (simpleContent
),
or a mixture of these (complexContent, mixed="true"
).
The type of an attribute can
only be a simple type, while the type of an element can be either a simple type (if it only contains text data), or it can be a complex type. All pre-defined types (such as xsd:integer
) are
simple.
The top-level element declaration for die-tests
says that any element with the die-tests
tag has the structure and type specified: It is a complex type consisting of a
sequence of 0 or more elements that have the tag die-test
, and that the content of each such die-test
element has the type with the name die-test-type
. (It is possible to specify
that a given element tag can be have different types in different contexts, but we'll ignore that possibility.)
The definition of the complex type die-test-type
specifies that any element declared to have that type (in our case die-test
) consists of a sequence of a
<name>
element, a <when>
element, and a <throws>
element. The latter is a space-separated list of die6-result
values. The definition for the simple type
die6-result
says that a die-result
is an integer
in the range 1 through 6.
To validate an XML document against a schema means to scan the document, verifying that the document satisfies the constraints specified in the schema. The result is a post-schema validation infoset (PSVI), which is an info set (as defined earlier) with additional type annotations. A type annotation is the QName of a type named in a schema.
An XQuery processor may optionally implement the Schema import feature. If it does, it must be able to import definitions from external schemas and validate node trees.
Each element or attribute in XQuery has a type annotation, which is its dynamic type. If an element has not been validated, or otherwise been given a type annotation, then
it has the default type annotation xs:anyType
. The corresponding
default for an attribute node is the type xs:untypedAtomic
.
Atomic (non-node) values can also have type annotations. The annotation xsd:untypedAtomic
indicates that the type is unknown, typically
raw text from an schema-less XML file. Operations that take atomic values may
cast xsd:untypedAtomic
to a more specific type, such as xs:double
, but if the atomic value is of the wrong kind (a string where a number is required, as in the operands of +
),
then a run-time error may be signaled.
An XQuery application can use a validate expression:
validate ( EXPR )
This takes a sequence of elements, strips off any existing type annotations, and adds type annotations as specified by the in-context scheme definitions. The latter are all the scheme element declarations and type definitions that are imported by schema import declarations. (You can optionally specify a SchemaContext that can be used with context-dependent schema types.)
A schema import declaration appears in the Query Prolog of an XQuery program. For example:
import schema "http://www.w3.org/1999/xhtml" at "xhtml.xsd"
This tells the XQuery processor to look at the location specified (in this case by the relative URL "xhtml.dtd
") and add any schema components in the specified namespace
(http://www.w3.org/1999/xhtml) to the set of visible schema components. These now become available for validate
expressions.
Note that Schema validation and type annotation are conceptually dynamic (run-time) type operations. A type annotation is a QName that is associated with a value, not associated with a static (compile-time) expression. Next we will look at static type-checking.
The XQuery language provides operations to check whether a value belongs to a type, as well as mechanisms to declare that a variable or parameter has a specific type. In an XQuery
program (static) types are instances of SequenceType
. We won't go into detail about SequenceType
, but here are some examples:
text()
— Matches any text node.
element()
— Matches any element node.
element(xhtml:td,*)
— Matches any element node whose tag has the local part td
and has the same namespace URI that xhtml
is bound to. (It does not have to have
xhtml
as the actual namespace prefix.)
element(*, die6-result)
— Matches any element of any tag that has a type annotation of die6-result
.
element(xhtml:title)?
— Matches an optional type
element - i.e. zero or one items that match element
xhtml:title
, and whose type annotation matches
that declared for xhtml:title
in an imported schema
definition.
node()*
— Matches a sequence of zero or more nodes.
item()+
— Matches any non-empty sequence.
attribute(@ID, *)
— Matches any attribute node whose name is ID
(in the empty namespace).
xs:integer
— Matches any integer type or any type derived from it, such a xs:nonNegativeInteger
, assuming this is in scope of a namespace declaration that binds
xs
to http://www.w3.org/2001/XMLSchema
, which is normally the case.
These types can be used for XQuery's type-checking and -conversion operators. Here is a very brief summary; see the specification or other chapters for details and examples.
expr instance of type
— Returns true if the value of expr
matches (is an instance of) type
.
cast as type (expr)
— Convert the value of expr
to a given type
, using certain standard conversions.
treat as type (expr)
— Treat the expr
as having static type type
. At run-rime, a dynamic error is signaled if the value of expr
is not an instance of
type.
typeswitch (expr) case type1 return expr1 ... default return exprd
— Select the first case
whose type matches the value of the expr
, and evaluate the
corresponding expression.
An XQuery implementation may optionally implement the Static Typing Feature. This means that the implementation is required to detect static type errors at analysis (compile) time. At the time of this writing, the specification has a number of unresolved issues, and I don't know of any implementation that actually does implement static typing. (However, some of the precursor languages that inspired XQuery do implement static typing.) For these reasons, plus the fact that the specification of static typing is big and formal, I won't go beyond mentioning a few of the concepts.
The static type system defined in the XQuery formal semantics (http://www.w3.org/TR/query-semantics/) goes far beyond what you can express as a SequenceType
. It includes
most of the type specification concepts of XML Schema. The formal semantics defines extra declarations define type
, define element
, and define attribute
. These are not in
the XQuery source language (i.e. you can't write them directly), but are a formalism used in the formal semantics to express types imported from schemas. The idea is that an XQuery program is
translated to core XQuery, which is simpler and more regular (but less convenient) than the actual XQuery program. Part of this translation is that Scheme import declaration are translated
into define type
, define element
, and define attribute
declarations. These internal declarations, as well as the whole concept of core XQuery, are purely part of the formal
specification of XQuery: There is no requirement that any implementation implement the translation to core XQuery, only that it acts as if it does.
Static type checking is done at the level of core XQuery at analysis (or compile) time. There are a whole slew of rules that say things like if the type of expr1
is
xsd:boolean
, the type of expr2
is type2
, and the type of expr3
is type3
, then the type of if (expr1) then expr2 else expr3
is (type2|type3)
. Here (type2|type3)
is a type expression in the formal semantics, which you cannot write directly in the actual XQuery language.
Copyright 2002, 2003 (C) Per Bothner. <per@bothner.com>
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, version 1.1.