7.21.3 SSAX: A Functional XML Parsing Toolkit

Guile’s XML parser is based on Oleg Kiselyov’s powerful XML parsing toolkit, SSAX.

7.21.3.1 History

Back in the 1990s, when the world was young again and XML was the solution to all of its problems, there were basically two kinds of XML parsers out there: DOM parsers and SAX parsers.

A DOM parser reads through an entire XML document, building up a tree of “DOM objects” representing the document structure. They are very easy to use, but sometimes you don’t actually want all of the information in a document; building an object tree is not necessary if all you want to do is to count word frequencies in a document, for example.

SAX parsers were created to give the programmer more control on the parsing process. A programmer gives the SAX parser a number of “callbacks”: functions that will be called on various features of the XML stream as they are encountered. SAX parsers are more efficient, but much harder to use, as users typically have to manually maintain a stack of open elements.

Kiselyov realized that the SAX programming model could be made much simpler if the callbacks were formulated not as a linear fold across the features of the XML stream, but as a tree fold over the structure implicit in the XML. In this way, the user has a very convenient, functional-style interface that can still generate optimal parsers.

The xml->sxml interface from the (sxml simple) module is a DOM-style parser built using SSAX, though it returns SXML instead of DOM objects.

7.21.3.2 Implementation

(sxml ssax) is a package of low-to-high level lexing and parsing procedures that can be combined to yield a SAX, a DOM, a validating parser, or a parser intended for a particular document type. The procedures in the package can be used separately to tokenize or parse various pieces of XML documents. The package supports XML Namespaces, internal and external parsed entities, user-controlled handling of whitespace, and validation. This module therefore is intended to be a framework, a set of “Lego blocks” you can use to build a parser following any discipline and performing validation to any degree. As an example of the parser construction, the source file includes a semi-validating SXML parser.

SSAX has a “sequential” feel of SAX yet a “functional style” of DOM. Like a SAX parser, the framework scans the document only once and permits incremental processing. An application that handles document elements in order can run as efficiently as possible. Unlike a SAX parser, the framework does not require an application register stateful callbacks and surrender control to the parser. Rather, it is the application that can drive the framework – calling its functions to get the current lexical or syntax element. These functions do not maintain or mutate any state save the input port. Therefore, the framework permits parsing of XML in a pure functional style, with the input port being a monad (or a linear, read-once parameter).

Besides the port, there is another monad – seed. Most of the middle- and high-level parsers are single-threaded through the seed. The functions of this framework do not process or affect the seed in any way: they simply pass it around as an instance of an opaque datatype. User functions, on the other hand, can use the seed to maintain user’s state, to accumulate parsing results, etc. A user can freely mix their own functions with those of the framework. On the other hand, the user may wish to instantiate a high-level parser: SSAX:make-elem-parser or SSAX:make-parser. In the latter case, the user must provide functions of specific signatures, which are called at predictable moments during the parsing: to handle character data, element data, or processing instructions (PI). The functions are always given the seed, among other parameters, and must return the new seed.

From a functional point of view, XML parsing is a combined pre-post-order traversal of a “tree” that is the XML document itself. This down-and-up traversal tells the user about an element when its start tag is encountered. The user is notified about the element once more, after all element’s children have been handled. The process of XML parsing therefore is a fold over the raw XML document. Unlike a fold over trees defined in [1], the parser is necessarily single-threaded – obviously as elements in a text XML document are laid down sequentially. The parser therefore is a tree fold that has been transformed to accept an accumulating parameter [1,2].

Formally, the denotational semantics of the parser can be expressed as

 parser:: (Start-tag -> Seed -> Seed) ->
	   (Start-tag -> Seed -> Seed -> Seed) ->
	   (Char-Data -> Seed -> Seed) ->
	   XML-text-fragment -> Seed -> Seed
 parser fdown fup fchar "<elem attrs> content </elem>" seed
  = fup "<elem attrs>" seed
	(parser fdown fup fchar "content" (fdown "<elem attrs>" seed))

 parser fdown fup fchar "char-data content" seed
  = parser fdown fup fchar "content" (fchar "char-data" seed)

 parser fdown fup fchar "elem-content content" seed
  = parser fdown fup fchar "content" (
	parser fdown fup fchar "elem-content" seed)

Compare the last two equations with the left fold

 fold-left kons elem:list seed = fold-left kons list (kons elem seed)

The real parser created by SSAX:make-parser is slightly more complicated, to account for processing instructions, entity references, namespaces, processing of document type declaration, etc.

The XML standard document referred to in this module is http://www.w3.org/TR/1998/REC-xml-19980210.html

The present file also defines a procedure that parses the text of an XML document or of a separate element into SXML, an S-expression-based model of an XML Information Set. SXML is also an Abstract Syntax Tree of an XML document. SXML is similar but not identical to DOM; SXML is particularly suitable for Scheme-based XML/HTML authoring, SXPath queries, and tree transformations. See SXML.html for more details. SXML is a term implementation of evaluation of the XML document [3]. The other implementation is context-passing.

The present frameworks fully supports the XML Namespaces Recommendation: http://www.w3.org/TR/REC-xml-names/.

Other links:

[1]

Jeremy Gibbons, Geraint Jones, "The Under-appreciated Unfold," Proc. ICFP’98, 1998, pp. 273-279.

[2]

Richard S. Bird, The promotion and accumulation strategies in transformational programming, ACM Trans. Progr. Lang. Systems, 6(4):487-504, October 1984.

[3]

Ralf Hinze, "Deriving Backtracking Monad Transformers," Functional Pearl. Proc ICFP’00, pp. 186-197.

7.21.3.3 Usage

Scheme Procedure: current-ssax-error-port
Scheme Procedure: with-ssax-error-to-port port thunk
Scheme Procedure: xml-token? _
 -- Scheme Procedure: pair? x
     Return `#t' if X is a pair; otherwise return `#f'.

 
Scheme Syntax: xml-token-kind token
Scheme Syntax: xml-token-head token
Scheme Procedure: make-empty-attlist
Scheme Procedure: attlist-add attlist name-value
Scheme Procedure: attlist-null? x

Return #t if x is the empty list, else #f.

Scheme Procedure: attlist-remove-top attlist
Scheme Procedure: attlist->alist attlist
Scheme Procedure: attlist-fold kons knil lis1
Scheme Procedure: define-parsed-entity! entity str

Define a new parsed entity. entity should be a symbol.

Instances of &entity; in XML text will be replaced with the string str, which will then be parsed.

Scheme Procedure: reset-parsed-entity-definitions!

Restore the set of parsed entity definitions to its initial state.

Scheme Procedure: ssax:uri-string->symbol uri-str
Scheme Procedure: ssax:skip-internal-dtd port
Scheme Procedure: ssax:read-pi-body-as-string port
Scheme Procedure: ssax:reverse-collect-str-drop-ws fragments
Scheme Procedure: ssax:read-markup-token port
Scheme Procedure: ssax:read-cdata-body port str-handler seed
Scheme Procedure: ssax:read-char-ref port
Scheme Procedure: ssax:read-attributes port entities
Scheme Procedure: ssax:complete-start-tag tag-head port elems entities namespaces
Scheme Procedure: ssax:read-external-id port
Scheme Procedure: ssax:read-char-data port expect-eof? str-handler seed
Scheme Procedure: ssax:xml->sxml port namespace-prefix-assig
Scheme Syntax: ssax:make-parser . kw-val-pairs
Scheme Syntax: ssax:make-pi-parser orig-handlers
Scheme Syntax: ssax:make-elem-parser my-new-level-seed my-finish-element my-char-data-handler my-pi-handlers