Yaxpo: Yet Another XML Parser for O'Caml

Mike Lin
mikelin (at) mit.edu

February 2007: I have once again revived this web page for anyone who wants to use Yaxpo. There is no realistic prospect that it will be significantly enhanced in the near future. However, the parser works and I have been using it for five years.

October 2010: Actually some additional development is happening (slowly) at GitHub. I'm still using the parser...

Yaxpo is a hand-written nonvalidating XML 1.0 + XMLNS processing suite for Objective Caml, designed for flexibility in new and unusual applications of XML. The Yaxpo parser provides all three pull, push (SAX-like), and accumulate (DOM-like) interfaces. The latter of these provides a data model complete with DOM-like constructors, accessors, mutators, and serializers, enabling you to manipulate and emit XML as well as parse it.

All of the input, callback, and descent operations in the parser are explicitly structured in the functional continuation-passing style (CPS) instead of recursion or iteration. This allows the programmer to do some interesting things in clever ways. One specific motivation was for Jabber, an open-source messaging system in which XML data is streamed across a socket during a long session. Yaxpo's CPS constructs allow the parser to be arbitrarily started and stopped whenever there is data available on the socket, without blocking and without imposing an event-driven model on the driver program.

Yaxpo currently has acceptable support for the UTF-8 encoding of Unicode only. The parser will properly recognize UTF-8 characters, and reject invalid characters. The data representation remains as byte strings of possibly multibyte UTF-8 encoded characters. It would be possible without an unreasonable amount of effort to extend the parser to read multiple encodings; however, I have not done this.

Yaxpo has no dependencies that are not included in recent versions of O'Caml.


Yaxpo Package

The latest tarball for Yaxpo can be found here. It is dated 16-Aug-2002. The parser should still be considered "alpha" code, and interfaces may still change in the future. There are still many places where code could use some performance optimization.

To build Yaxpo, simply extract the source tree and run make. The libraries and interfaces will be copied to the built directory. The libraries to link with your bytecode- and native-compiled programs are yaxpo.cma and yaxpo.cmxa, respectively. These files can be automatically put into your O'Caml libraries directory by running make install as superuser.

There are some examples below, and the OCamldoc can be viewed here. The documentation can also be built from the sources if you have OCamldoc installed; this is done with make doc.

Yaxpo provides the following modules:

Module Description
Yaxpo API
Cps_reader Abstract data reader interface, and several implementations thereof
Yaxpo XML lexing and "pull" parser
Yaxpodom DOM-like data model and support routines
Yaxposax SAX-like callback parser
Supporting modules
U_char Unicode character abstraction
Utf8_char UTF-8 character abstraction, and decoding
Utf8_lit UTF-8 literals used by the parser

Data Model

The Yaxpodom module defines a simple data model that allows the direct construction of XML data trees from primitive types, and the use of O'Caml pattern matching in XML processing.

type txt = string
type cdata = string

type qname =
    QName of txt*txt*txt                                (* (prefix,localPart,nsURI) *)

type att =
    Att of qname*txt ref                                (* (name,value) *)

type ele =
    Element of qname*att list ref*content list ref      (* (name,atts,children) *)

and content =
    Text of txt
  | CDATA of cdata
  | Comment of txt
  | PI of txt*txt                                       (* (target,value) *)
  | Subele of ele

type doc = content list ref*ele ref                     (* (prolog,docEle) *)

Certain types are ref types to allow mutation if it is desired. This makes it slightly annoying to have to remember to ref and ! in the right places, but mutation is useful for many applications.

Along with this data model are a set of accessors, mutators, serializers, and parsers that assist in its manipulation and conversion to and from XML tagged text.


Examples

1. Yaxposax

Here is a program that uses the Yaxposax parser to print a trace of XML data as it is parsed:

open Yaxpo;;
open Yaxposax;;

let trace s = print_string s; print_char '\n';;

let parse_by_sax (xml:string) =
  let reader = Cps_reader.make_utf8_string_reader xml in
  let callbacks =
    {
      start_element =
       (fun name atts kont ->
	  trace ("Start Element: " ^ (string_of_qname name));
	  List.iter
	    (fun an_att -> trace ("With att: ^ " ^ (string_of_att an_att)))
	    atts;
	  kont ());
      characters =
       (fun str kont ->
	  trace ("Characters: " ^ str);
	  kont ());
      cdata =
       (fun data kont ->
	  trace ("CDATA section: " ^ data);
	  kont ());
      comment =
       (fun str kont ->
	  trace ("Comment: " ^ str);
	  kont ());
      pi =
       (fun target value kont ->
	  trace ("PI target: " ^ target ^ " / value: " ^ value);
	  kont ());
      end_element =
       (fun name kont ->
	  trace ("End Element: " ^ (string_of_qname name));
	  kont ())
    } in
    parse_document reader callbacks (fun () -> trace "Parsing Finished!");;

parse_by_sax "<?xml version='1.0'?><element>some text<!--a comment--></element>";;

This produces the output:

PI target: xml / value: version='1.0'
Start Element: element
Characters: some text
Comment: a comment
End Element: element
Parsing Finished!

The parse_by_sax function creates a reader object on the string it is passed, then calls Yaxposax.parse_document, passing in a record filled in with callback functions. Yaxpo parses the document and calls the appropriate function as it sees each token.

This example reads in data from a string. The Cps_reader module also provides a file reader. It is possible and even easy to implement your own reader for any data source.

It is important to carefully understand the continuation-passing style (CPS) in which the callbacks are invoked. Each callback is passed an additional argument besides what you would expect: a continuation function of type (unit -> unit), which in the above example are called kont. When the callback is finished processing, instead of returning to its caller, it issues a tail-call to its continuation. The body of the continuation function resumes parsing. If a callback hypothetically wanted to halt parsing, it would simply not invoke its continuation; control would then immediately return back to the caller of parse_by_sax. However, and this is the interesting part, the continuation could have been saved in some global state, and then invoked later, "resurrecting" the parser!

Footnote: The semantics of invoking the continuation are not quite the same as they are in a language with first-class continuations and call/cc like Scheme or SML, since O'Caml only has closures to work with as "fake" continuations. In Yaxpo's CPS, if you issue a non-tail call to the continuation, control eventually returns to you; in Scheme or SML, invoking a continuation is an irreversible goto. Irreversible, that is, except by another continuation. :-)

Note that parse_document itself expects a continuation, which it invokes when it is done processing.

2. Yaxpodom

Here is a program that uses the Yaxpodom parser to load an element into a data structure:

open Yaxpodom;;

let trace s = print_string s; print_char '\n';;

let parse_by_dom (element:string) =
  let reader = Cps_reader.make_utf8_string_reader element in
  let rec print_ele the_ele =
    trace ("Start Element: " ^ (string_of_qname (ele_name the_ele)));
    trace ("with namespace: " ^ (qname_nsURI (ele_name the_ele)));
    List.iter (fun an_att -> trace ("with att: " ^ (string_of_att an_att))) (ele_atts the_ele);
    List.iter print_content (ele_children the_ele);
    trace ("End Element: " ^ (string_of_qname (ele_name the_ele)))
  and print_content =
    (function
	 Text(str) -> trace ("Text: " ^ str)
       | CDATA(dat) -> trace ("CDATA: " ^ dat)
       | Comment(str) -> trace ("Comment: " ^ str)
       | PI(target,value) -> trace ("PI target: " ^ target ^ " / value: " ^ value)
       | Subele(an_ele) -> print_ele an_ele)
  in
    print_ele (doc_ele (build_doc_tree reader));;

parse_by_dom "<ele xmlns='some-URI'>Hello!<pfx:ele xmlns:pfx='other-URI' att='value'/></ele>";;

This produces the output:

Start Element: ele
with namespace: some-URI
with att: xmlns="some-URI"
Text: Hello!
Start Element: pfx:ele
with namespace: other-URI
with att: xmlns:pfx="other-URI"
with att: att="value"
End Element: pfx:ele
End Element: ele

We simply write a recursive function that prints out the content of an element with the help of pattern matching, then we pass it the data structure yielded by Yaxpodom. Piece of cake! But don't worry, it's all CPS underneath :-)

This example also demonstrates Yaxpodom's namespace processing support. As it parses the XML, it resolves all namespace prefixes into the corresponding URIs, and fills them in as an extra field in the QName of each element an attribute.

3. Constructors and Serializers

Here is a program that constructs some data from scratch, and then emits the corresponding XML:

open Yaxpodom;;

let the_ele =
  Element(
    QName("","ele",""),
    ref [],
    ref [ Text("Hello");
	  Comment("World");
	  Subele(
	    Element(
	      QName("pfx","ele",""),
	      ref [ Att(QName("xmlns","pfx",""), ref "some-URI") ],
	      ref [])) ]);;


print_string (string_of_ele the_ele);;

This produces the output:

<ele>Hello<!--World--><pfx:ele xmlns:pfx="some-URI"/></ele>

While the syntax tends to be a bit messy, and makes you worry about parentheses almost as much as Lisp, in my opinion this is a lot nicer than splicing together strings, or calling lots of methods on lots of objects. A tree in this data model is conceptually similar to Oleg Kiselyov's SXML or PLT's X-expr system, both for Scheme. The Yaxpo system is strongly typed, which costs it dearly in syntactic unity, but on the other hand, it is considerably more difficult to screw up.

This example illustrates one small but important point: the nsURI part of each QName (the third argument to the QName constructor) is provided only as a convenience during parsing. The serializers and other methods all ignore it, so it is still your responsibility to emit the xmlns declaration attributes. It would not be terribly difficult to have the serializers compute which declarations are needed given a tree decorated with nsURIs, but I have yet to sit down and implement this.


mikelin (at) mit.edu
Last modified: Fri Aug 16 12:41:56 EDT 2002