(require 'xml-parse)
or (require 'ssax)
The XML standard document referred to in this module is
http://www.w3.org/TR/1998/REC-xml-19980210.html.
The present framework fully supports the XML Namespaces Recommendation
http://www.w3.org/TR/REC-xml-names.
Given the list of fragments (some of which are text strings), reverse the list and concatenate adjacent text strings. If LIST-OF-FRAGS has zero or one element, the result of the procedure is equal? to its argument.
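The reversal and merging described above can be pictured with a small model in portable Scheme. This is only an illustrative sketch of the described behavior, not the library's code; the name reverse-collect-model is hypothetical.

```scheme
;; Illustrative model only (hypothetical name, not part of the library).
;; FRAGS is in reverse document order; the result is in document order
;; with runs of adjacent strings merged into one string.
(define (reverse-collect-model frags)
  (let loop ((frags frags) (result '()) (strs '()))
    (cond
      ((null? frags)
       (if (null? strs)
           result
           (cons (apply string-append strs) result)))
      ((string? (car frags))
       ;; Collect the run of strings; consing restores document order.
       (loop (cdr frags) result (cons (car frags) strs)))
      (else
       ;; A non-string fragment ends the current run of strings.
       (loop (cdr frags)
             (cons (car frags)
                   (if (null? strs)
                       result
                       (cons (apply string-append strs) result)))
             '())))))

(define merged (reverse-collect-model '("c" "b" (x) "a")))
;; merged is ("a" (x) "bc")
```

As the text states, a list of zero or one element comes back equal? to the argument.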
Given the list of fragments (some of which are text strings), reverse the list and concatenate adjacent text strings while dropping "insignificant" whitespace, that is, whitespace in front of, behind, and between elements. The whitespace that is included in character data is not affected.
Use this procedure to "intelligently" drop "insignificant" whitespace in the parsed SXML. If strict compliance with the XML Recommendation regarding whitespace is desired, use the ssax:reverse-collect-str procedure instead.
The following functions either skip, or build and return tokens, according to inclusion or delimiting semantics. The list of characters to expect, include, or to break at may vary from one invocation of a function to another. This allows the functions to easily parse even context-sensitive languages.
Exceptions are mentioned specifically. The list of expected characters (characters to skip until, or break-characters) may include an EOF "character", which is coded as the symbol *eof*.
The input stream to parse is specified as a PORT, which is the last argument.
Reads a character from the port and looks it up in the char-list of expected characters. If the read character was found among expected, it is returned. Otherwise, the procedure writes a message using string as a comment and quits.
Reads characters from the port and disregards them, as long as they are mentioned in the char-list. The first character (which may be EOF) peeked from the stream that is not a member of the char-list is returned.
Returns an initial buffer for ssax:next-token* procedures. ssax:init-buffer may allocate a new buffer at each invocation.
Skips any number of the prefix characters (members of the prefix-char-list), if any, and reads the sequence of characters up to (but not including) a break character, one of the break-char-list.
The string of characters thus read is returned. The break character is left on the input stream. break-char-list may include the symbol *eof*; otherwise, EOF is fatal, generating an error message including a specified comment-string.
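The delimiting semantics can be sketched in plain Scheme. The procedure below is a hypothetical model written for illustration, not the library's implementation; it assumes a Scheme that provides open-input-string (R7RS or SRFI 6) and an error procedure.

```scheme
;; Illustrative model only (hypothetical name). Skips leading prefix
;; characters, then collects characters up to, but not including, the
;; first break character, which stays on the port.
(define (next-token-model prefix-chars break-chars comment-string port)
  (let skip ()
    (let ((c (peek-char port)))
      (if (and (char? c) (memv c prefix-chars))
          (begin (read-char port) (skip)))))
  (let loop ((acc '()))
    (let ((c (peek-char port)))
      (cond
        ((eof-object? c)
         ;; EOF is acceptable only if the symbol *eof* is a break "character".
         (if (memq '*eof* break-chars)
             (list->string (reverse acc))
             (error comment-string)))
        ((memv c break-chars) (list->string (reverse acc)))
        (else (read-char port) (loop (cons c acc)))))))

(define sample-port (open-input-string "  hello, world"))
(define sample-token
  (next-token-model '(#\space) '(#\,) "reading a word" sample-port))
;; sample-token is "hello"; the comma is still unread on sample-port
```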
ssax:next-token-of is similar to ssax:next-token except that it implements inclusion rather than delimiting semantics.
Reads characters from the port that belong to the list of characters inc-charset. The reading stops at the first character which is not a member of the set. This character is left on the stream. All the read characters are returned in a string.
Reads characters from the port for which pred (a procedure of one argument) returns non-#f. The reading stops at the first character for which pred returns #f. That character is left on the stream. All the results of evaluating pred up to #f are returned in a string.
pred is a procedure that takes one argument (a character or the EOF object) and returns a character or #f. The returned character does not have to be the same as the input argument to the pred. For example,
(ssax:next-token-of
  (lambda (c)
    (cond ((eof-object? c) #f)
          ((char-alphabetic? c) (char-downcase c))
          (else #f)))
  (current-input-port))
will try to read an alphabetic token from the current input port, and return it in lower case.
Reads len characters from the port, and returns them in a string. If EOF is encountered before len characters are read, a shorter string will be returned.
TAG-KIND
A symbol ‘START’, ‘END’, ‘PI’, ‘DECL’, ‘COMMENT’, ‘CDSECT’, or ‘ENTITY-REF’ that identifies a markup token
UNRES-NAME
a name (called GI in the XML Recommendation) as given in an XML document for a markup token: start-tag, PI target, attribute name. If a GI is an NCName, UNRES-NAME is this NCName converted into a Scheme symbol. If a GI is a QName, ‘UNRES-NAME’ is a pair of symbols: (PREFIX . LOCALPART).
RES-NAME
An expanded name, a resolved version of an ‘UNRES-NAME’. For an element or an attribute name with a non-empty namespace URI, ‘RES-NAME’ is a pair of symbols, (URI-SYMB . LOCALPART). Otherwise, it’s a single symbol.
ELEM-CONTENT-MODEL
A symbol:
ANY: anything goes, expect an END tag.
EMPTY-TAG: no content, and no END-tag is coming.
EMPTY: no content, expect the END-tag as the next token.
PCDATA: expect character data only, and no children elements.
URI-SYMB
A symbol representing a namespace URI, or another symbol chosen by the user to represent the URI. In the former case, URI-SYMB is created by %-quoting of bad URI characters and converting the resulting string into a symbol.
NAMESPACES
A list representing namespaces in effect. An element of the list has one of the following forms:
(prefix uri-symb . uri-symb) or
(prefix user-prefix . uri-symb)
user-prefix is a symbol chosen by the user to represent the URI.
(#f user-prefix . uri-symb)
Specification of the user-chosen prefix and a URI-SYMB.
(*DEFAULT* user-prefix . uri-symb)
Declaration of the default namespace.
(*DEFAULT* #f . #f)
Un-declaration of the default namespace. This notation overrides the previous declaration.
A NAMESPACES list may contain several elements for the same prefix. The one closest to the beginning of the list takes effect.
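The shadowing rule falls out naturally if the list is searched front to back, as assq does. The following sketch uses made-up prefixes and URI symbols (p, u1, u2, dflt, urn:example:...) and a hypothetical helper name; it only illustrates the data layout described above.

```scheme
;; Hypothetical sample data in the (prefix user-prefix . uri-symb) layout.
(define sample-namespaces
  '((p         u2   . urn:example:v2)        ; innermost declaration of p
    (*DEFAULT* #f   . #f)                    ; default namespace un-declared
    (p         u1   . urn:example:v1)        ; outer declaration, shadowed
    (*DEFAULT* dflt . urn:example:default))) ; outer default, shadowed

;; Returns the (user-prefix . uri-symb) part of the first entry for
;; PREFIX, or #f if PREFIX is not declared at all.
(define (namespace-lookup prefix namespaces)
  (let ((entry (assq prefix namespaces)))
    (and entry (cdr entry))))

(define p-binding (namespace-lookup 'p sample-namespaces))
;; p-binding is (u2 . urn:example:v2): the entry nearest the front wins
(define default-binding (namespace-lookup '*DEFAULT* sample-namespaces))
;; default-binding is (#f . #f): the default namespace is un-declared
```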
ATTLIST
An ordered collection of (NAME . VALUE) pairs, where NAME is a RES-NAME or an UNRES-NAME. The collection is an ADT.
STR-HANDLER
A procedure of three arguments: string1 string2 seed returning a new seed. The procedure is supposed to handle a chunk of character data string1 followed by a chunk of character data string2. string2 is a short string, often ‘"\n"’ and even ‘""’.
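A minimal handler of this shape might simply push the chunks onto a list seed. The names below are hypothetical and the sketch only shows how the seed is threaded from one call to the next; the resulting list is in reverse document order, which is the kind of input ssax:reverse-collect-str is described as accepting.

```scheme
;; Hypothetical STR-HANDLER: the seed is a list of chunks, newest first.
(define (collect-str-handler string1 string2 seed)
  (if (string=? string2 "")
      (cons string1 seed)
      (cons string2 (cons string1 seed))))

(define newline-string (string #\newline))

;; Two consecutive calls, the second receiving the first call's result.
(define seed-after
  (collect-str-handler "second line" ""
    (collect-str-handler "first line" newline-string '())))
;; seed-after is ("second line" <newline-string> "first line")
```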
ENTITIES
An assoc list of pairs:
(named-entity-name . named-entity-body)
where named-entity-name is a symbol under which the entity was declared, named-entity-body is either a string, or (for an external entity) a thunk that will return an input port (from which the entity can be read). named-entity-body may also be #f. This is an indication that a named-entity-name is currently being expanded. A reference to this named-entity-name will be an error: violation of the WFC nonrecursion.
XML-TOKEN
This record represents markup which, according to the XML Recommendation, "takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions."
kind: a TAG-KIND
head: an UNRES-NAME. For XML-TOKENs of kinds 'COMMENT and 'CDSECT, the head is #f.
For example,
<P> => kind=START, head=P
</P> => kind=END, head=P
<BR/> => kind=EMPTY-EL, head=BR
<!DOCTYPE OMF ...> => kind=DECL, head=DOCTYPE
<?xml version="1.0"?> => kind=PI, head=xml
&my-ent; => kind=ENTITY-REF, head=my-ent
Character references are not represented by xml-tokens as these references are transparently resolved into the corresponding characters.
XML-DECL
The record represents a datatype of an XML document: the list of declared elements and their attributes, declared notations, list of replacement strings or loading procedures for parsed general entities, etc. Normally an XML-DECL record is created from a DTD or an XML Schema, although it can be created and filled in in many other ways (e.g., loaded from a file).
elems: an (assoc) list of decl-elem or #f. The latter instructs the parser to do no validation of elements and attributes.
decl-elem: declaration of one element:
(elem-name elem-content decl-attrs)
elem-name is an UNRES-NAME for the element.
elem-content is an ELEM-CONTENT-MODEL.
decl-attrs is an ATTLIST, of (attr-name . value) associations.
This element can declare a user procedure to handle parsing of an element (e.g., to do a custom validation, or to build a hash of IDs as they’re encountered).
decl-attr: an element of an ATTLIST, declaration of one attribute:
(attr-name content-type use-type default-value)
attr-name is an UNRES-NAME for the declared attribute.
content-type is a symbol: CDATA, NMTOKEN, NMTOKENS, … or a list of strings for the enumerated type.
use-type is a symbol: REQUIRED, IMPLIED, or FIXED.
default-value is a string for the default value, or #f if not given.
These procedures deal with primitive lexical units (Names, whitespaces, tags) and with pieces of more generic productions. Most of these parsers must be called in the appropriate context. For example, ssax:complete-start-tag must be called only when the start-tag has been detected and its GI has been read.
Skip the S (whitespace) production as defined by
[3] S ::= (#x20 | #x09 | #x0D | #x0A)
ssax:skip-s returns the first non-whitespace character it encounters while scanning the port. This character is left on the input stream.
Read a NCName starting from the current position in the port and return it as a symbol.
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
This code supports the XML Namespace Recommendation REC-xml-names, which modifies the above productions as follows:
[4] NCNameChar ::= Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender
[5] NCName ::= (Letter | '_') (NCNameChar)*
As REC-xml-names says,
"An XML document conforms to this specification if all other tokens [other than element types and attribute names] in the document which are required, for XML conformance, to match the XML production for Name, match this specification’s production for NCName."
Element types and attribute names must match the production QName, defined below.
Read a (namespace-) Qualified Name, QName, from the current position in port; and return an UNRES-NAME.
From REC-xml-names:
[6] QName ::= (Prefix ':')? LocalPart
[7] Prefix ::= NCName
[8] LocalPart ::= NCName
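The mapping from a QName to an UNRES-NAME can be illustrated on strings. The helper below is hypothetical and works on an already-read name string rather than on a port; it is a sketch of the resulting data shape only and does not check that the parts are valid NCNames.

```scheme
;; Hypothetical helper: "prefix:local" becomes (prefix . local),
;; and a name without a colon becomes a single symbol.
(define (qname-string->unres-name str)
  (let ((len (string-length str)))
    (let loop ((i 0))
      (cond
        ((= i len) (string->symbol str))
        ((char=? (string-ref str i) #\:)
         (cons (string->symbol (substring str 0 i))
               (string->symbol (substring str (+ i 1) len))))
        (else (loop (+ i 1)))))))

(define qualified (qname-string->unres-name "xsl:template"))
;; qualified is the pair (xsl . template)
(define unqualified (qname-string->unres-name "para"))
;; unqualified is the symbol para
```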
This procedure starts parsing of a markup token. The current position in the stream must be ‘<’. This procedure scans enough of the input stream to figure out what kind of a markup token it is seeing. The procedure returns an XML-TOKEN structure describing the token. Note that, in general, reading of the current markup is not finished! In particular, no attributes of the start-tag token are scanned.
Here’s a detailed breakdown of the return values and the position in the PORT when that particular value is returned:
PI-token: only the PI target is read. To finish the Processing-Instruction and disregard it, call ssax:skip-pi. ssax:read-attributes may be useful as well (for PIs whose content is attribute-value pairs).
END-token: The end tag is read completely; the current position is right after the terminating ‘>’ character.
COMMENT: is read and skipped completely. The current position is right after ‘-->’ that terminates the comment.
CDSECT: The current position is right after ‘<![CDATA[’. Use ssax:read-cdata-body to read the rest.
DECL: We have read the keyword (the one that follows ‘<!’) identifying this declaration markup. The current position is after the keyword (usually a whitespace character).
START-token: We have read the keyword (GI) of this start tag. No attributes are scanned yet. We don’t know if this tag has an empty content either. Use ssax:complete-start-tag to finish parsing of the token.
The current position is inside a PI. Skip the rest of the PI.
The current position is right after reading the PITarget. We read the body of the PI and return it as a string. The port will point to the character right after the ‘?>’ combination that terminates the PI.
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
The current position in the port is inside an internal DTD subset (e.g., after reading ‘#\[’ that begins an internal DTD subset). Skip until the ‘]>’ combination that terminates this DTD.
This procedure must be called after we have read a string ‘<![CDATA[’ that begins a CDATA section. The current position must be the first position of the CDATA body. This function reads lines of the CDATA body and passes them to a str-handler, a character data consumer.
str-handler is a procedure taking arguments: string1, string2, and seed. The first argument, string1, never contains a newline; the second argument, string2, often will.
On the first invocation of str-handler, seed is the one passed to ssax:read-cdata-body as the third argument. The result of this first invocation will be passed as the seed argument to the second invocation of the line consumer, and so on. The result of the last invocation of str-handler is returned by ssax:read-cdata-body. Note the similarity to the fundamental fold iterator.
Within a CDATA section all characters are taken at their face value, with three exceptions:
[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
This procedure must be called after we have read ‘&#’ that introduces a char reference. The procedure reads this reference and returns the corresponding char. The current position in PORT will be after the ‘;’ that terminates the char reference.
Faults detected:
WFC: XML-Spec.html#wf-Legalchar
According to Section 4.1 Character and Entity References of the XML Recommendation:
"[Definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]"
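The decoding step of production [66] can be sketched on a string holding the text between ‘&#’ and ‘;’. The helper name is hypothetical, the sketch assumes an error procedure is available, and it does not reproduce the library's port handling or its well-formedness checks.

```scheme
;; Hypothetical helper: BODY is the text between "&#" and ";",
;; for example "65" (decimal) or "x41" (hexadecimal).
(define (char-ref-body->char body)
  (let ((n (if (and (> (string-length body) 0)
                    (char=? (string-ref body 0) #\x))
               (string->number (substring body 1 (string-length body)) 16)
               (string->number body 10))))
    (if n
        (integer->char n)
        (error "malformed character reference:" body))))

(define from-decimal (char-ref-body->char "65"))
(define from-hex (char-ref-body->char "x41"))
;; both are the character A
```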
Expands and handles a parsed-entity reference.
name is a symbol, the name of the parsed entity to expand. content-handler is a procedure of arguments port, entities, and seed that returns a seed. str-handler is called if the entity in question is a pre-declared entity.
ssax:handle-parsed-entity
returns the result returned by content-handler or str-handler.
Faults detected:
WFC: XML-Spec.html#wf-entdeclared
WFC: XML-Spec.html#norecursion
Add a name-value pair to the existing attlist, preserving its sorted ascending order, and return the new list. Return #f if a pair with the same name already exists in attlist.
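Since ATTLIST is an abstract data type, its real representation is not specified here. The sketch below assumes, purely for illustration, a plain association list keyed by symbols and ordered by the symbols' printed names; the procedure name is hypothetical.

```scheme
;; Illustrative model only: ATTLIST as an alist sorted by name.
;; Returns the extended list, or #f if the name is already present.
(define (attlist-add-model attlist name-value)
  (let ((name (symbol->string (car name-value))))
    (let loop ((rest attlist))
      (cond
        ((null? rest) (list name-value))
        ((string=? name (symbol->string (caar rest))) #f)
        ((string<? name (symbol->string (caar rest)))
         (cons name-value rest))
        (else
         ;; Propagate #f from deeper in the list if a duplicate is found.
         (let ((tail (loop (cdr rest))))
           (and tail (cons (car rest) tail))))))))

(define extended
  (attlist-add-model '((a . "1") (c . "3")) '(b . "2")))
;; extended is ((a . "1") (b . "2") (c . "3"))
(define duplicate
  (attlist-add-model '((a . "1") (c . "3")) '(a . "x")))
;; duplicate is #f
```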
Given a non-null attlist, return a pair of values: the top and the rest.
This procedure reads and parses a production Attribute.
[41] Attribute ::= Name Eq AttValue
[10] AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
[25] Eq ::= S? '=' S?
The procedure returns an ATTLIST, of Name (as UNRES-NAME), Value (as string) pairs. The current character on the port is a non-whitespace character that is not an NCName-starting character.
Note the following rules to keep in mind when reading an AttValue:
Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize it as follows:
- A character reference is processed by appending the referenced character to the attribute value.
- An entity reference is processed by recursively processing the replacement text of the entity. The named entities ‘amp’, ‘lt’, ‘gt’, ‘quot’, and ‘apos’ are pre-declared.
- A whitespace character (#x20, #x0D, #x0A, #x09) is processed by appending #x20 to the normalized value, except that only a single #x20 is appended for a "#x0D#x0A" sequence that is part of an external parsed entity or the literal entity value of an internal parsed entity.
- Other characters are processed by appending them to the normalized value.
Faults detected:
WFC: XML-Spec.html#CleanAttrVals
WFC: XML-Spec.html#uniqattspec
Convert an unres-name to a RES-NAME, given the appropriate namespaces declarations. The last parameter, apply-default-ns?, determines if the default namespace applies (for instance, it does not for attribute names).
Per REC-xml-names/#nsc-NSDeclared, the "xml" prefix is considered pre-declared and bound to the namespace name "http://www.w3.org/XML/1998/namespace".
ssax:resolve-name tests for the namespace constraints:
http://www.w3.org/TR/REC-xml-names/#nsc-NSDeclared
Complete parsing of a start-tag markup. ssax:complete-start-tag must be called after the start tag token has been read. tag is an UNRES-NAME. elems is an instance of the ELEMS slot of XML-DECL; it can be #f to tell the function to do no validation of elements and their attributes.
ssax:complete-start-tag returns several values:
On exit, the current position in port will be the first character after ‘>’ that terminates the start-tag markup.
Faults detected:
VC: XML-Spec.html#enum
VC: XML-Spec.html#RequiredAttr
VC: XML-Spec.html#FixedAttr
VC: XML-Spec.html#ValueType
WFC: XML-Spec.html#uniqattspec (after namespaces prefixes are resolved)
VC: XML-Spec.html#elementvalid
WFC: REC-xml-names/#dt-NSName
Note: although the XML Recommendation does not explicitly say it, xmlns and xmlns: attributes don’t have to be declared (although they can be declared, to specify their default value).
Parses an ExternalID production:
[75] ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[13] PubidChar ::= #x20 | #x0D | #x0A | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
Call ssax:read-external-id when an ExternalID is expected; that is, the current character must be either #\S or #\P, which starts a SYSTEM or PUBLIC token, respectively. ssax:read-external-id returns the SystemLiteral as a string. A PubidLiteral is disregarded if present.
These procedures parse productions corresponding to the whole (document) entity or its higher-level pieces (prolog, root element, etc).
Scan the Misc production in the context:
[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[27] Misc ::= Comment | PI | S
Call ssax:scan-misc in the prolog or epilog contexts. In these contexts, whitespaces are completely ignored. The return value from ssax:scan-misc is either a PI-token, a DECL-token, a START token, or *EOF*. Comments are ignored and not reported.
Read the character content of an XML document or an XML element.
[43] content ::= (element | CharData | Reference | CDSect | PI | Comment)*
To be more precise, ssax:read-char-data reads CharData, expands CDSect and character entities, and skips comments. ssax:read-char-data stops at a named reference, EOF, at the beginning of a PI, or a start/end tag.
expect-eof? is a boolean indicating if EOF is normal; i.e., the character data may be terminated by the EOF. EOF is normal while processing a parsed entity.
iseed is an argument passed to the first invocation of str-handler.
ssax:read-char-data returns two results: seed and token. The seed is the result of the last invocation of str-handler, or the original iseed if str-handler was never called.
token can be either an eof-object (this can happen only if expect-eof? was #t), or:
CDATA sections and character references are expanded inline and never returned. Comments are silently disregarded.
As the XML Recommendation requires, all whitespace in character data must be preserved. However, a CR character (#x0D) must be disregarded if it appears before a LF character (#x0A), or replaced by a #x0A character otherwise. See Secs. 2.10 and 2.11 of the XML Recommendation. See also the canonical XML Recommendation.
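The CR rule can be modeled as a pure string transformation. This is an illustrative sketch with a hypothetical name, working on a whole string rather than on a port as the parser does.

```scheme
;; Illustrative model only: drop a CR that precedes an LF, and turn
;; any other CR into an LF. All other characters pass through unchanged.
(define cr (integer->char 13))
(define lf (integer->char 10))

(define (normalize-line-ends str)
  (let loop ((chars (string->list str)) (acc '()))
    (cond
      ((null? chars) (list->string (reverse acc)))
      ((char=? (car chars) cr)
       (if (and (pair? (cdr chars)) (char=? (cadr chars) lf))
           (loop (cdr chars) acc)              ; CR before LF: drop the CR
           (loop (cdr chars) (cons lf acc))))  ; lone CR: replace with LF
      (else (loop (cdr chars) (cons (car chars) acc))))))

(define normalized
  (normalize-line-ends (string #\a cr lf #\b cr #\c)))
;; normalized is the five characters a, LF, b, LF, c
```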
Make sure that token is of anticipated kind and has anticipated gi. Note that the gi argument may actually be a pair of two symbols, Namespace-URI or the prefix, and of the localname. If the assertion fails, error-cont is evaluated by passing it three arguments: token kind gi. The result of error-cont is returned.
These procedures instantiate an SSAX parser. A user can instantiate the parser to do full validation, no validation, or any particular validation. The user specifies which PIs to be notified about and what to do with the parsed character and element data. The latter handlers determine whether the parsing follows a SAX or a DOM model.
Create a parser to parse and process one Processing Instruction (PI).
my-pi-handlers is an association list of pairs (pi-tag . pi-handler) where pi-tag is an NCName symbol, the PI target; and pi-handler is a procedure taking arguments port, pi-tag, and seed.
pi-handler should read the rest of the PI up to and including the combination ‘?>’ that terminates the PI. The handler should return a new seed. One of the pi-tags may be the symbol *DEFAULT*. The corresponding handler will handle PIs that no other handler will. If the *DEFAULT* pi-tag is not specified, ssax:make-pi-parser will assume the default handler that skips the body of the PI.
ssax:make-pi-parser returns a procedure of arguments port, pi-tag, and seed that will parse the current PI according to my-pi-handlers.
Create a parser to parse and process one element, including its character content or children elements. The parser is typically applied to the root element of a document.
my-new-level-seed is a procedure taking arguments:
elem-gi attributes namespaces expected-content seed
where elem-gi is a RES-NAME of the element about to be processed.
my-new-level-seed is to generate the seed to be passed to handlers that process the content of the element.
my-finish-element is a procedure taking arguments:
elem-gi attributes namespaces parent-seed seed
my-finish-element is called when parsing of elem-gi is finished. The seed is the result from the last content parser (or from my-new-level-seed if the element has the empty content). parent-seed is the same seed as was passed to my-new-level-seed. my-finish-element is to generate a seed that will be the result of the element parser.
my-char-data-handler is a STR-HANDLER as described in Data Types above.
my-pi-handlers is as described for ssax:make-pi-parser above.
The generated parser is a procedure taking arguments:
start-tag-head port elems entities namespaces preserve-ws? seed
The procedure must be called after the start tag token has been read. start-tag-head is an UNRES-NAME from the start-element tag. ELEMS is an instance of ELEMS slot of XML-DECL.
Faults detected:
VC: XML-Spec.html#elementvalid
WFC: XML-Spec.html#GIMatch
Create an XML parser, an instance of the XML parsing framework. This will be a SAX, a DOM, or a specialized parser depending on the supplied user-handlers.
ssax:make-parser takes an even number of arguments; user-handler-tag is a symbol that identifies a procedure (or association list for PROCESSING-INSTRUCTIONS) (user-handler) that follows the tag. Given below are tags and signatures of the corresponding procedures. Not all tags have to be specified. If some are omitted, reasonable defaults will apply.
handler-procedure: port docname systemid internal-subset? seed
If internal-subset? is #t, the current position in the port is right after we have read ‘[’ that begins the internal DTD subset. We must finish reading of this subset before we return (or must call skip-internal-dtd if we aren’t interested in reading it). port at exit must be at the first symbol after the whole DOCTYPE declaration.
The handler-procedure must generate four values:
elems entities namespaces seed
elems is as defined for the ELEMS slot of XML-DECL. It may be #f to switch off validation. namespaces will typically contain user-prefixes for selected uri-symbs. The default handler-procedure skips the internal subset, if any, and returns (values #f '() '() seed).
procedure: elem-gi seed
where elem-gi is an UNRES-NAME of the root element. This procedure is called when an XML document under parsing contains no DOCTYPE declaration.
The handler-procedure, as a DOCTYPE handler procedure above, must generate four values:
elems entities namespaces seed
The default handler-procedure returns (values #f '() '() seed).
procedure: elem-gi seed
where elem-gi is an UNRES-NAME of the root element. This procedure is called when an XML document under parsing does contain the DOCTYPE declaration. The handler-procedure must generate a new seed (and verify that the name of the root element matches the doctype, if the handler so wishes). The default handler-procedure is the identity function.
procedure: see ssax:make-elem-parser, my-new-level-seed
procedure: see ssax:make-elem-parser, my-finish-element
procedure: see ssax:make-elem-parser, my-char-data-handler
association list as is passed to ssax:make-pi-parser. The default value is '().
The generated parser is a procedure of arguments port and seed. This procedure parses the document prolog and then exits to an element parser (created by ssax:make-elem-parser) to handle the rest.
[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[27] Misc ::= Comment | PI | S
[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' (markupdecl | PEReference | S)* ']' S?)? '>'
[29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment
This is an instance of the SSAX parser that returns an SXML representation of the XML document to be read from port. namespace-prefix-assig is a list of (user-prefix . uri-string) that assigns user-prefixes to certain namespaces identified by particular uri-strings. It may be an empty list. ssax:xml->sxml returns an SXML tree. On return, port is positioned at the first character after the root element.
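A minimal usage sketch follows. It assumes that SLIB is installed, that call-with-input-string is available from SLIB's string-port package, and that ssax:xml->sxml takes its arguments in the order port, namespace-prefix-assig, as the description above suggests. The wrapper name xml-string->sxml is hypothetical.

```scheme
(require 'xml-parse)
(require 'string-port)   ; assumed to provide call-with-input-string

;; Hypothetical convenience wrapper: parse XML held in a string,
;; with no user-chosen namespace prefixes.
(define (xml-string->sxml str)
  (call-with-input-string str
    (lambda (port)
      (ssax:xml->sxml port '()))))

(define doc
  (xml-string->sxml "<greeting lang=\"en\">hello</greeting>"))
;; doc now holds the SXML tree for the document
```

To parse a file instead, the same call can be made inside call-with-input-file, passing the file's port to ssax:xml->sxml.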