Precedence Parsing
(require 'precedence-parse)
or (require 'parse)
This package implements a tokenizer and a Pratt-style precedence parser, together with procedures for defining and using grammar rulesets.
Precedence Parsing Overview
This package offers improvements over previous parsers; for example, ? is substituted for missing input.
The notion of binding power may be unfamiliar to those accustomed to BNF grammars.
When two consecutive objects are parsed, the first might be the prefix to the second, or the second might be a suffix of the first. Comparing the left and right binding powers of the two objects decides which way to interpret them. For example, whether the y in x * y + z is claimed by the * or by the + is decided by comparing the binding powers of * and +.
Objects at each level of syntactic grouping have binding powers.
A syntax tree is not built unless the rules explicitly do so. The call graph of grammar rules effectively instantiates the syntax tree.
The JACAL symbolic math system (http://people.csail.mit.edu/jaffer/JACAL) uses precedence-parse. Its grammar definitions in the file jacal/English.scm can serve as examples of use.
Rule Types
Here are the higher-level syntax types and an example of each. Precedence considerations are omitted for clarity. See Grammar Rule Definition for full details.
bye
    Calls the function exit with no arguments.
- 42
    Calls the function negate with the argument 42.
x - y
    Calls the function difference with arguments x and y.
x + y + z
    Calls the function sum with arguments x, y, and z.
5 !
    Calls the function factorial with the argument 5.
set foo bar
    Calls the function set! with the arguments foo and bar.
/* almost any text here */
    Ignores the comment delimited by /* and */.
{0, 1, 2}
    Calls the function list with the arguments 0, 1, and 2.
f(x, y)
    Calls the function funcall with the arguments f, x, and y.
set foo bar;
    The trailing ; delimits the extent of the restfix operator set.
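To make these descriptions concrete: when the sop of a rule is simply a symbol naming the function (rather than a procedure), the parser incorporates a list of that symbol and the parsed arguments, so the examples above would produce Scheme data along these lines (a sketch, not output from an actual session):

(exit)               ; from bye
(negate 42)          ; from - 42
(difference x y)     ; from x - y
(sum x y z)          ; from x + y + z
(factorial 5)        ; from 5 !
(set! foo bar)       ; from set foo bar
(list 0 1 2)         ; from {0, 1, 2}
(funcall f x y)      ; from f(x, y)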
Ruleset Definition and Use
A grammar is built by one or more calls to prec:define-grammar. The rules are appended to *syn-defs*. The value of *syn-defs* is the grammar suitable for passing as an argument to prec:parse.
*syn-ignore-whitespace* is a nearly empty grammar with whitespace characters set to group 0, which means they will not be made into tokens. Most rulesets will want to start with *syn-ignore-whitespace*.
In order to start defining a grammar, evaluate either
(set! *syn-defs* '())
or
(set! *syn-defs* *syn-ignore-whitespace*)
prec:define-grammar appends rule1 … to *syn-defs*. It is used to define both the character classes and the rules for tokens.
Once your grammar is defined, save the value of *syn-defs* in a variable (for use when calling prec:parse).
(define my-ruleset *syn-defs*)
The ruleset argument must be a list of rules as constructed by
prec:define-grammar
and extracted from *syn-defs*.
The token delim may be a character, symbol, or string. A character delim argument will match only a character token; i.e. a character for which no token-group is assigned. A symbol or string will match only a token string; i.e. a token resulting from a token group.
prec:parse reads a ruleset grammar expression delimited by delim from the given input port. prec:parse returns the next object parsable from the given input port, updating port to point to the first character past the end of the external representation of the object.
For the purpose of reporting problems in error messages, this package
keeps track of the current column. Its initial value is passed
as the third argument to prec:parse.
If an end of file is encountered in the input before any characters are
found that can begin an object, then an end of file object is returned.
If a delimiter (such as delim) is found before any characters are
found that can begin an object, then #f
is returned.
The port argument may be omitted, in which case it defaults to the value returned by current-input-port. It is an error to parse from a closed port.
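Putting the pieces above together, here is a minimal sketch of defining a ruleset and parsing from a string port. The #\; delimiter, the input text, and the use of SLIB's string-port package for call-with-input-string are illustrative assumptions; since this grammar defines no token groups or NUDs, the call should simply return the first character token it finds (see Nud and Led Definition below).

(require 'precedence-parse)
(require 'string-port)                  ; provides call-with-input-string

;; Start from the whitespace-skipping base grammar, make #\; a pure
;; delimiter, and save the resulting ruleset.
(set! *syn-defs* *syn-ignore-whitespace*)
(prec:define-grammar (prec:delim #\;))
(define my-ruleset *syn-defs*)

;; Parse one object, stopping at the #\; delimiter.  The third
;; argument, 0, is the initial column used in error messages.
(call-with-input-string "x;"
  (lambda (port)
    (prec:parse my-ruleset #\; 0 port)))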
Token definition
The argument chars may be a single character, a list of
characters, or a string. Each character in chars is treated as
though tok:char-group
was called with that character alone.
The argument chars-proc must be a procedure of one argument, a
list of characters. After tokenize
has finished
accumulating the characters for a token, it calls chars-proc with
the list of characters. The value returned is the token which
tokenize
returns.
The argument group may be an exact integer or a procedure of one
character argument. The following discussion concerns the treatment
which the tokenizing routine, tokenize
, will accord to characters
on the basis of their groups.
When group is a non-zero integer, characters whose group number is equal to or exactly one less than group will continue to accumulate. Any other character causes the accumulation to stop (until a new token is to be read).
Group 0 is special: characters in group 0 are ignored while the parser is waiting for a token to begin, and they stop the accumulation of token characters once it has begun. Whitespace characters are usually put in group 0.
If group is a procedure, then, when triggered by the occurrence of an initial (no accumulation) chars character, this procedure will be repeatedly called with each successive character from the input stream until the group procedure returns a non-false value.
The following convenient constants are provided for use with tok:char-group: the string "0123456789"; the string consisting of all upper-case letters ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"); the string consisting of all lower-case letters ("abcdefghijklmnopqrstuvwxyz"); and the string consisting of all characters between 0 and 255 for which char-whitespace? returns true.
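As a sketch of how these pieces fit together, the rule below would accumulate runs of decimal digits into number tokens. The argument order shown for tok:char-group (group, then chars, then chars-proc) and the choice of group 2 are assumptions made for illustration.

;; Put the digit characters in the (arbitrary, non-zero) group 2 so that
;; consecutive digits accumulate into a single token, then convert the
;; accumulated character list into a Scheme number.
(prec:define-grammar
 (tok:char-group 2 "0123456789"         ; assumed order: group, chars, chars-proc
                 (lambda (chars)
                   (string->number (list->string chars)))))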
Nud and Led Definition
This section describes advanced features. You can skip this section on first reading.
The Null Denotation (or nud) of a token is the procedure and arguments applying for that token when Left, an unclaimed parsed expression, is not extant.
The Left Denotation (or led) of a token is the procedure, arguments, and lbp applying for that token when there is a Left, an unclaimed parsed expression.
In his paper,
Pratt, V. R. Top Down Operator Precedence. SIGACT/SIGPLAN Symposium on Principles of Programming Languages, Boston, 1973, pages 41-51,
the left binding power (or lbp) was an independent property of tokens. I think this was done in order to allow tokens with NUDs but not LEDs to also be used as delimiters, which was a problem for statically defined syntaxes. It turns out that dynamically binding NUDs and LEDs allows them independence.
For the rule-defining procedures that follow, the variable tk may be a character, string, or symbol, or a list composed of characters, strings, and symbols. Each element of tk is treated as though the procedure were called for each element.
Character tk arguments will match only character tokens; i.e. characters for which no token-group is assigned. Symbols and strings will both match token strings; i.e. tokens resulting from token groups.
Returns a rule specifying that sop be called when tk is parsed. If sop is a procedure, it is called with tk and arg1 … as its arguments; the resulting value is incorporated into the expression being built. Otherwise, (list sop arg1 …) is incorporated.
If no NUD has been defined for a token, then if that token is a string, it is converted to a symbol and returned; if not a string, the token is returned.
Returns a rule specifying that sop be called when tk is parsed and left has an unclaimed parsed expression. If sop is a procedure, it is called with left, tk, and arg1 … as its arguments; the resulting value is incorporated into the expression being built. Otherwise, left is incorporated.
If no LED has been defined for a token, and left is set, the parser issues a warning.
Grammar Rule Definition
Here are procedures for defining rules for the syntax types introduced in Rule Types.
For the rule-defining procedures that follow, the variable tk may be a character, string, or symbol, or a list composed of characters, strings, and symbols. Each element of tk is treated as though the procedure were called for each element.
For procedures prec:delim, …, prec:prestfix, if the sop argument is #f, then the token which triggered this rule is converted to a symbol and returned. A false sop argument to the procedures prec:commentfix, prec:matchfix, or prec:inmatchfix has a different meaning.
Character tk arguments will match only character tokens; i.e. characters for which no token-group is assigned. Symbols and strings will both match token strings; i.e. tokens resulting from token groups.
Returns a rule specifying that tk should not be returned from parsing; i.e. tk’s function is purely syntactic. The end-of-file is always treated as a delimiter.
Returns a rule specifying the following actions take place when tk is parsed: prec:parse1 is called with binding-power bp. If sop is a procedure, it is called with the expression returned from prec:parse1; the resulting value is incorporated into the expression being built. Otherwise, the list of sop and the expression returned from prec:parse1 is incorporated.
Returns a rule declaring the left-binding-precedence of the token tk is lbp and specifying the actions that take place when tk is parsed.
Returns a rule declaring the left-binding-precedence of the token tk is bp and specifying the actions that take place when tk is parsed.
Returns a rule declaring the left-binding-precedence of the token tk is lbp and specifying the actions that take place when tk is parsed.
Returns a rule specifying the actions that take place when tk is parsed.
Returns rules specifying the actions that take place when tk is parsed.
Parsing of commentfix syntax differs from the others in several ways: it reads directly from the input without tokenizing, and it calls stp but does not return its value, nor indeed any value. I added the stp argument so that comment text could be echoed.
Returns a rule specifying the following actions take place when tk is parsed: expressions are parsed with binding-power 0 until the token match is reached; if the token sep does not appear between each pair of expressions parsed, a warning is issued.
Returns a rule declaring the left-binding-precedence of the token tk is lbp and specifying the following actions take place when tk is parsed: expressions are parsed with binding-power 0 until the token match is reached; if the token sep does not appear between each pair of expressions parsed, a warning is issued.
How do I know this? I parsed 250kbyte of random input (an e-mail file) with a non-trivial grammar utilizing all constructs.