Next: , Previous: , Up: Textual Conversion Packages   [Contents][Index]

4.1 Precedence Parsing

(require 'precedence-parse) or (require 'parse)

This package implements:


4.1.1 Precedence Parsing Overview

This package offers improvements over previous parsers.

The notion of binding power may be unfamiliar to those accustomed to BNF grammars.

When two consecutive objects are parsed, the first might be the prefix to the second, or the second might be a suffix of the first. Comparing the left and right binding powers of the two objects decides which way to interpret them.

Objects at each level of syntactic grouping have binding powers.

A syntax tree is not built unless the rules explicitly do so. The call graph of grammar rules effectively instantiate the sytnax tree.

The JACAL symbolic math system (http://people.csail.mit.edu/jaffer/JACAL) uses precedence-parse. Its grammar definitions in the file jacal/English.scm can serve as examples of use.


4.1.2 Rule Types

Here are the higher-level syntax types and an example of each. Precedence considerations are omitted for clarity. See Grammar Rule Definition for full details.

Grammar: nofix bye exit
bye

calls the function exit with no arguments.

Grammar: prefix - negate
- 42

Calls the function negate with the argument 42.

Grammar: infix - difference
x - y

Calls the function difference with arguments x and y.

Grammar: nary + sum
x + y + z

Calls the function sum with arguments x, y, and y.

Grammar: postfix ! factorial
5 !

Calls the function factorial with the argument 5.

Grammar: prestfix set set!
set foo bar

Calls the function set! with the arguments foo and bar.

Grammar: commentfix /* */
/* almost any text here */

Ignores the comment delimited by /* and */.

Grammar: matchfix { list }
{0, 1, 2}

Calls the function list with the arguments 0, 1, and 2.

Grammar: inmatchfix ( funcall )
f(x, y)

Calls the function funcall with the arguments f, x, and y.

Grammar: delim ;
set foo bar;

delimits the extent of the restfix operator set.


4.1.3 Ruleset Definition and Use

Variable: *syn-defs*

A grammar is built by one or more calls to prec:define-grammar. The rules are appended to *syn-defs*. The value of *syn-defs* is the grammar suitable for passing as an argument to prec:parse.

Constant: *syn-ignore-whitespace*

Is a nearly empty grammar with whitespace characters set to group 0, which means they will not be made into tokens. Most rulesets will want to start with *syn-ignore-whitespace*

In order to start defining a grammar, either

(set! *syn-defs* '())

or

(set! *syn-defs* *syn-ignore-whitespace*)
Function: prec:define-grammar rule1 …

Appends rule1 … to *syn-defs*. prec:define-grammar is used to define both the character classes and rules for tokens.

Once your grammar is defined, save the value of *syn-defs* in a variable (for use when calling prec:parse).

(define my-ruleset *syn-defs*)
Function: prec:parse ruleset delim column
Function: prec:parse ruleset delim column port

The ruleset argument must be a list of rules as constructed by prec:define-grammar and extracted from *syn-defs*.

The token delim may be a character, symbol, or string. A character delim argument will match only a character token; i.e. a character for which no token-group is assigned. A symbol or string will match only a token string; i.e. a token resulting from a token group.

prec:parse reads a ruleset grammar expression delimited by delim from the given input port. prec:parse returns the next object parsable from the given input port, updating port to point to the first character past the end of the external representation of the object.

For the purpose of reporting problems in error messages, this package keeps track of the current column. Its initial value is passed as the third argument to prec:parse.

If an end of file is encountered in the input before any characters are found that can begin an object, then an end of file object is returned. If a delimiter (such as delim) is found before any characters are found that can begin an object, then #f is returned.

The port argument may be omitted, in which case it defaults to the value returned by current-input-port. It is an error to parse from a closed port.


4.1.4 Token definition

Function: tok:char-group group chars chars-proc

The argument chars may be a single character, a list of characters, or a string. Each character in chars is treated as though tok:char-group was called with that character alone.

The argument chars-proc must be a procedure of one argument, a list of characters. After tokenize has finished accumulating the characters for a token, it calls chars-proc with the list of characters. The value returned is the token which tokenize returns.

The argument group may be an exact integer or a procedure of one character argument. The following discussion concerns the treatment which the tokenizing routine, tokenize, will accord to characters on the basis of their groups.

When group is a non-zero integer, characters whose group number is equal to or exactly one less than group will continue to accumulate. Any other character causes the accumulation to stop (until a new token is to be read).

The group of zero is special. These characters are ignored when parsed pending a token, and stop the accumulation of token characters when the accumulation has already begun. Whitespace characters are usually put in group 0.

If group is a procedure, then, when triggerd by the occurence of an initial (no accumulation) chars character, this procedure will be repeatedly called with each successive character from the input stream until the group procedure returns a non-false value.

The following convenient constants are provided for use with tok:char-group.

Constant: tok:decimal-digits

Is the string "0123456789".

Constant: tok:upper-case

Is the string consisting of all upper-case letters ("ABCDEFGHIJKLMNOPQRSTUVWXYZ").

Constant: tok:lower-case

Is the string consisting of all lower-case letters ("abcdefghijklmnopqrstuvwxyz").

Constant: tok:whitespaces

Is the string consisting of all characters between 0 and 255 for which char-whitespace? returns true.


4.1.5 Nud and Led Definition

This section describes advanced features. You can skip this section on first reading.

The Null Denotation (or nud) of a token is the procedure and arguments applying for that token when Left, an unclaimed parsed expression is not extant.

The Left Denotation (or led) of a token is the procedure, arguments, and lbp applying for that token when there is a Left, an unclaimed parsed expression.

In his paper,

Pratt, V. R. Top Down Operator Precendence. SIGACT/SIGPLAN Symposium on Principles of Programming Languages, Boston, 1973, pages 41-51

the left binding power (or lbp) was an independent property of tokens. I think this was done in order to allow tokens with NUDs but not LEDs to also be used as delimiters, which was a problem for statically defined syntaxes. It turns out that dynamically binding NUDs and LEDs allows them independence.

For the rule-defining procedures that follow, the variable tk may be a character, string, or symbol, or a list composed of characters, strings, and symbols. Each element of tk is treated as though the procedure were called for each element.

Character tk arguments will match only character tokens; i.e. characters for which no token-group is assigned. Symbols and strings will both match token strings; i.e. tokens resulting from token groups.

Function: prec:make-nud tk sop arg1 …

Returns a rule specifying that sop be called when tk is parsed. If sop is a procedure, it is called with tk and arg1 … as its arguments; the resulting value is incorporated into the expression being built. Otherwise, (list sop arg1 …) is incorporated.

If no NUD has been defined for a token; then if that token is a string, it is converted to a symbol and returned; if not a string, the token is returned.

Function: prec:make-led tk sop arg1 …

Returns a rule specifying that sop be called when tk is parsed and left has an unclaimed parsed expression. If sop is a procedure, it is called with left, tk, and arg1 … as its arguments; the resulting value is incorporated into the expression being built. Otherwise, left is incorporated.

If no LED has been defined for a token, and left is set, the parser issues a warning.


4.1.6 Grammar Rule Definition

Here are procedures for defining rules for the syntax types introduced in Precedence Parsing Overview.

For the rule-defining procedures that follow, the variable tk may be a character, string, or symbol, or a list composed of characters, strings, and symbols. Each element of tk is treated as though the procedure were called for each element.

For procedures prec:delim, …, prec:prestfix, if the sop argument is #f, then the token which triggered this rule is converted to a symbol and returned. A false sop argument to the procedures prec:commentfix, prec:matchfix, or prec:inmatchfix has a different meaning.

Character tk arguments will match only character tokens; i.e. characters for which no token-group is assigned. Symbols and strings will both match token strings; i.e. tokens resulting from token groups.

Function: prec:delim tk

Returns a rule specifying that tk should not be returned from parsing; i.e. tk’s function is purely syntactic. The end-of-file is always treated as a delimiter.

Function: prec:nofix tk sop

Returns a rule specifying the following actions take place when tk is parsed:

  • If sop is a procedure, it is called with no arguments; the resulting value is incorporated into the expression being built. Otherwise, the list of sop is incorporated.
Function: prec:prefix tk sop bp rule1 …

Returns a rule specifying the following actions take place when tk is parsed:

  • The rules rule1 … augment and, in case of conflict, override rules currently in effect.
  • prec:parse1 is called with binding-power bp.
  • If sop is a procedure, it is called with the expression returned from prec:parse1; the resulting value is incorporated into the expression being built. Otherwise, the list of sop and the expression returned from prec:parse1 is incorporated.
  • The ruleset in effect before tk was parsed is restored; rule1 … are forgotten.
Function: prec:infix tk sop lbp bp rule1 …

Returns a rule declaring the left-binding-precedence of the token tk is lbp and specifying the following actions take place when tk is parsed:

  • The rules rule1 … augment and, in case of conflict, override rules currently in effect.
  • One expression is parsed with binding-power lbp. If instead a delimiter is encountered, a warning is issued.
  • If sop is a procedure, it is applied to the list of left and the parsed expression; the resulting value is incorporated into the expression being built. Otherwise, the list of sop, the left expression, and the parsed expression is incorporated.
  • The ruleset in effect before tk was parsed is restored; rule1 … are forgotten.
Function: prec:nary tk sop bp

Returns a rule declaring the left-binding-precedence of the token tk is bp and specifying the following actions take place when tk is parsed:

  • Expressions are parsed with binding-power bp as far as they are interleaved with the token tk.
  • If sop is a procedure, it is applied to the list of left and the parsed expressions; the resulting value is incorporated into the expression being built. Otherwise, the list of sop, the left expression, and the parsed expressions is incorporated.
Function: prec:postfix tk sop lbp

Returns a rule declaring the left-binding-precedence of the token tk is lbp and specifying the following actions take place when tk is parsed:

  • If sop is a procedure, it is called with the left expression; the resulting value is incorporated into the expression being built. Otherwise, the list of sop and the left expression is incorporated.
Function: prec:prestfix tk sop bp rule1 …

Returns a rule specifying the following actions take place when tk is parsed:

  • The rules rule1 … augment and, in case of conflict, override rules currently in effect.
  • Expressions are parsed with binding-power bp until a delimiter is reached.
  • If sop is a procedure, it is applied to the list of parsed expressions; the resulting value is incorporated into the expression being built. Otherwise, the list of sop and the parsed expressions is incorporated.
  • The ruleset in effect before tk was parsed is restored; rule1 … are forgotten.
Function: prec:commentfix tk stp match rule1 …

Returns rules specifying the following actions take place when tk is parsed:

  • The rules rule1 … augment and, in case of conflict, override rules currently in effect.
  • Characters are read until and end-of-file or a sequence of characters is read which matches the string match.
  • If stp is a procedure, it is called with the string of all that was read between the tk and match (exclusive).
  • The ruleset in effect before tk was parsed is restored; rule1 … are forgotten.

Parsing of commentfix syntax differs from the others in several ways. It reads directly from input without tokenizing; It calls stp but does not return its value; nay any value. I added the stp argument so that comment text could be echoed.

Function: prec:matchfix tk sop sep match rule1 …

Returns a rule specifying the following actions take place when tk is parsed:

  • The rules rule1 … augment and, in case of conflict, override rules currently in effect.
  • A rule declaring the token match a delimiter takes effect.
  • Expressions are parsed with binding-power 0 until the token match is reached. If the token sep does not appear between each pair of expressions parsed, a warning is issued.
  • If sop is a procedure, it is applied to the list of parsed expressions; the resulting value is incorporated into the expression being built. Otherwise, the list of sop and the parsed expressions is incorporated.
  • The ruleset in effect before tk was parsed is restored; rule1 … are forgotten.
Function: prec:inmatchfix tk sop sep match lbp rule1 …

Returns a rule declaring the left-binding-precedence of the token tk is lbp and specifying the following actions take place when tk is parsed:

  • The rules rule1 … augment and, in case of conflict, override rules currently in effect.
  • A rule declaring the token match a delimiter takes effect.
  • Expressions are parsed with binding-power 0 until the token match is reached. If the token sep does not appear between each pair of expressions parsed, a warning is issued.
  • If sop is a procedure, it is applied to the list of left and the parsed expressions; the resulting value is incorporated into the expression being built. Otherwise, the list of sop, the left expression, and the parsed expressions is incorporated.
  • The ruleset in effect before tk was parsed is restored; rule1 … are forgotten.

Footnotes

(4)

How do I know this? I parsed 250kbyte of random input (an e-mail file) with a non-trivial grammar utilizing all constructs.


Next: , Previous: , Up: Textual Conversion Packages   [Contents][Index]