Parsing Using Recursive Descent

Tokenization

Here's our tokenizer for Imp:

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    kind: str                   # Token type (e.g. 'LCURLY', 'VAR', 'INT')
    value: Optional[str] = None # Token value (e.g. '{', 'x', '45')
    line: int = -1              # Line number (default = -1)
    column: int = -1            # Column number (default = -1)

def tokenize(code: str) -> list[Token]:
    """`tokenize(code)` splits `code` into a list of tokens. Tokens are matched
    according to `token_specification`. If more than one regular expression
    matches, then the first match will be preferred.

    """
    pos = 0
    line_num = 1
    column = 0
    line_start = 0
    ret = []
    while True:
        # Try each regex in order; the first one that matches at `pos` wins.
        for (kind, regex) in token_specification:
            m = re.compile(regex).match(code, pos)
            if m is not None:
                break

        # No regex matched, so we've reached the end of the input.
        if m is None:
            ret.append(Token("EOF", "", line_num, column))
            break

        pos = m.end()
        value = m.group()
        column = m.start() - line_start
        if kind == "NEWLINE" or kind == "COMMENT":
            line_start = m.end()
            line_num += 1
            continue
        elif kind == "SKIP":
            continue
        elif kind == "VAR":
            if value == "print":
                kind = "PRINT"
        elif kind == "MISMATCH":
            raise RuntimeError(f"{value!r} unexpected at line:{line_num} col:{column}")
        ret.append(Token(kind, value, line_num, column))
    return ret

This function is parameterized over token_specification, which is a list of (token kind, token regex) pairs. You will likely be able to write your tokenizer by modifying only token_specification.

token_specification = [
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI", r";"),
    ("COMMENT", r"//.*\n"), # C++ style // comments
    ("INT", r"[0-9]+"),
    ("VAR", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("SET", r"="),
    ("BINOP", r"[\+\*-/]"),
    ("NEWLINE", r"\n"),     # Tokenize newlines separately from whitespace so
			    # they can be counted
    ("SKIP", r"[ \t]+"),    # Skip over spaces and tabs
    ("MISMATCH", r"."),     # Any other character
]

One issue with our tokenizer that you wouldn't run into using a real lexer is that the token regexes must not overlap. If, for example, we extended token_specification with a "while" keyword token, then depending on where its regex appeared in the list we would either be unable to write identifiers like while_id (the keyword regex would match their prefix) or we would lex "while" itself as an identifier. The simplest way to deal with this problem is to do what we already do for print: lex the keyword as an identifier, then convert the token to the keyword kind.
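
For instance, one way to generalize the print trick to more keywords is a keyword table consulted wherever the VAR branch currently checks for "print". The sketch below is not part of the handout code; KEYWORDS, the WHILE kind, and promote_keywords are hypothetical names:

# Hypothetical keyword table mapping reserved words to token kinds.
KEYWORDS = {"print": "PRINT", "while": "WHILE"}

def promote_keywords(tokens: list[Token]) -> list[Token]:
    """Re-tag VAR tokens whose text is a reserved word as keyword tokens."""
    return [
        Token(KEYWORDS[t.value], t.value, t.line, t.column)
        if t.kind == "VAR" and t.value in KEYWORDS
        else t
        for t in tokens
    ]

Because keywords are matched as ordinary identifiers first, while_id still lexes as a single VAR token.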

At the end of the input, our tokenizer emits a special EOF token. This is convenient for writing the parser, because it makes checking for the end of the input the same as checking the next token.
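
For example, tokenizing a one-line program produces the following kinds and values (line and column fields elided):

>>> [(t.kind, t.value) for t in tokenize("x = y + 1;")]
[('VAR', 'x'), ('SET', '='), ('VAR', 'y'), ('BINOP', '+'), ('INT', '1'), ('SEMI', ';'), ('EOF', '')]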

Grammar

Here's the grammar for Imp we presented in lecture:

\begin{align}
\newcommand{\nonterm}[1]{\texttt{<#1>}}
\nonterm{int} &::= \texttt{[0-9]+} \\
\nonterm{id} &::= \texttt{[a-zA-Z]+} \\
\nonterm{binop} &::= \texttt{"+"} ~|~ \texttt{"-"} ~|~ \texttt{"*"} ~|~ \texttt{"/"} \\
\nonterm{prog} &::= (\nonterm{stmt}\ \texttt{";"})+ \\
\nonterm{stmt} &::= \texttt{print}\ \nonterm{expr} ~|~ \nonterm{id} = \nonterm{expr} \\
\nonterm{expr} &::= \nonterm{int} ~|~ \nonterm{id} ~|~ \nonterm{expr}\ \nonterm{binop}\ \nonterm{expr} ~|~ \texttt{"("}\ \nonterm{expr}\ \texttt{")"}
\end{align}

This grammar isn't suitable for parsing with recursive descent because it:

  1. Has left-recursion in the <expr> rule,
  2. Is ambiguous about the precedence of arithmetic operators (for example, 1 + 2 * 3 could parse as either (1 + 2) * 3 or 1 + (2 * 3)).

We can fix the grammar by first inlining the <binop> rule:

\begin{align}
\newcommand{\nonterm}[1]{\texttt{<#1>}}
\nonterm{expr} &::= \nonterm{int} ~|~ \nonterm{id} ~|~ \texttt{"("}\ \nonterm{expr}\ \texttt{")"} ~|~ \nonterm{expr}\ (\texttt{"+"} ~|~ \texttt{"-"} ~|~ \texttt{"*"} ~|~ \texttt{"/"})\ \nonterm{expr}
\end{align}

then fixing precedence by splitting the <expr> nonterminal into one nonterminal for each operator precedence level:

\begin{align}
\newcommand{\nonterm}[1]{\texttt{<#1>}}
\nonterm{expr} &::= \nonterm{expr1} ~|~ \nonterm{expr}\ (\texttt{"+"} ~|~ \texttt{"-"})\ \nonterm{expr1} \\
\nonterm{expr1} &::= \nonterm{expr2} ~|~ \nonterm{expr1}\ (\texttt{"*"} ~|~ \texttt{"/"})\ \nonterm{expr2} \\
\nonterm{expr2} &::= \nonterm{int} ~|~ \nonterm{id} ~|~ \texttt{"("}\ \nonterm{expr}\ \texttt{")"}
\end{align}

and finally rewriting left-recursion as iteration:

\begin{align}
\newcommand{\nonterm}[1]{\texttt{<#1>}}
\nonterm{expr} &::= \nonterm{expr1} ((\texttt{"+"} ~|~ \texttt{"-"})\ \nonterm{expr1})* \\
\nonterm{expr1} &::= \nonterm{expr2} ((\texttt{"*"} ~|~ \texttt{"/"})\ \nonterm{expr2})* \\
\nonterm{expr2} &::= \nonterm{int} ~|~ \nonterm{id} ~|~ \texttt{"("}\ \nonterm{expr}\ \texttt{")"}
\end{align}

The resulting grammar is ready for us to write a parser.

Syntax

We use dataclasses to define the abstract syntax of Imp as follows:

@dataclass
class Expr:
    pass

@dataclass
class Var(Expr):
    name: str

@dataclass
class Int(Expr):
    val: int

@dataclass
class Binop(Expr):
    op: str
    lhs: Expr
    rhs: Expr

@dataclass
class Stmt:
    pass

@dataclass
class Print(Stmt):
    expr: Expr

@dataclass
class Set(Stmt):
    var: str
    expr: Expr

@dataclass
class Prog:
    stmts: list[Stmt]
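
For example, the one-statement program x = 1 + 2 * y; corresponds to the value

Prog([Set("x", Binop("+", Int(1), Binop("*", Int(2), Var("y"))))])

with operator precedence encoded in the nesting of the Binop nodes.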

Parsing

Here's the skeleton of our parser.

def pop_or_none(list_):
    # Remove and return the first element of `list_`, or return None if it is empty.
    if len(list_) > 0:
        return list_.pop(0)

class Parser:
    def __init__(self, tokens: list[Token]):
        self.tokens = tokens

    def expr(self):
        ...

    def expr1(self):
        ...

    def expr2(self):
        ...

    def stmt(self):
        ...

    def prog(self):
        ...
Each method should parse the corresponding nonterminal or fail. If the method fails, it should not remove tokens from the list. If it succeeds, then it should remove every token that is part of the parsed portion.

First we'll look in detail at the expr2 method.

match self.tokens:
    case [tok, *rest] if tok.kind == 'VAR':
        self.tokens = rest
        return Var(tok.value)

    case [tok, *rest] if tok.kind == 'INT':
        self.tokens = rest
        return Int(int(tok.value))

    case [tok, *rest] if tok.kind == 'LPAREN':
        self.tokens = rest
        ret = self.expr()
        match self.tokens:
            case [tok, *rest] if tok.kind == 'RPAREN':
                self.tokens = rest
                return ret
            case [tok, *_]:
                raise RuntimeError(f"Expected ')' but got {tok}")

    case [tok, *_]:
        raise RuntimeError(f"Unexpected {tok}")

To parse an <expr2>, we look ahead at the next token. If that token is a variable name or an integer literal, we consume it and return the corresponding AST node. If it is a left paren, we consume it and recursively parse an <expr>. We then check for the matching right paren, consume it, and return the expression. If at any point we see a token we don't expect, we raise an exception.

As discussed in class, left-associative operators require a modified parsing strategy to avoid the infinite loops caused by left-recursion. We reframed our grammar to use repetition instead of recursion. Here's how we write that repetition in code; this is the body of the expr method, which handles the lowest-precedence operators + and -:

exprs = [self.expr1()]
while self.tokens[0].kind == 'BINOP' and self.tokens[0].value in ['+', '-']:
    exprs.append(self.tokens[0].value)
    self.tokens = self.tokens[1:]
    exprs.append(self.expr1())

return associate(exprs)

We collect the expressions and operators in a flat list (e.g. [ Var('a'), '+', Var('b'), '-', Var('c') ]), and then use the associate helper function to build the corresponding left-associated AST:

def associate(exprs):
    # Fold a flat [expr, op, expr, op, expr, ...] list into a left-leaning Binop tree.
    if len(exprs) == 1:
        return exprs[0]
    return associate([Binop(exprs[1], exprs[0], exprs[2])] + exprs[3:])
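
For the example list above, associate produces a tree that leans to the left:

associate([Var("a"), "+", Var("b"), "-", Var("c")])
# => Binop(op='-', lhs=Binop(op='+', lhs=Var(name='a'), rhs=Var(name='b')), rhs=Var(name='c'))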

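The remaining methods (expr1, stmt, and prog) follow the same two patterns: match on the next token to choose an alternative, and loop for repetition. As an illustration, here is one possible sketch of stmt; it assumes expr is written as described above, and your version may differ:

def stmt(self):
    # <stmt> ::= "print" <expr> | <id> "=" <expr>
    match self.tokens:
        case [tok, *rest] if tok.kind == 'PRINT':
            self.tokens = rest
            return Print(self.expr())
        case [var, eq, *rest] if var.kind == 'VAR' and eq.kind == 'SET':
            self.tokens = rest
            return Set(var.value, self.expr())
        case [tok, *_]:
            raise RuntimeError(f"Expected a statement but got {tok}")

prog then repeatedly calls stmt, consuming a SEMI after each statement, until it reaches the EOF token.
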
Testing

An important part of implementing your parser is creating tests. We'll be using Pytest as our test runner. Pytest will run any function whose name starts with test. Here's an example test from our parser:

def test_prog():
    tokens = tokenize(
        """
    print (1 + 2 * 3);
    x = y / 7 * 9 + 1;
    """
    )
    assert Parser(tokens).prog() == Prog(
        [
            Print(Binop("+", Int(1), Binop("*", Int(2), Int(3)))),
            Set(
                "x",
                Binop("+", Binop("*", Binop("/", Var("y"), Int(7)), Int(9)), Int(1)),
            ),
        ]
    )

We have an example program that we tokenize and parse, and we check that it produces the expected AST.
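
It's also worth writing tests for programs that should be rejected. Here's a sketch using pytest.raises; it assumes expr and expr1 have been filled in to follow the grammar, so that the unclosed paren below is caught by the error case in expr2:

import pytest

def test_unclosed_paren():
    # The '(' is never closed, so expr2 should raise a RuntimeError.
    with pytest.raises(RuntimeError):
        Parser(tokenize("(1 + 2;")).expr()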

Last updated: 2023-05-04 Thu 20:26