Parsing Using Recursive-descent
Tokenization
Here's our tokenizer for Imp:
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    kind: str                    # Token type (e.g. 'LCURLY', 'VAR', 'INT')
    value: Optional[str] = None  # Token value (e.g. '{', 'x', '45')
    line: int = -1               # Line number (default = -1)
    column: int = -1             # Column number (default = -1)


def tokenize(code: str) -> list[Token]:
    """`tokenize(code)` splits `code` into a list of tokens. Tokens are matched
    according to `token_specification`. If more than one regular expression
    matches, then the first match will be preferred.
    """
    pos = 0
    line_num = 1
    column = 0
    line_start = 0
    ret = []
    while True:
        # Try the token regexes in order; the first one that matches wins.
        for (kind, regex) in token_specification:
            m = re.compile(regex).match(code, pos)
            if m is not None:
                break
        if m is None:
            # Nothing matches: we've reached the end of the input.
            ret.append(Token("EOF", "", line_num, column))
            break
        pos = m.end()
        value = m.group()
        column = m.start() - line_start
        if kind == "NEWLINE" or kind == "COMMENT":
            line_start = m.end()
            line_num += 1
            continue
        elif kind == "SKIP":
            continue
        elif kind == "VAR":
            if value == "print":
                kind = "PRINT"
        elif kind == "MISMATCH":
            raise RuntimeError(f"{value!r} unexpected at line:{line_num} col:{column}")
        ret.append(Token(kind, value, line_num, column))
    return ret
This function is parameterized over token_specification, which is a list of (token name, token regex) pairs. You will likely be able to write your tokenizer by only modifying token_specification.
token_specification = [
    ("LPAREN",   r"\("),
    ("RPAREN",   r"\)"),
    ("SEMI",     r";"),
    ("COMMENT",  r"//.*\n"),                # C++ style // comments
    ("INT",      r"[0-9]+"),
    ("VAR",      r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("SET",      r"="),
    ("BINOP",    r"[+\-*/]"),
    ("NEWLINE",  r"\n"),                    # Tokenize newlines separately from whitespace
                                            # so they can be counted
    ("SKIP",     r"[ \t]+"),                # Skip over spaces and tabs
    ("MISMATCH", r"."),                     # Any other character
]
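As a quick sanity check (assuming the definitions above are in scope), a single assignment statement tokenizes like this; whitespace and the trailing newline are consumed without producing tokens:

print([t.kind for t in tokenize("x = 1 + 2;\n")])
# ['VAR', 'SET', 'INT', 'BINOP', 'INT', 'SEMI', 'EOF']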
One issue with our tokenizer that you wouldn't run into with a real lexer is that it's important that the token regexes not overlap. If we were, for example, to extend token_specification with a "while" token, then depending on whether its regex came before or after the VAR regex, we would either be unable to create identifiers like while_id or we would lex "while" as an identifier. The simplest way to deal with this problem is to do what we do for print: it's lexed as an identifier and then converted to a PRINT token.
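For instance, if we wanted a hypothetical while keyword (Imp as given here has none), we could extend the same branch of tokenize:

        elif kind == "VAR":
            if value == "print":
                kind = "PRINT"
            elif value == "while":  # hypothetical keyword, handled like print
                kind = "WHILE"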
At the end of the input, our tokenizer emits a special EOF token. This is convenient for writing the parser, because it makes checking for the end of the input the same as checking the next token.
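For example, a method that parses a whole program might stop on that token with a check like this (a sketch; how you structure prog is up to you):

        while self.tokens[0].kind != 'EOF':
            ...  # parse the next statement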
Grammar
Here's the grammar for Imp we presented in lecture:
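A sketch of that grammar, reconstructed from the token specification above and the abstract syntax below (the notation used in lecture may have differed slightly):

<prog>  ::= <stmt>*
<stmt>  ::= print <expr> ; | <var> = <expr> ;
<expr>  ::= <int> | <var> | <expr> <binop> <expr> | ( <expr> )
<binop> ::= + | - | * | /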
This grammar isn't suitable for parsing with recursive descent because it:
- Has left-recursion in the <expr> rule,
- Is ambiguous about the precedence of arithmetic operators.
We can fix the grammar by first inlining the <binop> rule:
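As a sketch in the same notation:

<expr> ::= <int> | <var> | ( <expr> )
         | <expr> + <expr> | <expr> - <expr>
         | <expr> * <expr> | <expr> / <expr>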
then fixing precedence by splitting the <expr> nonterminal into one nonterminal for each operator precedence level:
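Again sketched; the three levels correspond to the expr, expr1, and expr2 methods of the parser below:

<expr>  ::= <expr> + <expr1> | <expr> - <expr1> | <expr1>
<expr1> ::= <expr1> * <expr2> | <expr1> / <expr2> | <expr2>
<expr2> ::= <int> | <var> | ( <expr> )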
and finally rewriting left-recursion as iteration:
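Sketched once more, with * meaning zero or more repetitions:

<expr>  ::= <expr1> ( (+ | -) <expr1> )*
<expr1> ::= <expr2> ( (* | /) <expr2> )*
<expr2> ::= <int> | <var> | ( <expr> )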
The resulting grammar is ready for us to write a parser.
Syntax
We use dataclasses to define the abstract syntax of Imp as follows:
@dataclass
class Expr:
    pass

@dataclass
class Var(Expr):
    name: str

@dataclass
class Int(Expr):
    val: int

@dataclass
class Binop(Expr):
    op: str
    lhs: Expr
    rhs: Expr

@dataclass
class Stmt:
    pass

@dataclass
class Print(Stmt):
    expr: Expr

@dataclass
class Set(Stmt):
    var: str
    expr: Expr

@dataclass
class Prog:
    stmts: list[Stmt]
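For example, the one-statement program x = 1 + 2; corresponds to the AST:

Prog([Set("x", Binop("+", Int(1), Int(2)))])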
Parsing
Here's the skeleton of our parser.
def pop_or_none(list_):
    if len(list_) > 0:
        return list_.pop(0)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens

    def expr(self):
        ...

    def expr1(self):
        ...

    def expr2(self):
        ...

    def stmt(self):
        ...

    def prog(self):
        ...
Each method should parse the corresponding nonterminal or fail. If the method fails, it should not remove tokens from the list. If it succeeds, then it should remove every token that is part of the parsed portion.
First we'll look in detail at the expr2 method.
    def expr2(self):
        match self.tokens:
            case [tok, *rest] if tok.kind == 'VAR':
                self.tokens = rest
                return Var(tok.value)
            case [tok, *rest] if tok.kind == 'INT':
                self.tokens = rest
                return Int(int(tok.value))
            case [tok, *rest] if tok.kind == 'LPAREN':
                self.tokens = rest
                ret = self.expr()
                match self.tokens:
                    case [tok, *rest] if tok.kind == 'RPAREN':
                        self.tokens = rest
                        return ret
                    case _:
                        raise RuntimeError(f"Expected ')' but got {tok}")
            case [tok, *_]:
                raise RuntimeError(f"Unexpected {tok}")
To parse an <expr2>, we look ahead at the next token. If that token is a variable name or integer, we immediately return the appropriate AST. If it is a left paren, we remove the token and recursively parse an <expr>. We then ensure that there is a matching right paren and return the expression. If at any point we get a token that we don't expect, we raise an exception.
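For example (assuming expr2 is filled in as above), parsing an <expr2> consumes only the tokens it uses and leaves the rest for the caller:

p = Parser(tokenize("42 + x"))
p.expr2()                    # Int(42)
[t.kind for t in p.tokens]   # ['BINOP', 'VAR', 'EOF']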
As discussed in class, left-associative operators require a modified parsing strategy to avoid the infinite loops caused by left-recursion. We reframed our grammar to use repetition instead of recursion. Here's how we'll write that repetition in code:
    def expr(self):
        exprs = [self.expr1()]
        while self.tokens[0].kind == 'BINOP' and self.tokens[0].value in ['+', '-']:
            exprs.append(self.tokens[0].value)
            self.tokens = self.tokens[1:]
            exprs.append(self.expr1())
        return associate(exprs)
We collect the parsed expressions and operators in a list (e.g. [Var('a'), '+', Var('b'), '-', Var('c')]), and then use the associate helper function to build the corresponding left-associated AST:
def associate(exprs):
    # Repeatedly fold the leftmost operator into a Binop node, so the
    # resulting tree is left-associated.
    if len(exprs) == 1:
        return exprs[0]
    return associate([Binop(exprs[1], exprs[0], exprs[2])] + exprs[3:])
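Applied to the example list above, associate groups to the left:

associate([Var('a'), '+', Var('b'), '-', Var('c')])
# == Binop('-', Binop('+', Var('a'), Var('b')), Var('c'))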
Testing
An important part of implementing your parser is creating tests.
We'll be using Pytest as our test runner.
Pytest will run any function whose name starts with test.
Here's an example test from our parser:
def test_prog():
    tokens = tokenize(
        """
        print (1 + 2 * 3);
        x = y / 7 * 9 + 1;
        """
    )
    assert Parser(tokens).prog() == Prog(
        [
            Print(Binop("+", Int(1), Binop("*", Int(2), Int(3)))),
            Set(
                "x",
                Binop("+", Binop("*", Binop("/", Var("y"), Int(7)), Int(9)), Int(1)),
            ),
        ]
    )
We have an example program that we tokenize and parse, and we check that it produces the expected AST.
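If the tokenizer, parser, and tests all live in one file, say imp_parser.py (the file name here is just for illustration), the tests can be run with:

python -m pytest imp_parser.py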