Syntax 1

Learning objectives

Students should be able to:

  • Give examples of syntax-related language design choices,
  • Given some example programs, propose a grammar,
  • Given a grammar, write some programs.

Intro

When you build a language, the first thing that your language users will see is the syntax: the way that programs are written down.

We need to figure out how to specify what strings correspond to valid programs.

What is syntax?

  • The form of programs–how programs can be written as text
  • There is a rich literature about syntax, and lots of tools for manipulating it
    • See lexing, parsing, formal languages, grammars
    • Unfortunately, a lot of this literature exists because we have good formalisms for thinking about syntax; it's a case of searching for our keys under the streetlight, because that's where the light is.

The syntax design space

  • We'll start thinking about syntax by looking at some of the options that language designers have when building a syntax for their language.
  • Here's a recursive factorial program written in Python [1]:
def factorial(n):
  if n == 0:
    return 1
  return n * factorial(n - 1)
  • And here's the same program written in C:
int factorial(int n) {
  return n == 0 ? 1 : n * factorial(n - 1);
}
  • These two programs have different runtime behavior of course, but what do you notice about their syntax? What are some of the design choices that differ between the two languages?
    • Python uses whitespace to denote nesting and C uses curly brackets.
    • C requires semicolons to end statements; Python ends a statement with a line break.
    • In C, we're using a ternary conditional expression; in Python, we're using a block-based if statement.
    • C requires us to write down types for the function inputs and output, but we'll skip over that for now.
  • What are some of the similarities?
    • Both languages use infix operators for comparison and arithmetic.
    • Function calls are written \(f(x)\) and require parentheses around the arguments.
  • Why might we prefer one syntax over the other?
    • People often describe Python syntax as "readable" or "natural." Do you find one of these programs to be more readable? Which one?
  • Here's a factorial program written in Fortran. Are its syntactic choices more similar to C or to Python?
INTEGER RECURSIVE FUNCTION FACTORIAL(X) RESULT(ANS)
  INTEGER, INTENT(IN) :: X

  IF (X == 0) THEN
    ANS = 1
  ELSE
    ANS = X * FACTORIAL(X-1)
  END IF
END FUNCTION FACTORIAL
  • I would say C, because it uses delimiters to define blocks, but it's using a block-based if-then-else construct that looks a bit like our Python code.
  • What are some of the new syntactic choices that we can see in this code?
    • We have a new way to delimit blocks: using keywords like END IF. We also have to write the name of the function after END FUNCTION.
    • We have a new way to return from functions: using designated return parameters. In this case, we return a value from the function by assigning it to ANS.
    • We now have to explicitly call out that our function is recursive.
  • Moving a little farther afield, this version is written in Common Lisp:
(defun factorial (n)
  (if (= n 0) 1 (* n (factorial (1- n)))))
  • We have a ternary if-then-else that looks a bit like the C version. We're using parentheses to delimit everything—blocks, function calls, expressions.
  • We're using prefix notation instead of infix notation for our arithmetic operators. That is, instead of writing x + 1, we write (+ x 1).
  • Look closely at the recursive call. The argument might look like 1 - n, but it isn't! 1- is actually a unary decrement operator! This hints at another syntactic choice: Common Lisp is very flexible about letting us write function and variable names that have symbols in them.
    • Was this a good decision for Common Lisp? On one hand, it's potentially confusing to have symbols in variable names. On the other hand, maybe we should try to give programmers as much flexibility as possible.
  • Finally, let's look at a Haskell example:
factorial :: Integer -> Integer
factorial 0 = 1
factorial n = n * (factorial $ n - 1)
  • This looks pretty different from the other programs that we've seen so far. What are some of the key differences?
  • We're splitting the program definition into multiple clauses. Instead of the if-then-else structure that we saw in the other programs, we have one clause that handles the \(n = 0\) case and one clause that handles \(n \neq 0\).
  • Function applications look pretty different. We don't need parentheses around the arguments anymore, and we have this strange $ operator in there. The $ operator means "apply," so f $ x should be read as "apply function f to argument x." (We could also write factorial $ n - 1 as factorial (n - 1), which looks a lot more like the function call syntax we're used to.)
  • Although it's not visible in this example, Haskell has Python-like significant whitespace, so indentation is an important part of the program structure.

Syntax Design Principles

Consistency

Syntax can be consistent in two ways: external and internal.

A syntax is externally consistent if it reuses syntax conventions from other languages. It's much easier for users to start writing your language if they can transfer their knowledge from another context. In practice, this means that when adding a concept (e.g. array access, while loops, function calls) to your language, you should consider reusing syntax from a similar language before you make up your own.

Internal consistency means that related operations should look similar. For example, arrays and lists offer similar operations (literals [1,2,3], access by index arr[i], length len(arr)). Consistency suggests that the syntax for these operations should be the same, or at least visually similar.

Compositionality

The meaning of a phrase should be determined by the meanings of its individual parts and the method of composition. In the case of syntax, the individual parts are the lexical tokens (e.g. for, +, ;, def).

In C++, we can write type<type<arg>>, which denotes a type parameterized over another parameterized type (e.g. list<option<int>>). In this context, >> has to be read as two separate tokens for this syntax to make sense. But C++ also has a bitwise shift operator >>, which must be a single token. We can't reason about the meaning of these phrases by looking at how the individual parts compose, because we can't even be sure what the individual parts are. (Indeed, before C++11, compilers lexed >> as a single shift token, forcing programmers to write list<option<int> > with a space.)

Concision

Shorter programs are (often) better than longer ones. While it's certainly possible to write syntax that is too terse, you should try to make common operations straightforward to write.

For example, in Rust it is common to check whether a function call returns Ok or Err and if it returns Err to bubble that error up by returning it immediately. This pattern is common enough that Rust offers special syntax for it:

maybe_fails(34)?

is equivalent to

match maybe_fails(34) {
  Ok(x) => x,
  Err(e) => return Err(e),
}

Go also expects programmers to handle errors by returning special values, but it does not provide special syntax for this common pattern. This leads to a significant amount of repeated code, where programmers manually check for an error and return it if it exists.

ret, err := maybe_fails(34)
if err != nil {
  return err
}

Ambiguity

One issue we will run into when designing syntax is ambiguity. A grammar is ambiguous when there are multiple ways to parse some phrase; that is, there are multiple possible syntax trees for a single program. This is a problem when parsing, since parsers for unambiguous (deterministic) grammars are much more efficient than parsers for ambiguous ones. It's also a problem for the semantics of the language, if the different syntax trees lead to different ASTs and then to different program semantics.

We'll discuss a few common ways that ambiguity arises in programming language syntax.

Precedence and associativity

We want to allow users of our language to write expressions naturally, without requiring excessive parentheses. Consider an expression grammar:

<expr> ::= "x" | "y" | "z"
         | <expr> "-" <expr>
         | <expr> "*" <expr>
         | "(" <expr> ")"

This grammar accepts x - y - z * z, but there are multiple possible syntax trees with different semantics:

Figure 1 (ambig2.svg): This isn't the semantics we'd intuitively expect.
Figure 2 (ambig.svg): Neither is this.
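We can see the ambiguity concretely by brute force. The helper below (my own illustration, not a standard library function) enumerates every binary syntax tree the ambiguous grammar admits for a flat sequence of operands and operators:

```python
# Enumerate every syntax tree the ambiguous grammar admits for a token
# sequence of the form: operand op operand op ... (no parentheses).
def parses(tokens):
    if len(tokens) == 1:
        return [tokens[0]]          # a lone variable parses exactly one way
    trees = []
    # Try every operator position as the root of the tree, then recurse
    # on the tokens to its left and right.
    for i in range(1, len(tokens), 2):
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees

for tree in parses(["x", "-", "y", "-", "z", "*", "z"]):
    print(tree)
```

With three operators there are five distinct trees (the third Catalan number), and only one of them matches our arithmetic intuition.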

There are two issues at play in this grammar: precedence and associativity.

Precedence

We say that an operator \(op_1\) has a higher precedence than an operator \(op_2\) if \(op_1\) binds more tightly; that is, it must be applied first. Conventionally, multiplication has a higher precedence than subtraction, which rules out (x - (y - z)) * z because there the multiplication is applied last. When designing our syntax, we need to explicitly spell out the operator precedence order. Given a precedence order, we can modify the grammar to only produce syntax trees that reflect that precedence. For example, the following grammar enforces the higher precedence of multiplication:

<expr> ::= <expr> "-" <expr> | <expr1>
<expr1> ::= <expr1> "*" <expr1> | <expr2>
<expr2> ::= "x" | "y" | "z" | "(" <expr> ")"

This grammar rules out the parse shown in Figure 1.

Associativity

Fixing operator precedence does not fully eliminate the issues with ambiguity in our grammar because we can still produce multiple parse trees that differ in the associativity of subtraction. We need to spell out how operators associate—or group together—in the absence of parentheses.

An operator can be associative, left-associative, right-associative or non-associative.

  • An associative operator has the property that the way it is grouped in the AST does not change its semantics. Examples include multiplication and addition: \(a + b + c \equiv (a + b) + c \equiv a + (b + c)\). However, in practice we want these operators to associate either left or right so that our grammar is unambiguous.
  • A left-associative operator should be grouped to the left, so that the leftmost application is the deepest in the resulting AST. Examples include subtraction and division. For example, 1 - 2 - 3 - 4 should be read as ((1 - 2) - 3) - 4.
  • A right-associative operator is grouped so that the rightmost application is deepest. Examples include exponentiation and the list constructor :. For example, in Haskell we read 1 : 2 : 3 : [] as 1 : (2 : (3 : [])), and this is equivalent to the list [1,2,3].
  • A non-associative operator can't be chained at all, usually because it has different output and argument types. For example, in many languages comparisons like < are non-associative because they consume numbers and produce booleans.
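To make the difference between left and right grouping concrete, here is a small Python sketch (my own illustration) that evaluates 1 - 2 - 3 - 4 under both groupings:

```python
from functools import reduce

nums = [1, 2, 3, 4]

# Left-associative grouping: ((1 - 2) - 3) - 4
left = reduce(lambda acc, x: acc - x, nums)

# Right-associative grouping: 1 - (2 - (3 - 4))
right = reduce(lambda acc, x: x - acc, reversed(nums))

print(left)   # -8
print(right)  # -2
```

The two groupings disagree, which is exactly why a grammar must commit subtraction to one of them (conventionally, the left).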

We can further modify our grammar to enforce the appropriate associativity:

<expr> ::= <expr> "-" <expr1> | <expr1>
<expr1> ::= <expr2> "*" <expr1> | <expr2>
<expr2> ::= "x" | "y" | "z" | "(" <expr> ")"

This ensures that we get the expected concrete syntax tree:

Figure 3 (ambig3.svg): The unique syntax tree for x - y - z * z.
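This final grammar translates directly into a hand-written recursive-descent parser. The sketch below (my own illustration; function and token names are made up) turns each nonterminal into a function. The left-recursive <expr> rule is implemented as a loop, so "-" groups to the left, while <expr1> calls itself on its right operand, so "*" groups to the right, exactly as the grammar is written:

```python
def parse(source):
    # Single-character tokens, so tokenizing is just dropping whitespace.
    toks = [c for c in source if not c.isspace()]
    tree, i = parse_expr(toks, 0)
    assert i == len(toks), "trailing input"
    return tree

def parse_expr(toks, i):
    # <expr> ::= <expr> "-" <expr1> | <expr1>, with the left recursion
    # rewritten as a loop: each "-" folds into the accumulated left tree.
    left, i = parse_expr1(toks, i)
    while i < len(toks) and toks[i] == "-":
        right, i = parse_expr1(toks, i + 1)
        left = ("-", left, right)           # left-associative
    return left, i

def parse_expr1(toks, i):
    # <expr1> ::= <expr2> "*" <expr1> | <expr2>
    left, i = parse_expr2(toks, i)
    if i < len(toks) and toks[i] == "*":
        right, i = parse_expr1(toks, i + 1)  # right-recursive
        return ("*", left, right), i
    return left, i

def parse_expr2(toks, i):
    # <expr2> ::= "x" | "y" | "z" | "(" <expr> ")"
    if toks[i] == "(":
        tree, i = parse_expr(toks, i + 1)
        assert toks[i] == ")", "expected closing parenthesis"
        return tree, i + 1
    assert toks[i] in "xyz", "expected a variable"
    return toks[i], i + 1

print(parse("x - y - z * z"))
# ('-', ('-', 'x', 'y'), ('*', 'z', 'z'))
```

Running it on our problem expression produces the single expected tree: subtraction grouped to the left, with the multiplication nested as an operand.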

Dangling else

Many languages have an if-then-else construct with an optional else clause, e.g.:

if A then B else C
if A then B

The problem arises when these constructs are nested. How should we read the following code?

if A then if B then C else D

It can be parsed as either

if A then (if B then C) else D

or

if A then (if B then C else D)

C and Java both suffer from this problem (although most style guides require the use of braces, which eliminates it). They deal with the ambiguity by always associating the else with the nearest if (the second option above).
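The nearest-if rule can also be baked into the grammar itself. The standard construction (the nonterminal names here are my own) splits statements into "matched" statements, in which every if has an else, and "unmatched" ones, so that an else can only ever attach to a fully matched if:

<stmt> ::= <matched> | <unmatched>
<matched> ::= "if" <expr> "then" <matched> "else" <matched>
            | <other-stmt>
<unmatched> ::= "if" <expr> "then" <stmt>
              | "if" <expr> "then" <matched> "else" <unmatched>

Under this grammar, if A then if B then C else D has exactly one parse, with the else paired to the inner if.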

Other languages enforce the use of braces or other delimiters, either in if constructs that have an else or in all block contexts.

Still other languages use different syntax for the if-then construction than if-then-else.

Further Reading

Footnotes:

[1] Programs taken from Rosetta Code.

Last updated: 2023-02-10 Fri 16:05