### 23Processing Programs: Parsing

#### 23.1Understanding Languages by Writing Programs About Them

We will understand the nature of languages by writing programs about them. These programs will implement many interesting features of languages from different perspectives, embodied in different actions:
• An interpreter will consume programs in a language and produce the answers they are expected to produce.

• A type checker will consume programs in a language and produce either true or false, depending on whether the program has consistent type annotations.

• A pretty-printer will consume programs in a language and print them, prettified in some way.

• A verifier will consume programs in a language and check whether they satisfy some stated property.

• A transformer will consume programs in a language and produce related but different programs in the same language.

• A transformer’s first cousin, a compiler, will consume programs in a language and produce related programs in a different language (which in turn can be interpreted, type-checked, pretty-printed, verified, transformed, even compiled...).

Observe that in each of these cases, we have to begin by consuming (the representation of) a program. We will briefly discuss how we do this quickly and easily, so that in the rest of our study we can focus on the remainder of each of these actions.

#### 23.2Everything (We Will Say) About Parsing

Parsing is a very general actvity whose difficulty depends both on how complex or ambiguous the input might be, and how much stucture we expect of the parser’s output. For our purposes, we would like the parser to be maximally helpful by providing later stages as much structure as possible. This forces us to either write a very complex parser or limit the forms of legal input. We will choose the latter.

A key problem of parsing is the management of ambiguity: when a given expression can be parsed in multiple different ways. For instance, the input
 23 + 5 * 6
could parse in two different ways: either the multiplication should be done first followed by addition, or vice versa. Though simple disambiguation rules (that you probably remember from middle school) disambigiuate arithmetic, the problem is much harder for full-fledged programming languages.

Ultimately, we would like to get rid of ambiguity once-and-for-all at the very beginning of processing the program, rather than deal with it repeatedly in each of the ways we might want to process it. Thus, if we follow the standard rules of arithmetic, we would want the above program to turn into a tree that has a (representation of) addition at its root, a (representation of) 23 as its left child, multiplication as its right child, and so on. This is called an abstract syntax tree: it is “abstract” because it represents the intent of the program rather than its literal syntactic structure (spaces, indentation, etc.); it is “syntax” because it represents the program that was given; and it is usually a “tree” but not always.

As we have said, we could push the problem of disambiguation onto a parser. This is what most real languages do. Because parsing is not our concern, we are instead going to ask the program’s author to use an unambiguous syntax. Indeed, we can exploit the decades of work that has been invested into wire format to represent programs. For instance, the above expression might be written—avoiding the ambiguity induced by not properly parenthesizing the program—as:

in XML, or as
 {plus: [{number: "23"}, {mult: [{number: "5"}, {number: "6"}]}]}
in JSON.

##### 23.2.1A Lightweight, Built-In First Half of a Parser

These are both worthy notations. Instead, we will use a related, and arguably even simpler, wire format known as the s-expression:The name comes from Lisp.
 (+ 23 (* 5 6))
Pyret has built-in support for processing s-expressions, so you can use this syntax and get support from the language to process it.

Do Now!

import s-exp as S
and then try the following:
S.read-s-exp("(+ 23 (* 5 6))")
Make sure you understand the output it produced and why it produced that.

You should have seen the following output:
check:
S.read-s-exp("(+ 23 (* 5 6))") is
S.s-list([list:
S.s-sym("+"),
S.s-num(23),
S.s-list([list:
S.s-sym("*"),
S.s-num(5),
S.s-num(6)])])
end

In this book we will use s-expressions to represent concrete syntax. This is helpful because the syntax is so different from that of Pyret, we will virtually never be confused as to what language we are reading. Since we will be writing programs to process programs, it is especially helpful to keep apart the program being processed and that doing the processing. For us, the former will be written in s-expressions and the latter in Pyret.

##### 23.2.2Completing the Parser

In principle, we can think of read-s-exp as a complete parser. However, its output is generic: it represents the token structure without offering any comment on its intent. We would instead prefer to have a representation that tells us something about the intended meaning of the terms in our language, just as we wrote at the very beginning: “(representation of) multiplication”, and so on.

To do this, first let’s import the necessary libraries:
import s-exp as S
import lists as L
Now down to business. We must first introduce a datatype that captures this representation. We will separately discuss (Representing Arithmetic) how and why we obtained this datatype, but for now let’s say it’s given to us:
data ArithC:
| numC(n :: Number)
| plusC(l :: ArithC, r :: ArithC)
| multC(l :: ArithC, r :: ArithC)
end
We then need a function that will convert s-expressions into instances of this datatype. This is the other half of our parser:
fun parse(s :: S.S-Exp) -> ArithC:
cases (S.S-Exp) s:
| s-num(n) => numC(n)
cases (List) s:
| empty => raise("parse: unexpected empty list")
argL = L.get(args, 0)
argR = L.get(args, 1)
if op.s == "+":
plusC(parse(argL), parse(argR))
else if op.s == "*":
multC(parse(argL), parse(argR))
end
end
| else =>
raise("parse: not number or list")
end
end
This obeys the following tests:Note the use of a helper function inside the block of tests.
check:
p("3") is numC(3)
p("(+ 1 2)") is plusC(numC(1), numC(2))
p("(* (+ 1 2) (* 2 5))") is
multC(plusC(numC(1), numC(2)), multC(numC(2), numC(5)))
end

Congratulations! You have just completed your first representation of a program. From now on we can focus entirely on programs represented as recursive trees, ignoring the vagaries of surface syntax and how to get them into the tree form (though in practice, we will continue to use the s-expression notation because it’s easier to type than all those constructors). We’re finally ready to start studying programming languages!

Exercise

If the test
p("3") is numC(3)
p(3) is numC(3)