24 Processing Programs: Parsing
24.1 Understanding Languages by Writing Programs About Them
An interpreter will consume programs in a language and produce the answers they are expected to produce.
A type checker will consume programs in a language and produce either true or false, depending on whether the program has consistent type annotations.
A pretty-printer will consume programs in a language and print them, prettified in some way.
A verifier will consume programs in a language and check whether they satisfy some stated property.
A transformer will consume programs in a language and produce related but different programs in the same language.
A transformer’s first cousin, a compiler, will consume programs in a language and produce related programs in a different language (which in turn can be interpreted, type-checked, pretty-printed, verified, transformed, even compiled...).
24.2 Everything (We Will Say) About Parsing
☛ Parsing is a very general actvity whose difficulty depends both on how complex or ambiguous the input might be, and how much stucture we expect of the parser’s output. For our purposes, we would like the parser to be maximally helpful by providing later stages as much structure as possible. This forces us to either write a very complex parser or limit the forms of legal input. We will choose the latter.
23 + 5 * 6 |
Ultimately, we would like to get rid of ambiguity once-and-for-all at the very beginning of processing the program, rather than deal with it repeatedly in each of the ways we might want to process it. Thus, if we follow the standard rules of arithmetic, we would want the above program to turn into a tree that has a (representation of) addition at its root, a (representation of) 23 as its left child, multiplication as its right child, and so on. This is called an abstract syntax tree: it is “abstract” because it represents the intent of the program rather than its literal syntactic structure (spaces, indentation, etc.); it is “syntax” because it represents the program that was given; and it is usually a “tree” but not always.
<plus> |
<args> |
<arg position="1"> |
<number value="23"/> |
</arg> |
<arg position="2"> |
<mult> |
<args> |
<arg position="1"> |
<number value="5"/> |
</arg> |
<arg position="2"> |
<number value="6"/> |
</arg> |
</args> |
</mult> |
</arg> |
</args> |
</plus> |
{plus: |
[{number: "23"}, |
{mult: |
[{number: "5"}, |
{number: "6"}]}]} |
24.2.1 A Lightweight, Built-In First Half of a Parser
(+ 23 (* 5 6)) |
Do Now!
Load the s-expression library withimport s-exp as S
and then try the following:S.read-s-exp("(+ 23 (* 5 6))")
Make sure you understand the output it produced and why it produced that.
check: S.read-s-exp("(+ 23 (* 5 6))") is S.s-list([list: S.s-sym("+"), S.s-num(23), S.s-list([list: S.s-sym("*"), S.s-num(5), S.s-num(6)])]) end
In this book we will use s-expressions to represent concrete syntax. This is helpful because the syntax is so different from that of Pyret, we will virtually never be confused as to what language we are reading. Since we will be writing programs to process programs, it is especially helpful to keep apart the program being processed and that doing the processing. For us, the former will be written in s-expressions and the latter in Pyret.
24.2.2 Completing the Parser
In principle, we can think of read-s-exp as a complete parser. However, its output is generic: it represents the token structure without offering any comment on its intent. We would instead prefer to have a representation that tells us something about the intended meaning of the terms in our language, just as we wrote at the very beginning: “(representation of) multiplication”, and so on.
import s-exp as S import lists as L
data ArithC: | numC(n :: Number) | plusC(l :: ArithC, r :: ArithC) | multC(l :: ArithC, r :: ArithC) end
fun parse(s :: S.S-Exp) -> ArithC: cases (S.S-Exp) s: | s-num(n) => numC(n) | s-list(shadow s) => cases (List) s: | empty => raise("parse: unexpected empty list") | link(op, args) => argL = L.get(args, 0) argR = L.get(args, 1) if op.s == "+": plusC(parse(argL), parse(argR)) else if op.s == "*": multC(parse(argL), parse(argR)) end end | else => raise("parse: not number or list") end end
check: fun p(s): parse(S.read-s-exp(s)) end p("3") is numC(3) p("(+ 1 2)") is plusC(numC(1), numC(2)) p("(* (+ 1 2) (* 2 5))") is multC(plusC(numC(1), numC(2)), multC(numC(2), numC(5))) end
Congratulations! You have just completed your first representation of a program. From now on we can focus entirely on programs represented as recursive trees, ignoring the vagaries of surface syntax and how to get them into the tree form (though in practice, we will continue to use the s-expression notation because it’s easier to type than all those constructors). We’re finally ready to start studying programming languages!
Exercise
If the testp("3") is numC(3)
is instead written asp(3) is numC(3)
what happens? Why?
24.2.3 Coda
The s-expression syntax dates back to 1960.“Recursive functions of symbolic expressions and their computation by machine, Part I” by John McCarthy in Communications of the ACM. This syntax is often controversial amongst programmers. Observe, however, something deeply valuable that it gives us. While parsing traditional languages can be very complex, parsing this syntax is virtually trivial. Given a sequence of tokens corresponding to the input, it is absolutely straightforward to turn parenthesized sequences into s-expressions; it is equally straightforward (as we see above) to turn s-expressions into proper syntax trees. We like to call such two-level languages bicameral,The term is introduced in PLAI. in loose analogy to government legislative houses: the lower-level does rudimentary well-formedness checking, while the upper-level does deeper validity checking.
The virtues of this syntax are thus manifold. The amount of code it
requires is small, and can easily be embedded in many contexts. By
integrating the syntax into the language, it becomes easy for programs
to manipulate representations of programs (as we will see more of in
[Desugaring: Growing the Language Without Enlarging It].
It’s therefore no surprise that even though many Lisp-based
languages—
Of course, we could just use XML instead. That might be much nicer. Or JSON. Because that wouldn’t be anything like an s-expression at all.