13 Everything (We Will Say) About Parsing
23 + 5 - 6 |
Parsing is a large, complex problem that is far from solved due to the difficulties of ambiguity. For instance, an alternate parse tree for the above input expression might put subtraction at the top and addition below it. We might also want to consider whether this addition operation is commutative and hence whether the order of arguments can be switched. Everything only gets much, much worse when we get to full-fledged programming languages (to say nothing of natural languages).
13.1 A Lightweight, Built-In First Half of a Parser
check:
read-sexpr("(+ 23 (- 5 6))") is ["+", 23, ["-", 5, 6]]
end
In this book we will use s-expressions to represent concrete syntax. This is helpful because the syntax is so different from that of Pyret, we will virtually never be confused as to what language we are reading. Since we will be writing programs to process programs, it is especially helpful to keep apart the program being processed and that doing the processing. For us, the former will be written in s-expressions and the latter in Pyret.
13.2 Completing the Parser
In principle, we can think of read-sexpr as a complete parser. However, its output is generic: it represents the token structure without offering any comment on its intent. We would instead prefer to have a representation that tells us something about the intended meaning of the terms in our language, just as we wrote at the very beginning: “representing addition”, “represents a number”, and so on.
data ArithC:
| numC (n :: Number)
| plusC (l :: ArithC, r :: ArithC)
| multC (l :: ArithC, r :: ArithC)
end
fun parse(s) -> ArithC:
if Number(s):
numC(s)
else if List(s):
cases (List) s:
| empty => raise("parse: unexpected empty list")
| link(op, args) =>
argL = list.index(args, 0)
argR = list.index(args, 1)
if op == "+":
plusC(parse(argL), parse(argR))
else if op == "*":
multC(parse(argL), parse(argR))
end
end
else:
raise("parse: not number or list")
end
end
check:
fun p(s): parse(read-sexpr(s)) end
p("3") is numC(3)
p("(+ 1 2)") is plusC(numC(1), numC(2))
p("(* (+ 1 2) (* 2 5))") is
multC(plusC(numC(1), numC(2)), multC(numC(2), numC(5)))
end
Congratulations! You have just completed your first representation of a program. From now on we can focus entirely on programs represented as recursive trees, ignoring the vagaries of surface syntax and how to get them into the tree form (though in practice, we will continue to use the s-expression notation because it’s easier to type than all those constructors). We’re finally ready to start studying programming languages!
If the test
p("3") is numC(3)
is instead written as
p(3) is numC(3)
what happens? Why?
13.3 Coda
The s-expression syntax dates back to 1960.“Recursive functions of symbolic expressions and their computation by machine, Part I” by John McCarthy in Communications of the ACM. This syntax is often controversial amongst programmers. Observe, however, something deeply valuable that it gives us. While parsing traditional languages can be very complex, parsing this syntax is virtually trivial. Given a sequence of tokens corresponding to the input, it is absolutely straightforward to turn parenthesized sequences into s-expressions; it is equally straightforward (as we see above) to turn s-expressions into proper syntax trees. I like to call such two-level languages bicameral, in loose analogy to government legislative houses: the lower-level does rudimentary well-formedness checking, while the upper-level does deeper validity checking. (We haven’t done any of the latter yet, but we will [REF].)
The virtues of this syntax are thus manifold. The amount of code it requires is small, and can easily be embedded in many contexts. By integrating the syntax into the language, it becomes easy for programs to manipulate representations of programs (as we will see more of in [REF]). It’s therefore no surprise that even though many Lisp-based syntaxes have had wildly different semantics, they all share this syntactic legacy.
Of course, we could just use XML instead. That would be much better. Or JSON. Because that wouldn’t be anything like an s-expression at all.