On this page:
13.1 A Lightweight, Built-In First Half of a Parser
13.2 Completing the Parser
13.3 Coda

13 Everything (We Will Say) About Parsing

    13.1 A Lightweight, Built-In First Half of a Parser

    13.2 Completing the Parser

    13.3 Coda

Parsing is the act of turning an input character stream into a more structured, internal representation. A common internal representation is as a tree, which programs can recursively process. For instance, given the stream

23 + 5 - 6

we might want a tree representing addition whose left node represents the number 23 and whose right node represents subtraction of 6 from 5. A parser is responsible for performing this transformation.

Parsing is a large, complex problem that is far from solved due to the difficulties of ambiguity. For instance, an alternate parse tree for the above input expression might put subtraction at the top and addition below it. We might also want to consider whether this addition operation is commutative and hence whether the order of arguments can be switched. Everything only gets much, much worse when we get to full-fledged programming languages (to say nothing of natural languages).

13.1 A Lightweight, Built-In First Half of a Parser

These problems make parsing a worthy topic in its own right, and entire books, tools, and courses are devoted to it. However, from our perspective parsing is mostly a distraction, because we want to study the parts of programming languages that are not parsing. We will therefore exploit a handy feature of Pyret to manage the transformation of input streams into trees: read-sexpr. read-sexpr consumes a prefix-parenthetical input syntax and parses it fully and unambiguously into a built-in tree made up of lists. For instance, applied to a fully-parenthesized version of the example above—The name “sexpr” comes from Lisp, Scheme, and Racket, where this syntax is called an s-expression.

check:

  read-sexpr("(+ 23 (- 5 6))") is ["+", 23, ["-", 5, 6]]

end

produce a list, whose first element is the string "+", second element is the number 23, and third element is a list; this list’s first element is the string "-", second element is the number 5, and third element is the number 6.

In this book we will use s-expressions to represent concrete syntax. This is helpful because the syntax is so different from that of Pyret, we will virtually never be confused as to what language we are reading. Since we will be writing programs to process programs, it is especially helpful to keep apart the program being processed and that doing the processing. For us, the former will be written in s-expressions and the latter in Pyret.

13.2 Completing the Parser

In principle, we can think of read-sexpr as a complete parser. However, its output is generic: it represents the token structure without offering any comment on its intent. We would instead prefer to have a representation that tells us something about the intended meaning of the terms in our language, just as we wrote at the very beginning: “representing addition”, “represents a number”, and so on.

To do this, we must first introduce a datatype that captures this representation. We will separately discuss (Representing Arithmetic) how and why we obtained this datatype, but for now let’s say it’s given to us:

data ArithC:

  | numC (n :: Number)

  | plusC (l :: ArithC, r :: ArithC)

  | multC (l :: ArithC, r :: ArithC)

end

We now need a function that will convert s-expressions into instances of this datatype. This is the other half of our parser:

fun parse(s) -> ArithC:

  if Number(s):

    numC(s)

  else if List(s):

    cases (List) s:

      | empty => raise("parse: unexpected empty list")

      | link(op, args) =>

        argL = list.index(args, 0)

        argR = list.index(args, 1)

        if op == "+":

          plusC(parse(argL), parse(argR))

        else if op == "*":

          multC(parse(argL), parse(argR))

        end

    end

  else:

    raise("parse: not number or list")

  end

end

This obeys the following tests:Note the use of a helper function inside the block of tests.

check:

  fun p(s): parse(read-sexpr(s)) end

  p("3") is numC(3)

  p("(+ 1 2)") is plusC(numC(1), numC(2))

  p("(* (+ 1 2) (* 2 5))") is

    multC(plusC(numC(1), numC(2)), multC(numC(2), numC(5)))

end

Congratulations! You have just completed your first representation of a program. From now on we can focus entirely on programs represented as recursive trees, ignoring the vagaries of surface syntax and how to get them into the tree form (though in practice, we will continue to use the s-expression notation because it’s easier to type than all those constructors). We’re finally ready to start studying programming languages!

Exercise

If the test

p("3") is numC(3)

is instead written as

p(3) is numC(3)

what happens? Why?

13.3 Coda

The s-expression syntax dates back to 1960.“Recursive functions of symbolic expressions and their computation by machine, Part I” by John McCarthy in Communications of the ACM. This syntax is often controversial amongst programmers. Observe, however, something deeply valuable that it gives us. While parsing traditional languages can be very complex, parsing this syntax is virtually trivial. Given a sequence of tokens corresponding to the input, it is absolutely straightforward to turn parenthesized sequences into s-expressions; it is equally straightforward (as we see above) to turn s-expressions into proper syntax trees. I like to call such two-level languages bicameral, in loose analogy to government legislative houses: the lower-level does rudimentary well-formedness checking, while the upper-level does deeper validity checking. (We haven’t done any of the latter yet, but we will [REF].)

The virtues of this syntax are thus manifold. The amount of code it requires is small, and can easily be embedded in many contexts. By integrating the syntax into the language, it becomes easy for programs to manipulate representations of programs (as we will see more of in [REF]). It’s therefore no surprise that even though many Lisp-based syntaxes have had wildly different semantics, they all share this syntactic legacy.

Of course, we could just use XML instead. That would be much better. Or JSON. Because that wouldn’t be anything like an s-expression at all.