On this page:
12.1 Understanding Languages by Writing Programs About Them
12.2 Everything (We Will Say) About Parsing
12.2.1 A Lightweight, Built-In First Half of a Parser
12.2.2 Completing the Parser
12.2.3 Coda

12 Processing Programs: Parsing

    12.1 Understanding Languages by Writing Programs About Them

    12.2 Everything (We Will Say) About Parsing

      12.2.1 A Lightweight, Built-In First Half of a Parser

      12.2.2 Completing the Parser

      12.2.3 Coda

12.1 Understanding Languages by Writing Programs About Them

We will understand the nature of languages by writing programs about them. These programs will implement many interesting features of languages from different perspectives, embodied in different actions:
  • An interpreter will consume programs in a language and produce the answers they are expected to produce.

  • A type checker will consume programs in a language and produce either true or false, depending on whether the program has consistent type annotations.

  • A pretty-printer will consume programs in a language and print them, prettified in some way.

  • A verifier will consume programs in a language and check whether they satisfy some stated property.

  • A transformer will consume programs in a language and produce related but different programs in the same language.

  • A transformer’s first cousin, a compiler, will consume programs in a language and produce related programs in a different language (which in turn can be interpreted, type-checked, pretty-printed, verified, transformed, even compiled...).

Observe that in each of these cases, we have to begin by consuming (the representation of) a program. We will briefly discuss how we do this quickly and easily, so that in the rest of our study we can focus on the remainder of each of these actions.

12.2 Everything (We Will Say) About Parsing

Parsing is a very general actvity whose difficulty depends both on how complex or ambiguous the input might be, and how much stucture we expect of the parser’s output. For our purposes, we would like the parser to be maximally helpful by providing later stages as much structure as possible. This forces us to either write a very complex parser or limit the forms of legal input. We will choose the latter.

A key problem of parsing is the management of ambiguity: when a given expression can be parsed in multiple different ways. For instance, the input

23 + 5 * 6

could parse in two different ways: either the multiplication should be done first followed by addition, or vice versa. Though simple disambiguation rules (that you probably remember from middle school) disambigiuate arithmetic, the problem is much harder for full-fledged programming languages.

Ultimately, we would like to get rid of ambiguity once-and-for-all at the very beginning of processing the program, rather than deal with it repeatedly in each of the ways we might want to process it. Thus, if we follow the standard rules of arithmetic, we would want the above program to turn into a tree that has a (representation of) addition at its root, a (representation of) 23 as its left child, multiplication as its right child, and so on. This is called an abstract syntax tree: it is “abstract” because it represents the intent of the program rather than its literal syntactic structure (spaces, indentation, etc.); it is “syntax” because it represents the program that was given; and it is usually a “tree” but not always.

As we have said, we could push the problem of disambiguation onto a parser. This is what most real languages do. Because parsing is not our concern, we are instead going to ask the program’s author to use an unambiguous syntax. Indeed, we can exploit the decades of work that has been invested into wire format to represent programs. For instance, the above expression might be written—avoiding the ambiguity induced by not properly parenthesizing the program—as:

<plus>

  <args>

    <arg position="1">

      <number value="23"/>

    </arg>

    <arg position="2">

      <mult>

        <args>

          <arg position="1">

            <number value="5"/>

          </arg>

          <arg position="2">

            <number value="6"/>

          </arg>

        </args>

      </mult>

    </arg>

  <args>

</plus>

in XML, or as

{plus:

  [{number: "23"},

   {mult:

     [{number: "5"},

      {number: "6"}]}]}

in JSON.

12.2.1 A Lightweight, Built-In First Half of a Parser

These are both worthy notations. Instead, we will use a related, and arguably even simpler, wire format known as the s-expression:The name comes from Lisp.

(+ 23 (* 5 6))

Pyret has built-in support for processing s-expressions, so you can use this syntax and get support from the language to process it.

Do Now!

Load the s-expression library with

import s-exp as S

and then try the following:

S.read-s-exp("(+ 23 (* 5 6))")

Make sure you understand the output it produced and why it produced that.

You should have seen the following output:

check: S.read-s-exp("(+ 23 (* 5 6))") is S.s-list([list: S.s-sym("+"), S.s-num(23), S.s-list([list: S.s-sym("*"), S.s-num(5), S.s-num(6)])]) end

In this book we will use s-expressions to represent concrete syntax. This is helpful because the syntax is so different from that of Pyret, we will virtually never be confused as to what language we are reading. Since we will be writing programs to process programs, it is especially helpful to keep apart the program being processed and that doing the processing. For us, the former will be written in s-expressions and the latter in Pyret.

12.2.2 Completing the Parser

In principle, we can think of read-s-exp as a complete parser. However, its output is generic: it represents the token structure without offering any comment on its intent. We would instead prefer to have a representation that tells us something about the intended meaning of the terms in our language, just as we wrote at the very beginning: “(representation of) multiplication”, and so on.

To do this, first let’s import the necessary libraries:

import s-exp as S import lists as L

Now down to business. We must first introduce a datatype that captures this representation. We will separately discuss (Representing Arithmetic) how and why we obtained this datatype, but for now let’s say it’s given to us:

data ArithC: | numC (n :: Number) | plusC (l :: ArithC, r :: ArithC) | multC (l :: ArithC, r :: ArithC) end

We then need a function that will convert s-expressions into instances of this datatype. This is the other half of our parser:

fun parse(s :: S.S-Exp) -> ArithC: cases (S.S-Exp) s: | s-num(n) => numC(n) | s-list(shadow s) => cases (List) s: | empty => raise("parse: unexpected empty list") | link(op, args) => argL = L.index(args, 0) argR = L.index(args, 1) if op.s == "+": plusC(parse(argL), parse(argR)) else if op.s == "*": multC(parse(argL), parse(argR)) end end | else => raise("parse: not number or list") end end

This obeys the following tests:Note the use of a helper function inside the block of tests.

check: fun p(s): parse(S.read-s-exp(s)) end p("3") is numC(3) p("(+ 1 2)") is plusC(numC(1), numC(2)) p("(* (+ 1 2) (* 2 5))") is multC(plusC(numC(1), numC(2)), multC(numC(2), numC(5))) end

Congratulations! You have just completed your first representation of a program. From now on we can focus entirely on programs represented as recursive trees, ignoring the vagaries of surface syntax and how to get them into the tree form (though in practice, we will continue to use the s-expression notation because it’s easier to type than all those constructors). We’re finally ready to start studying programming languages!

Exercise

If the test

p("3") is numC(3)

is instead written as

p(3) is numC(3)

what happens? Why?

12.2.3 Coda

The s-expression syntax dates back to 1960.“Recursive functions of symbolic expressions and their computation by machine, Part I” by John McCarthy in Communications of the ACM. This syntax is often controversial amongst programmers. Observe, however, something deeply valuable that it gives us. While parsing traditional languages can be very complex, parsing this syntax is virtually trivial. Given a sequence of tokens corresponding to the input, it is absolutely straightforward to turn parenthesized sequences into s-expressions; it is equally straightforward (as we see above) to turn s-expressions into proper syntax trees. I like to call such two-level languages bicameral, in loose analogy to government legislative houses: the lower-level does rudimentary well-formedness checking, while the upper-level does deeper validity checking. (We haven’t done any of the latter yet, but we will [REF].)

The virtues of this syntax are thus manifold. The amount of code it requires is small, and can easily be embedded in many contexts. By integrating the syntax into the language, it becomes easy for programs to manipulate representations of programs (as we will see more of in [REF]). It’s therefore no surprise that even though many Lisp-based languages—from Lisp 1.5 to Common Lisp to Scheme to Racket to Clojure and more—have had wildly different semantics, they all share this syntactic legacy.

Of course, we could just use XML instead. That might be much nicer. Or JSON. Because that wouldn’t be anything like an s-expression at all.