4 Introduction to Tabular Data

← prev up next →

4 Introduction to Tabular Data

4.1 Creating Tabular Data

4.2 Processing Rows

4.2.1 Keeping

4.2.2 Ordering

4.2.3 Combining Keeping and Ordering

4.2.4 Extending

4.2.5 Transforming, Cleansing, and Normalizing

4.2.6 Selecting

4.2.7 Summary of Row-Wise Table Operations

Many interesting data in computing are tabular—i.e., like a table—in form. First we’ll see a few examples of them, before we try to identify what they have in common. Here are some of them:

An email inbox is a list of messages. For each message, your inbox stores a bunch of information: its sender, the subject line, the conversation it’s part of, the body, and quite a bit more.
A music playlist. For each song, your music player maintains a bunch of information: its name, the singer, its length, its genre, and so on.
A filesystem folder or directory. For each file, your filesystem records a name, a modification date, size, and other information.

Do Now!
Can you come up with more examples?

How about:

Responses to a party invitation.
A gradebook.
A calendar agenda.

You can think of many more in your life!

What do all these have in common? The characteristics of tabular data are:

They consists of rows and columns. For instance, each song or email message or file is a row. Each of their characteristics—the song title, the message subject, the filename—is a column.
Each row has the same columns as the other rows, in the same order.
A given column has the same type, but different columns can have different types. For instance, an email message has a sender’s name, which is a string; a subject line, which is a string; a sent date, which is a date; whether it’s been read, which is a Boolean; and so on.
The rows are usually in some particular order. For instance, the emails are ordered by which was most recently sent.

Exercise
Find the characteristics of tabular data in the other examples described above, as well as in the ones you described.

We will now learn how to program with tables and to think about decomposing tasks involving them.You can also look up the full Pyret documentation for table operations.

4.1 Creating Tabular Data

Pyret provides multiple easy ways of creating tabular data. The simplest is to define the datum in a program as follows:

table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end

That is, a table is followed by the names of the columns in their desired order, followed by a sequence of rows. Each row must contain as many data as the column declares, and in the same order.

Exercise
Change different parts of the above example—e.g., remove a necessary value from a row, add an extraneous one, remove a comma, add an extra comma, leave an extra comma at the end of a row—and see what errors you get.

Note that in a table, the order of columns matters: two tables that are otherwise identical but with different column orders are not considered equal.

check: table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end is-not table: age, name row: 30, "Alice" row: 40, "Bob" row: 25, "Carol" end end

Observe that the example above uses is-not, i.e., the test passes, meaning that the tables are not equal.

Table expressions create table values. These can be stored in variables just like numbers, strings, and images:

people = table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end

We call these literal tables when we create them with table. Pyret provides other ways to get tabular data, too! In particular, you can import tabular data from a spreadsheet, so any mechanism that lets you create such a sheet can also be used. You might:

create the sheet on your own,
create a sheet collaboratively with friends,
find data on the Web that you can import into a sheet,
create a Google Form that you get others to fill out, and obtain a sheet out of their responses

and so on. Let your imagination run wild! Once the data are in Pyret, the language doesn’t care where they came from.

4.2 Processing Rows

Let’s now learn how we can actually process a table. Pyret offers a variety of built-in operations that make it quite easy to perform interesting computations over tables. In addition, as we will see later [REF], if we don’t find these sufficient, we can write our own. For now, we’ll focus on the operations Pyret provides.

Let’s think about some of the questions we might want to ask about our data:

Which emails were sent by a particular user?
Which songs were sung by a particular artist?
Which are the most frequently played songs in a playlist?
Which are the least frequently played songs in a playlist?

We see that some of these correspond to keeping some rows and some correspond to ordering them. Pyret provides tabular operations corresponding to these.

4.2.1 Keeping

Some of us have many more messages in our inbox! Let’s imagine we had a table that represented our inbox:

email = table: sender, recipient, subject row: 'Matthias Felleisen', 'Pedro Diaz', 'Introduction' row: 'Joe Politz', 'Pedro Diaz', 'Class on Friday' row: 'Matthias Felleisen', 'Pedro Diaz', 'Book comments' row: 'Mia Minnes', 'Pedro Diaz', 'CSE8A Midterm' end

and we wanted to get a table of just the messages from Matthias.

We can keep rows from a table as follows:

sieve email using sender:
  sender == 'Matthias Felleisen'
end

says to use the email table, and specifically to employ the sender column. This operation processes every row of the table. In each row, sender refers to the value of the sender column of that row. The expression in the body (between : and end) must evaluate to a Boolean; if it is true, then Pyret keeps that row in the resulting table, otherwise it is discarded. The outcome of running this query is a fresh table with the same columns but only some (perhaps as few as none) of the rows; those rows that remain will be in the same order as in the original table.

In the same way, we can keep rows based on the artist:

sieve playlist using artist:
  (artist == 'Deep Purple') or (artist == 'Van Halen')
end

This shows that we can write complex expressions to select rows.

Exercise
Write a table for to use as playlist that works with the sieve expression above.

Exercise
Write a sieve expression on the email table above that would result in a table with zero rows.

4.2.2 Ordering

We can similarly order the rows of a table, which produces a new table that has the rows in the described order:

order playlist: play-count ascending end

orders the rows with the play-count values in ascending order, so that the earliest rows in the table tell us which songs we’ve listened to least frequently.

Note that what goes between the : and end is not an expression. Therefore, we cannot write arbitrary code here. We can only name columns and indicate which way they should be ordered.

4.2.3 Combining Keeping and Ordering

Naturally, we are not limited to performing only one of these operations. Since each of them consumes a table and produces one, we can easily combine them. Let’s first think of what we might want to do in English:

Of the emails from a particular person, which is the oldest?
Of the songs by a particular artist, which have we played the least often?

Do Now!
Take a moment to think about how you would write these with what you have seen so far.

Here is the first example:

mf-emails = sieve email using sender:
  sender == 'Matthias Felleisen'
end
order mf-emails:
  sent-date ascending
end

Note that in the order expression, we order not email, which is the table of all emails, but only mf-email, the table of just the emails from that one sender. Now, looking at the earliest rows in the result gives us the earliest emails from that one person.

Exercise
Write the second example as a composition of keep and order operations on a playlist table.

4.2.4 Extending

Sometimes, we want to create a new column whose value is based on those of existing columns. For instance, our table might reflect employee records, and have columns named hourly-wage and hours-worked, representing the corresponding quantities. We would now like to extend this table with a new column to reflect each employee’s total wage:

extend employees using hourly-wage, hours-worked: total-wage: hourly-wage * hours-worked end

This creates a new column, total-wage, whose value in each row is the product of the two named columns in that row. Pyret will put the new column at the right end; as we will soon see, we can easily change the order of columns (Selecting).

Naturally, we can combine extension with other table operations. For instance, we might have noticed that messages with short subject lines usually don’t contain high-priority tasks. Therefore, we might first extend the email table with the length of the subject line:

ext-email = extend email using subject: subject-length: string-length(subject) end order ext-email: subject-length descending end

This will create a table where the longest subject lines are at the top and the shortest subject lines are at the bottom.

4.2.5 Transforming, Cleansing, and Normalizing

There are times when a table is “almost right”, but requires a little adjusting. For instance, we might have a table of customer requests for a free sample, and want to limit each customer to at most a certain number. We might get temperature readings from different countries in different formats, and want to convert them all to one single format.Because unit errors can be dangerous! We might have a gradebook where different graders have used different levels of precision, and want to standardize all of them to have the same level of precision.

In all these cases, we want the resulting table to have the same “shape” as the original—the same columns, the same rows, in the same order—but with some of the column values transformed slightly. Pyret provides transform to do this. For instance, here is how we limit customer orders:

transform orders using count:
  count: num-min(count, 3)
end

Here’s how we round the total grades:

transform gradebook using total-grade:
  total-grade: num-round(total-grade)
end

Of course, a transformation can involve columns other than the one being transformed:

transform weather using temp, unit:
  temp:
    if unit == "F":
      fahrenheit-to-celsius(temp)
    else:
      temp
    end
  unit:
    if unit == "F":
      "C"
    else:
      unit
    end
end

This alters the table so that all temperatures are converted to celsius.

Do Now!
In this example, why do we also transform unit?

It’s because we should keep the temperature and unit in sync. If we transform the temperature but not the unit, a later user of this table might assume the unit column is accurate, and accidentally treat the converted temperature as if it were still in Fahrenheit.

4.2.6 Selecting

Finally, for presentation purposes, it is sometimes useful to see just a few of the columns, especially in tables with many of them; it can also be helpful to change the order of columns so that items that are meant to be viewed together are made adjacent. Suppose our gradebook has numerous columns representing all the intermediate scores, at the end of which is the total; when we’re done assigning grades, we want to see each student’s name with just their final score:

select name, total-grade from gradebook end

Again, we can combine this operation with others. For instance, we may want to see just the artists and songs in our playlist, sorted in order by the artist’s name:

ss = select artist, song from playlist end order ss: artist ascending end

4.2.7 Summary of Row-Wise Table Operations

We’ve seen a lot in a short span. Specifically, we have seen several operations that consume a table and produce a new one according to some criterion. It’s worth summarizing the impact each of them has in terms of key table properties (where “-” means the entry is left unchanged):

Operation	Cell contents	Row order	Number of rows	Column order	Number of columns
Keeping	-	-	reduced	-	-
Ordering	-	changed	-	-	-
Extending	existing unchanged, new computed	-	-	-	augmented
Transforming	altered	-	-	-	-
Selecting	-	-	-	changed	reduced

The italicized entries reflect how the new table may differ from the old. Note that an entry like “reduced” or “altered” should be read as potentially reduced or altered; depending on the specific operation and the content of the table, there may be no change at all. (For instance, if a table is already sorted according to the criterion given in an order expression, the row order will not change.) However, in general one should expect the kind of change described in the above grid.

Observe that both dimensions of this grid provide interesting information. Unsurprisingly, each row has at least some kind of impact on a table (otherwise the operation would be useless and would not exist). Likewise, each column also has at least one way of impacting it. Furthermore, observe that most entries leave the table unchanged: that means each operation has limited impact on the table, careful to not overstep the bounds of its mandate.

On the one hand, the decision to limit the impact of each operation means that to achieve complex tasks, we may have to compose several operations together. We have already seen examples of this earlier this chapter. However, there is also a much more subtle consequence: it also means that to achieve complex tasks, we can compose several operations and get exactly what we want. If we had fewer operations that each did more, then composing them might have various undesired or (worse) unintended consequences, making it very difficult for us to obtain exactly the answer we want. Instead, the operations above follow the principle of orthogonality: no operation shadows what any other operation does, so they can be composed freely.

As a result of having these operations, we can think of tables also algebrically. Concretely, when given a problem, we should again begin with concrete examples of what we’re starting with and where we want to end. Then we can ask ourselves questions like, “Does the number of columns stay the same, grow, or shrink?”, “Does the number of rows stay the same or shrink?”, and so on. The grid above now provides us a toolkit by which we can start to decompose the task into individual operations. Of course, we still have to think: the order of operations matters, and sometimes we have to perform an operation mutiple times. Still, this grid is a useful guide to hint us towards the operations that might help solve our problem.

← prev up next →

1	Introduction
2	Basic Data and Expressions
3	From Repeated Expressions to Functions
4	Introduction to Tabular Data
5	From Tables to Lists
6	Processing Lists
7	Introduction to Structured Data
8	Collections of Structured Data
9	Recursive Data
10	Tree-Shaped Data [EMPTY]
11	Interactive Games as Reactive Systems
12	Examples, Testing, and Program Checking
13	Functions as Data
14	Predicting Growth
15	Sets Appeal
16	[EMPTY]
17	Halloween Analysis
18	Sharing and Equality
19	Graphs
20	State, Change, and More Equality
21	Algorithms That Exploit State
22	[EMPTY]
23	Processing Programs: Parsing
24	Processing Programs: A First Look at Interpretation
25	Interpreting Conditionals
26	Interpreting Functions
27	Reasoning about Programs: A First Look at Types
28	Safety and Soundness
29	Parametric Polymorphism
30	Type Inference
31	Mutation: Structures and Variables
32	Objects: Interpretation and Types
33	Control Operations
34	Pyret for Racketeers and Schemers
35	Glossary