On this page:
7.1 Creating Tabular Data
7.2 Extracting Rows and Cell Values
7.3 Functions over Rows
7.4 Processing Rows
7.4.1 Finding Rows
7.4.2 Ordering Rows
7.4.3 Adding New Columns
7.4.4 Calculating New Column Values
7.5 Examples for Table-Producing Functions

7 Introduction to Tabular Data

    7.1 Creating Tabular Data

    7.2 Extracting Rows and Cell Values

    7.3 Functions over Rows

    7.4 Processing Rows

      7.4.1 Finding Rows

      7.4.2 Ordering Rows

      7.4.3 Adding New Columns

      7.4.4 Calculating New Column Values

    7.5 Examples for Table-Producing Functions

Many interesting data in computing are tabulari.e., like a table—in form. First we’ll see a few examples of them, before we try to identify what they have in common. Here are some of them:
  • An email inbox is a list of messages. For each message, your inbox stores a bunch of information: its sender, the subject line, the conversation it’s part of, the body, and quite a bit more.

  • A music playlist. For each song, your music player maintains a bunch of information: its name, the singer, its length, its genre, and so on.

  • A filesystem folder or directory. For each file, your filesystem records a name, a modification date, size, and other information.

Do Now!

Can you come up with more examples?

How about:
  • Responses to a party invitation.

  • A gradebook.

  • A calendar agenda.

You can think of many more in your life!

What do all these have in common? The characteristics of tabular data are:
  • They consists of rows and columns. For instance, each song or email message or file is a row. Each of their characteristics—the song title, the message subject, the filename—is a column.

  • Each row has the same columns as the other rows, in the same order.

  • A given column has the same type, but different columns can have different types. For instance, an email message has a sender’s name, which is a string; a subject line, which is a string; a sent date, which is a date; whether it’s been read, which is a Boolean; and so on.

  • The rows might be in some particular order. For instance, the emails are ordered by which was most recently sent.

Exercise

Find the characteristics of tabular data in the other examples described above, as well as in the ones you described.

We will now learn how to program with tables and to think about decomposing tasks involving them.You can also look up the full Pyret documentation for table operations.

7.1 Creating Tabular Data

Pyret provides multiple easy ways of creating tabular data. The simplest is to define the datum in a program as follows:

table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end

That is, a table is followed by the names of the columns in their desired order, followed by a sequence of rows. Each row must contain as many data as the column declares, and in the same order.

Exercise

Change different parts of the above example—e.g., remove a necessary value from a row, add an extraneous one, remove a comma, add an extra comma, leave an extra comma at the end of a row—and see what errors you get.

Note that in a table, the order of columns matters: two tables that are otherwise identical but with different column orders are not considered equal.

check: table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end is-not table: age, name row: 30, "Alice" row: 40, "Bob" row: 25, "Carol" end end

Observe that the example above uses is-not, i.e., the test passes, meaning that the tables are not equal.

Table expressions create table values. These can be stored in variables just like numbers, strings, and images:

people = table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end

We call these literal tables when we create them with table. Pyret provides other ways to get tabular data, too! In particular, you can import tabular data from a spreadsheet, so any mechanism that lets you create such a sheet can also be used. You might:
  • create the sheet on your own,

  • create a sheet collaboratively with friends,

  • find data on the Web that you can import into a sheet,

  • create a Google Form that you get others to fill out, and obtain a sheet out of their responses

and so on. Let your imagination run wild! Once the data are in Pyret, the language doesn’t care where they came from.

With tables, we begin to explore data that contain other (smaller) pieces of data. We’ll refer to such data as compound data. Compound data organizes its inner data in a structured way (here, rows and columns). As with images, when we wrote code that reflected the structure of the final image, we will see that code that works with tables also follows the structure of the data.

7.2 Extracting Rows and Cell Values

Given a table, we sometimes want to look up the value of a particular cell. We’ll work with the following table showing the number of riders on a shuttle service over several months:

shuttle = table: month, riders row: "Jan", 1123 row: "Feb", 1045 row: "Mar", 1087 row: "Apr", 999 end

Let’s say we need to extract the number of riders in March (1087) so we can use it in another computation. How do we do that?

Pyret (and most other programming languages designed for data analysis) organize tables as collections of rows with shared columns. Given that organization, we get to a specific cell by first isolating the row we are interested in, then retrieving the contents of the cell.

Pyret numbers the rows of a table from top to bottom starting at 0 (most programming languages use 0 as the first position in a piece of data, for reasons we will see later). So if we want to see the data for March, we need to isolate row 2. We write:

shuttle.row-n(2)

We use the period notation to dig into a piece of compound data. Here, we are saying "dig into the shuttle table, extracting row number 2" (which is really the third row since Pyret counts positions from 0).

If we run this expression at the prompt, we get

This is a new type of data called a Row. When Pyret displays a Row value, it shows you the column names and the values of the associated cells.

To extract the value of a specific column within a row, we write the row followed by the name of the column (as a string) in square brackets. Here are two equivalent ways of getting the value of the riders column from the row for March:

shuttle.row-n(2)["riders"]

march-row = shuttle.row-n(2) march-row["riders"]

Do Now!

How do each of these versions interact with the program directory?

Once we have the cell value (here a Number), we can use it in any other computation, such as

shuttle.row-n(2)["riders"] >= 1000

(which checks whether there were at least 1000 riders in March).

Do Now!

What do you expect would happen if you forgot the quotation marks and instead wrote:

shuttle.row-n(2)[riders]

What would Pyret do and why?

7.3 Functions over Rows

Now that we have the ability to isolate Rows from tables, we can write functions that ask questions about individual rows. We just saw an example of doing a computation over row data, when we checked whether the row for March had more than 1000 riders. What if we wanted to do this comparison for an arbitrary row of this table? Let’s write a function! We’ll call it cleared-1K.

Let’s start with a function header and some examples:

fun cleared-1K(r :: Row) -> Boolean: doc: "determine whether given row has at least 1000 riders" ... where: cleared-1K(shuttle.row-n(2)) is true cleared-1K(shuttle.row-n(3)) is false end

This shows you what examples for Row functions look like, as well as how we use Row as an input type.

To fill in the body of the function, we extract the content of the "riders" cell and compare it to 1000:

fun cleared-1K(r :: Row) -> Boolean: doc: "determine whether given row has at least 1000 riders" r["riders"] >= 1000 where: cleared-1K(shuttle.row-n(2)) is true cleared-1K(shuttle.row-n(3)) is false end

Do Now!

Looking at the examples, both of them share the shuttle.row-n portion. Would it have been better to instead make cleared-1K a function that takes just the row position as input, such as:

fun cleared-1K(row-pos :: Number) -> Boolean: ... where: cleared-1K(2) is true cleared-1K(3) is false end

What are the benefits and limitations to doing this?

In general, the version that takes the Row input is more flexible because it can work with a row from any table that has a column named "riders". We might have another table with more columns of information or different data tables for different years. If we modify cleared-1K to only take the row position as input, that function will have to fix which table it works with. In contrast, our original version leaves the specific table (shuttle) outside the function, which leads to flexibility.

Exercise

Write a function is-winter that takes a Row with a "month" column as input and produces a Boolean indicating whether the month is that row is one of "Jan", "Feb", or "Mar".

Exercise

Write a function low-winter that takes in Row with both "month" and "riders" columns and produces a Boolean indicating whether the row is a winter row with fewer than 850 riders.

Exercise

Practice with the program directory! Take a Row function and one of its where examples, and show how the program directory evolves as you evaluate the example.

7.4 Processing Rows

So far, we have looked at extracting individual rows by their position in the table and computing over them. Extracting rows by position isn’t always convenient: we might have hundreds or thousands of rows, and we might not know where the data we want even is in the table. We would much rather be able to write a small program that identifies the row (or rows!) that meets a specific criterion.

Pyret offers three different notations for processing tables: one uses functions, one uses methods, and one uses a SQL-like notation. This chapter uses the function-based notation. The SQL-like notation and the methods-based notation are shown in the Pyret Documentation. To use the function-based notation, you’ll need to include the file specified in the main narrative.

The rest of this section assumes that you have loaded the functions notation for working with tables, using the following line in your Pyret file:

include shared-gdrive( "cs111-2020.arr", "1imMXJxpNWFCUaawtzIJzPhbDuaLHtuDX")

7.4.1 Finding Rows

Imagine that we wanted to write a program to locate the row that had fewer than 1000 riders from our shuttle table. With what we’ve studied so far, how might we try to write this? We could imagine using a conditional, like follows:

if shuttle.row-n(0)["riders"] < 1000: shuttle.row-n(0) else if shuttle.row-n(1)["riders"] < 1000: shuttle.row-n(1) else if shuttle.row-n(2)["riders"] < 1000: shuttle.row-n(2) else if shuttle.row-n(3)["riders"] < 1000: shuttle.row-n(3) else: ... # not clear what to do here end

Do Now!

Do you like this approach? Why or why not?

There are a couple of reasons why we might not care for this solution. First, if we have thousands of rows, this will be terribly painful to write. Second, there’s a lot of repetition here (only the row positions are changing). Third, it isn’t clear what to do if there aren’t any matching rows. In addition, what happens if there are multiple rows that meet our criterion? In some cases, we might want to be able to identify all of the rows that meet a condition and gather them up someone (such as searching for all data for winter months, or all email messages sent by a particular user).

This conditional is, however, the spirit of what we want to do: go through the rows of the table one at a time, identifying those that match some criterion. We just don’t want to be responsible for manually checking each row. Fortunately for us, Pyret knows how to do that. Pyret knows how many rows are in a given table. Pyret can pull out those rows one position at a time and check a criterion about each one.

We just need to tell Pyret what criterion we want to use.

As before, we can express our criterion as a function that takes a Row and produces a Boolean (a Boolean because our criterion was used as the question part of an if expression in our code sketch). In this case, we want:

fun below-1K(r :: Row) -> Boolean: doc: "determine whether row has fewer than 1000 riders" r["riders"] < 1000 where: below-1K(shuttle.row-n(2)) is false below-1K(shuttle.row-n(3)) is true end

Now, we just need a way to tell Pyret to use this criterion as it searches through the rows. We do this with a function called filter-with which takes two inputs: the table to process and the criterion to check on each row of the table:

filter-with(shuttle, below-1K)

Under the hood, filter-with works roughly like the if statement we outlined above: it takes each row one at a time and calls the given criterion function on it. But what does it do with the results?

If you actually run the above expression, we see that filter-with produces a table containing the matching row, not the row by itself. This behavior is handy if multiple rows match the criterion. For example, try:

filter-with(shuttle, is-winter)

Now we get a table with the two rows corresponding to winter months. If we want to be able to name this table for use in future computations, we can do so with our usual notation for naming values:

winter = filter-with(shuttle, is-winter)

7.4.2 Ordering Rows

Let’s ask a new question: which winter month had the fewest number of riders?. This question requires us to identify a specific row, namely, the winter row with the smallest value in the "riders" column.

Do Now!

Can we do this with filter-with? Why or why not?

Think back to the if expression that motivated filter-with: each row is evaluated independently of the others. Our current question, however, requires comparing across rows. That’s a different operation, so we will need more than filter-with.

Tools for analyzing data (whether programming languages or spreadsheets) provide ways for users to sort rows of a table based on the values in a single column. That would help us here: we could sort the winter rows from smallest to largest value in the "riders" column, then extract the "riders" value from the first row. First, let’s sort the rows:

sort-by(winter, "riders", true)

The sort-by function takes three inputs: the table to sort (winter), the column to sort on ("riders"), and a Boolean to indicate whether we want to sort in increasing order. (Had the third argument been false, the rows would be sorted in decreasing order of the values in the named column.)

In the sorted table, the row with the fewest riders is in the first position. Our original question asked us to lookup the month with the fewest riders. We did this earlier.

Do Now!

Write the code to extract the name of the winter month with the fewest riders.

Here are two ways to write that computation:

sort-by(winter, "riders", true).row-n(0)["month"]

sorted = sort-by(winter, "riders", true) least-row = sorted.row-n(0) least-row["month"]

Do Now!

Which of these two ways do you prefer? Why?

Do Now!

How does each of these programs affect the program directory?

Note that this problem asked us to combine several actions that we’ve already seen on rows: we identify rows from within a table (filter-with), order the rows (sort-by), extract a specific row (row-n), then extract a cell (with square brackets and a column name). This is typical of how we will operate on tables, combining multiple operations to compute a result (much as we did with programs that manipulate images).

7.4.3 Adding New Columns

Sometimes, we want to create a new column whose value is based on those of existing columns. For instance, our table might reflect employee records, and have columns named hourly-wage and hours-worked, representing the corresponding quantities. We would now like to extend this table with a new column to reflect each employee’s total wage. Assume we started with the following table:

employees = table: name, hourly-wage, hours-worked row: "Harley", 15, 40 row: "Obi", 20, 45 row: "Anjali", 18, 39 row: "Miyako", 18, 40 end

The table we want to end up with is:

employees = table: name, hourly-wage, hours-worked, total-wage row: "Harley", 15, 40, 15 * 40 row: "Obi", 20, 45, 20 * 45 row: "Anjali", 18, 39, 18 * 39 row: "Miyako", 18, 40, 18 * 40 end

(with the expressions in the total-wage column computed to their numeric equivalents: we used the expressions here to illustrate what we are trying to do).

Previously, when we have had a computation that we performed multiple times, we created a helper function to do the computation.

Do Now!

Propose a helper function for computing total wages.

Perhaps you came up with something like:

fun compute-wages(wage :: Number, hours :: Number) -> Number: wage * hours end

which we could use as follows:

employees = table: name, hourly-wage, hours-worked, total-wage row: "Harley", 15, 40, compute-wages(15, 40) row: "Obi", 20, 45, compute-wages(20, 45) row: "Anjali", 18, 39, compute-wages(18, 39) row: "Miyako", 18, 40, compute-wages(18, 40) end

This is the right idea, but we can actually have this function do a bit more work for us. The wage and hours values are in cells within the same row. So if we could instead get the current row as an input, we could write:

fun compute-wages(r :: Row) -> Number: r["hourly-wage"] * r["hours-worked"] end employees = table: name, hourly-wage, hours-worked, total-wage row: "Harley", 15, 40, compute-wages(<row0>) row: "Obi", 20, 45, compute-wages(<row1>) row: "Anjali", 18, 39, compute-wages(<row2>) row: "Miyako", 18, 40, compute-wages(<row3>) end

But now, we are writing calls to compute-wages over and over! Adding computed columns is a sufficiently common operation that Pyret provides a table function called build-column for this purpose. We use it by providing the function to use to populate values in the new column as an input:

fun compute-wages(r :: Row) -> Number: doc: "compute total wages based on wage and hours worked" r["hourly-wage"] * r["hours-worked"] end build-column(employees, "total-wage", compute-wages)

This creates a new column, total-wage, whose value in each row is the product of the two named columns in that row. Pyret will put the new column at the right end.

7.4.4 Calculating New Column Values

Sometimes, we just want to calculate new values for an existing column, rather than create an entirely new column. Giving raises to employees is one such example. Assume we wanted to give a 10% raise to all employees making less than 20 an hour. We could write:

fun new-rate(rate :: Number) -> Number: doc: "Raise rates under 20 by 10%" if rate < 20: rate * 1.1 else: rate end where: new-rate(20) is 20 new-rate(10) is 11 new-rate(0) is 0 end fun give-raises(t :: Table) -> Table: doc: "Give a 10% raise to anyone making under 20" transform-column(t, "hourly-wage", new-rate) end

Here, transform-column takes a table, the name of an existing column in the table, and a function to update the value. The updating function takes the current value in the column as input and produces the new value for the column as output.

Do Now!

Run give-raises on the employees table. What wage will show for "Miyako" in the employees table after give-raises completes. Why?

Like all other Pyret Table operations, transform-column produces a new table, leaving the original intact. Editing the original table could be problematic–what if you made a mistake? How would you recover the original table in that case? In general, producing new tables with any modifications, then creating a new name for the updated table once you have the one you want, is a less error-prone way of working with datasets.

7.5 Examples for Table-Producing Functions

How do we write examples for functions that produce tables? Conceptually, the answer is simply "make sure you got the output table that you expected". Logistically, writing examples for table functions seems more painful because writing out an expected output tables is more work than simply writing the output of a function that produces numbers or strings. What can we do to manage that complexity?

Do Now!

How might you write the where block for give-raises?

Here are some ideas for writing the examples practically:

Another tip to keep in mind: when the only your function does is call a built-in function like transform-column it usually suffices to write examples for the function you wrote to compute the new column value. It is only when your code is combining table operations, or doing more complex processing than a single call to a built-in table operation that you really need to present your own examples to a reader of your code.