5From Tables to Lists

Previously [Introduction to Tabular Data] we began to process collective data in the form of tables. Though we saw several powerful operations that let us quickly and easily ask sophisticated questions about our data, they all had two things in commmon. First, all were operations by rows. None of the operations asked questions about an entire column at a time. Second, all the operations not only consumed but also produced tables. However, we already know [Basic Data and Expressions] there are many other kinds of data, and sometimes we will want to compute one of them. We will now see how to achieve both of these things, introducing an important new type of data in the process.

5.1Basic Statistical Questions

There are many more questions we might want to ask of our data. For instance:
• The most-played song in a playlist, which translates to the maximum value in a column of play counts.

• The largest file in a filesystem, which translates to the maximum value in a column of file sizes.

• The shortest person in a table of people, which translates to the smallest value in a column of heights.

• The most-played song in a playlist, which translates to the largest value in a column of play counts.

• The number of songs in a playlist. (This is arguably a question about all the columns combined, not any one specific column, since they all have the same number of entries.)

• All the distinct entries in the play-counts column. (This, naturally, is a question about a specific column, because the number of distinct entries will differ depending on the column.)

• The number of distinct entries in the play-counts column.

• The average in a column of wages.

• Other statistics (the median, mode, standard deviation, etc.) in a column of heights.

Notice the kinds of operations that we are talking about: computing the maximum, minimum, average, median, and other basic statistics.Pyret has several built-in statistics functions in the math and statistics packages.

Do Now!

Think about whether and how you would express these questions with the operations you have already seen.

5.2Extracting a Column from a Table

Hopefully you found select attractive, because it gives us a column in isolation: e.g.,
select play-count from songs end
select height from people end
But in the end we’re still stuck with a column in a table, and none of the other operations let us compute the answers we’re looking for. Therefore, there is no (straightforward) way to express these questions at all, because they require us to be perform a computation looking at the values of a table relative to one another, rather than in isolation.

In principle, we could have a collection of operations on a single column. In some languages that focus solely on tables, such as SQL, this is what you’ll find. However, in Pyret we have many more kinds of data than just columns (as we’ll soon see [REF], we can even create our own!), so it makes sense to leave the gentle cocoon of tables sooner or later. An extracted column is a more basic kind of datum called a list, which can be used to represent other data in programs (as we see in [REF] and elsewhere), without the bother of having to create a table every single time.

Therefore, we introduce one more operation, extract, which takes a column name and gives just the content of that one column:
extract play-count from songs end
extract height from people end
And now we can answer the critical question—what is the difference between select and extractby saying that while select produces a table, extract produces a list.

5.3Understanding Lists

A list has much in common with a single-column table:
• The elements have an order, so it makes sense to talk about the “first”, “second”, “last”—and so on—element of a list.

• All elements of a list are expected to have the same type.

The crucial difference is that a list does not have a “column name”; it is anonymous. That is, by itself a list does not describe what it represents; this interpretation is done by our program.

5.3.1Lists as Anonymous Data

This might sound rather abstract—and it is—but this isn’t actually a new idea in our programming experience. Consider a value like 3 or -1: what is it? It’s the same sort of thing: an anonymous value that does not describe what it represents; the interpretation is done by our program. In one setting 3 may represent an age, in another a play count; in one setting -1 may be a temperature, in another the average of several temperatures. Similarly with a string: Is "project" a noun (an activity that one or more people perform) or a verb (as when we display something on a screen)? Likewise with images and so on. In fact, tables have been the exception so far in having description built into the data rather than being provided by a program!

This genericity is both a virtue and a problem. Because, like other anonymous data, a list does not provide any interpretation of its use, if we are not careful we can accidentally mis-interpret the values. On the other hand, it means we can use the same datum in several different contexts, and one operation can be used in many settings.

Indeed, if we look at the list of questions we asked earlier, we see that there are several common operations—maximum, minimum, average, and so on—that can be asked of a list of values without regard for what the list represents (heights, ages, playcounts). In fact, some are specific to numbers (like average) while some (like maximum) can be asked of any type on which we can perform a comparison (like strings).

5.3.2Creating Literal Lists

We have already seen how we can create lists from a table, using extract. As you might expect, however, we can also create lists directly:
[list: 1, 2, 3]
[list: -1, 5, 2.3, 10]
[list: "a", "b", "c"]
[list: "This", "is", "a", "list", "of", "words"]
Of course, lists are values so we can name them using variables—
shopping-list = [list: "muesli", "fiddleheads"]
pass them to functions (as we will soon see), and so on.

Do Now!

Based on these examples, can you figure out how to create an empty list?

As you might have guessed, it’s [list: ] (the space isn’t necessary, but it’s a useful visual reminder of the void).

5.4Operating on Lists

5.4.1Built-In Operations on Lists

Pyret handily provides a useful set of operations we can already perform on lists. As you might have guessed, we can already compute most of the answers we’ve asked for above. First we need to include some libraries that contain useful functions:
include math
include statistics
We can then access several useful functions:
• max computes the maximum element of a list.

• min computes the minimum element of a list.

• mean computes the average of a list.

• stdev computes the standard deviation of the values in list.

Thus:
pcs = extract play-count from songs end
most-played-count = max(pcs)
least-played-count = min(pcs)

hts = extract height from people end
tallest-height = max(hts)
shortest-height = min(hts)

5.4.2Combining Lists and Tables

Note that the questions we originally asked were slightly different: we didn’t ask for the tallest height but the tallest person, or likewise the most most-played song. Because we’ve stripped the heights and counts of their surrounding context, we can no longer tell which person or song these values correspond to. For that, we have to go back to the table.

Do Now!

Do you see how we can use the values above, like most-played-count or shortest-height, to obtain the corresponding songs or people?

The key is to write a query over the corresponding table that refers to this value. For instance:
tallest-people = sieve people using height:
height = tallest-height
end

most-played-songs = sieve playlist using play-count:
play-count = most-played-count
end
There’s a reason we are careful to always use the plural—people, songsrather than the singular. This is because we cannot be sure there is only one person or one song with this height or play count. That is, there is a single biggest or smallest value in the list, because the value has no other information about it (so the same height coming from two different people, or the same play count coming from two different songs, looks the same in the list). But when put back in the context of the original table, the other values may be different.

In short, our overall answer is computed quite simply:
pcs = extract play-count from songs end
most-played-count = max(pcs)
keep-if playlist using play-count:
play-count = most-played-count
end
and
hts = extract height from people end
tallest-height = max(hts)
keep-if people using height:
height = tallest-height
end

Exercise

Implement all the other statistical questions posed in Basic Statistical Questions.

Until now we’ve only seen how to use built-in functions over lists. Next [Processing Lists], we will study how to create our own functions that process lists. Once we learn that, these list processing functions will remain powerful but will no longer seem quite so magical, because we’ll be able to build them for ourselves!