Previously [Introduction to Tabular Data] we began to process collective data in the form of tables. Though we saw several powerful operations that let us quickly and easily ask sophisticated questions about our data, they all had two things in common. First, all were operations by rows. None of the operations asked questions about an entire column at a time. Second, all the operations not only consumed but also produced tables. However, we already know [Getting Started] there are many other kinds of data, and sometimes we will want to compute one of them. We will now see how to achieve both of these things, introducing an important new type of data in the process.
The most-played song in a playlist, which translates to the maximum value in a column of play counts.
The largest file in a filesystem, which translates to the maximum value in a column of file sizes.
The shortest person in a table of people, which translates to the smallest value in a column of heights.
The number of songs in a playlist. (This is arguably a question about all the columns combined, not any one specific column, since they all have the same number of entries.)
All the distinct entries in the play-counts column. (This, naturally, is a question about a specific column, because the number of distinct entries will differ depending on the column.)
The number of distinct entries in the play-counts column.
The average in a column of wages.
Other statistics (the median, mode, standard deviation, etc.) in a column of heights.
Think about whether and how you would express these questions with the operations you have already seen.
songs = table: title, artist, play-count
  row: "Adore You", "Harry Styles", 0
  row: "Blinding Lights", "The Weeknd", 5
  row: "Memories", "Maroon 5", 97
  row: "The Box", "Roddy Ricch", 25
end

select play-count from songs end
In principle, we could have a collection of operations on a single column. In some languages that focus solely on tables, such as SQL, this is what you’ll find. However, in Pyret we have many more kinds of data than just columns (as we’ll soon see [Introduction to Structured Data], we can even create our own!), so it makes sense to leave the gentle cocoon of tables sooner or later. An extracted column is a more basic kind of datum called a list, which can be used to represent data in programs without the bother of having to create a table every single time.
extract play-count from songs end
The elements have an order, so it makes sense to talk about the “first”, “second”, “last” (and so on) element of a list.
All elements of a list are expected to have the same type.
This might sound rather abstract—
This genericity is both a virtue and a problem. Like other anonymous data, a list does not provide any interpretation of its use, so if we are not careful we can accidentally misinterpret the values. On the other hand, it means we can use the same datum in several different contexts, and one operation can be used in many settings.
Indeed, if we look at the list of questions we asked earlier, we see
that there are several common operations—
[list: 1, 2, 3]
[list: -1, 5, 2.3, 10]
[list: "a", "b", "c"]
[list: "This", "is", "a", "list", "of", "words"]
shopping-list = [list: "muesli", "fiddleheads"]
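Because the elements of a list are ordered, we can ask for them by position. The following is a minimal sketch using the built-in list members .first, .get, .last, and .length; the names first-item, second-item, last-item, and item-count are chosen only for illustration.

```pyret
shopping-list = [list: "muesli", "fiddleheads"]

# Positions are counted from 0, so .get(1) is the second element.
first-item = shopping-list.first      # "muesli"
second-item = shopping-list.get(1)    # "fiddleheads"
last-item = shopping-list.last()      # "fiddleheads"
item-count = shopping-list.length()   # 2
```

Note that nothing here depends on the list holding groceries: the same operations work on a list of play counts or heights.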
Based on these examples, can you figure out how to create an empty list?
As you might have guessed, it’s [list: ] (the space isn’t necessary, but it’s a useful visual reminder of the void).
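As a quick check, and assuming the built-in is-empty predicate and the .length method, we can confirm that this list really has nothing in it. The name no-items is hypothetical.

```pyret
no-items = [list: ]

# An empty list has no elements, so its length is zero.
nothing-there = is-empty(no-items)   # true
how-many = no-items.length()         # 0
```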
include math
include statistics
max computes the maximum element of a list.
min computes the minimum element of a list.
mean computes the average of a list.
stdev computes the standard deviation of the values in a list.
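To see these functions at work before returning to tables, here is a small sketch that applies them to a list written out by hand. The list sample-counts simply reuses the play counts from the songs table, and the result names are made up for this example.

```pyret
include math
include statistics

sample-counts = [list: 0, 5, 97, 25]

largest = max(sample-counts)    # 97
smallest = min(sample-counts)   # 0
average = mean(sample-counts)   # (0 + 5 + 97 + 25) / 4, which is 31.75
spread = stdev(sample-counts)   # the standard deviation of the four values
```

None of these functions knows or cares that the numbers are play counts; they would work just as well on heights or wages.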
pcs = extract play-count from songs end
most-played-count = max(pcs)
least-played-count = min(pcs)
hts = extract height from people end
tallest-height = max(hts)
shortest-height = min(hts)
Design a table that has three people and would produce 78 and 42 for the tallest-height and shortest-height with the example code above.
Note that the questions we originally asked were slightly different: we didn’t ask for the tallest height but the tallest person, or likewise the most-played song. Because we’ve stripped the heights and counts of their surrounding context, we can no longer tell which person or song these values correspond to. For that, we have to go back to the table.
Do you see how we can use the values above, like most-played-count or shortest-height, to obtain the corresponding songs or people?
most-played-songs =
  sieve songs using play-count:
    play-count == most-played-count
  end

tallest-people =
  sieve people using height:
    height == tallest-height
  end
pcs = extract play-count from songs end
most-played-count = max(pcs)
sieve songs using play-count:
  play-count == most-played-count
end
hts = extract height from people end
tallest-height = max(hts)
sieve people using height:
  height == tallest-height
end
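Putting the pieces together, the following is a self-contained sketch of the song version, with the table definition and the include line in one place. It assumes the songs table from earlier, with the title in the first column and the artist in the second; the name most-played-titles is introduced only for this example.

```pyret
include math

songs = table: title, artist, play-count
  row: "Adore You", "Harry Styles", 0
  row: "Blinding Lights", "The Weeknd", 5
  row: "Memories", "Maroon 5", 97
  row: "The Box", "Roddy Ricch", 25
end

# Step 1: leave the table, keeping only the play counts as a list.
pcs = extract play-count from songs end

# Step 2: compute a single number from that list.
most-played-count = max(pcs)

# Step 3: go back to the table to recover the matching rows.
most-played-songs =
  sieve songs using play-count:
    play-count == most-played-count
  end

# The titles of the rows that survived the sieve.
most-played-titles = extract title from most-played-songs end
```

The result of the sieve is a table rather than a single row, because several songs could be tied for the highest play count.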
Implement all the other statistical questions posed in Basic Statistical Questions.
Until now we’ve only seen how to use built-in functions over lists. Next [Processing Lists], we will study how to create our own functions that process lists. Once we learn that, these list processing functions will remain powerful but will no longer seem quite so magical, because we’ll be able to build them for ourselves!