4 Introduction to Tabular Data
An email inbox is a list of messages. For each message, your inbox stores a bunch of information: its sender, the subject line, the conversation it’s part of, the body, and quite a bit more.
A music playlist. For each song, your music player maintains a bunch of information: its name, the singer, its length, its genre, and so on.
A filesystem folder or directory. For each file, your filesystem records a name, a modification date, size, and other information.
Do Now!
Can you come up with more examples?
Responses to a party invitation.
A gradebook.
A calendar agenda.
They consists of rows and columns. For instance, each song or email message or file is a row. Each of their characteristics—
the song title, the message subject, the filename— is a column. Each row has the same columns as the other rows, in the same order.
A given column has the same type, but different columns can have different types. For instance, an email message has a sender’s name, which is a string; a subject line, which is a string; a sent date, which is a date; whether it’s been read, which is a Boolean; and so on.
The rows are usually in some particular order. For instance, the emails are ordered by which was most recently sent.
Exercise
Find the characteristics of tabular data in the other examples described above, as well as in the ones you described.
We will now learn how to program with tables and to think about decomposing tasks involving them.You can also look up the full Pyret documentation for table operations.
4.1 Creating Tabular Data
table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end
Exercise
Change different parts of the above example—
e.g., remove a necessary value from a row, add an extraneous one, remove a comma, add an extra comma, leave an extra comma at the end of a row— and see what errors you get.
check: table: name, age row: "Alice", 30 row: "Bob", 40 row: "Carol", 25 end is-not table: age, name row: 30, "Alice" row: 40, "Bob" row: 25, "Carol" end end
create the sheet on your own,
create a sheet collaboratively with friends,
find data on the Web that you can import into a sheet,
create a Google Form that you get others to fill out, and obtain a sheet out of their responses
4.2 Processing Rows
Let’s now learn how we can actually process a table. Pyret offers a variety of built-in operations that make it quite easy to perform interesting computations over tables. In addition, as we will see later [REF], if we don’t find these sufficient, we can write our own. For now, we’ll focus on the operations Pyret provides.
Which emails were sent by a particular user?
Which songs were sung by a particular artist?
Which are the most frequently played songs in a playlist?
Which are the least frequently played songs in a playlist?
4.2.1 Keeping
sieve email using sender: sender == 'Matthias Felleisen' end
sieve playlist using artist: (artist == 'Deep Purple') or (artist == 'Van Halen') end
4.2.2 Ordering
order playlist: play-count ascending end
Note that what goes between the : and end is not an expression. Therefore, we cannot write arbitrary code here. We can only name columns and indicate which way they should be ordered.
4.2.3 Combining Keeping and Ordering
Of the emails from a particular person, which is the oldest?
Of the songs by a particular artist, which have we played the least often?
Do Now!
Take a moment to think about how you would write these with what you have seen so far.
mf-emails = sieve email using sender: sender == 'Matthias Felleisen' end order mf-emails: sent-date ascending end
Exercise
Write the second example as a composition of keep and order operations on a playlist table.
4.2.4 Extending
extend employees using hourly-wage, hours-worked: total-wage: hourly-wage * hours-worked end
ext-email = extend email using subject: subject-length: string-length(subject) end order ext-email: subject-length descending end
4.2.5 Transforming, Cleansing, and Normalizing
There are times when a table is “almost right”, but requires a little adjusting. For instance, we might have a table of customer requests for a free sample, and want to limit each customer to at most a certain number. We might get temperature readings from different countries in different formats, and want to convert them all to one single format.Because unit errors can be dangerous! We might have a gradebook where different graders have used different levels of precision, and want to standardize all of them to have the same level of precision.
transform orders using count: count: num-min(count, 3) end
transform gradebook using total-grade: total-grade: num-round(total-grade) end
transform weather using temp, unit: temp: if unit == "F": fahrenheit-to-celsius(temp) else: temp end unit: if unit == "F": "C" else: unit end end
Do Now!
In this example, why do we also transform unit?
4.2.6 Selecting
select name, total-grade from gradebook end
ss = select artist, song from playlist end order ss: artist ascending end
4.2.7 Summary of Row-Wise Table Operations
We’ve seen a lot in a short span. Specifically, we have seen several operations that consume a table and produce a new one according to some criterion. It’s worth summarizing the impact each of them has in terms of key table properties (where “-” means the entry is left unchanged):
Operation |
| Cell contents |
| Row order |
| Number of rows |
| Column order |
| Number of columns |
Keeping |
| - |
| - |
| reduced |
| - |
| - |
Ordering |
| - |
| changed |
| - |
| - |
| - |
Extending |
| existing unchanged, new computed |
| - |
| - |
| - |
| augmented |
Transforming |
| altered |
| - |
| - |
| - |
| - |
Selecting |
| - |
| - |
| - |
| changed |
| reduced |
The italicized entries reflect how the new table may differ from the old. Note that an entry like “reduced” or “altered” should be read as potentially reduced or altered; depending on the specific operation and the content of the table, there may be no change at all. (For instance, if a table is already sorted according to the criterion given in an order expression, the row order will not change.) However, in general one should expect the kind of change described in the above grid.
Observe that both dimensions of this grid provide interesting information. Unsurprisingly, each row has at least some kind of impact on a table (otherwise the operation would be useless and would not exist). Likewise, each column also has at least one way of impacting it. Furthermore, observe that most entries leave the table unchanged: that means each operation has limited impact on the table, careful to not overstep the bounds of its mandate.
On the one hand, the decision to limit the impact of each operation means that to achieve complex tasks, we may have to compose several operations together. We have already seen examples of this earlier this chapter. However, there is also a much more subtle consequence: it also means that to achieve complex tasks, we can compose several operations and get exactly what we want. If we had fewer operations that each did more, then composing them might have various undesired or (worse) unintended consequences, making it very difficult for us to obtain exactly the answer we want. Instead, the operations above follow the principle of orthogonality: no operation shadows what any other operation does, so they can be composed freely.
As a result of having these operations, we can think of tables also algebrically. Concretely, when given a problem, we should again begin with concrete examples of what we’re starting with and where we want to end. Then we can ask ourselves questions like, “Does the number of columns stay the same, grow, or shrink?”, “Does the number of rows stay the same or shrink?”, and so on. The grid above now provides us a toolkit by which we can start to decompose the task into individual operations. Of course, we still have to think: the order of operations matters, and sometimes we have to perform an operation mutiple times. Still, this grid is a useful guide to hint us towards the operations that might help solve our problem.