In data analysis, we often work with large datasets, some of which were collected by someone else. Datasets don’t necessarily come in a form that we can work with. We might need the raw data pulled apart or condensed to courser granularity. Some data might be missing or entered incorrectly. On top of that, we have to plan for long-term maintenance of our datasets or analysis programs. Finally, we typically want to use visualizations to either communicate our data or to check for issues with our data.
As a concrete example, assume that you are doing data analysis and support for a company that manages ticket sales for events. People purchase tickets through an online form. The form software creates a spreadsheet with all the entered data, which is what you have to work with. Here’s a screenshot of a sample spreadsheet:
Take a look at the table. What do you notice that might affect using the data in an analysis? Or for the operations for managing an event?
Some issues jump out quickly: the three in the "Num Tickets" column, differences in capitalization in the "Discount Code" column, and the use of each of "none" and blank spaces in the the "Discount Code" column (you may have spotted additional issues). Before we do any analysis with this dataset, we need to clean it up so that our analysis will be reliable. In addition, sometimes our data is clean, but it needs to be adjusted or prepared to fit the questions we want to ask. This chapter looks at both steps, and the programming techniques that are helpful for them.
If you want to load a csv file, first import it into a Google Sheet, then load it from the Google Sheet into Pyret. The first step to working with an outside data source is to load it into your programming and analysis environment. In Pyret, we do this using the load-table command, which loads tables from Google Sheets.
include gdrive-sheets ssid = "1DKngiBfI2cGTVEazFEyXf7H4mhl8IU5yv2TfZWv6Rc8" event-data = load-table: name, email, tickcount, discount, delivery source: load-spreadsheet(ssid).sheet-by-name("Orig Data", true) end
ssid is the identifier of the Google Sheet we want to load (the identifier is the long sequence of letters and numbers in the Google Sheet URL).
load-table says to create a Pyret table via loading. The sequence of names following load-table is used for the column headers in the Pyret version of the table. These do NOT have to match the names used in the Sheets version of the table.
source tells Pyret which sheet to load. The load-spreadsheet operation takes the Google Sheet identifier (here, ssid), as well as the name of the individual worksheet (or tab) as named within the Google Sheet (here, "Orig Data". The final boolean indicates whether there is a header row in the table (true means there is a header row).
When we try to run this code, Pyret complains about the three in the Num Tickets column: it was expecting a number, but instead found a string. Pyret expects all columns to hold values of the same type. When loading a table from file, Pyret bases the type of each column on the corresponding value in the first row of the table.
This is an example of a data error that we have to fix in the source file, rather than by using programs within Pyret. Not all languages will reject programs on loading. Languages embody philosophies of what programmers should expect from them. Some will try to make whatever the programmer provided work, while others will ask the programmer to fix issues upfront. Pyret tends more towards the latter philsophy, while relaxing it in some places (such as making types optional). Within the source Google Sheet for this chapter, there is a separate worksheet/tab named "Data" in which the three has been replaced with a number. If we use "Data" instead of "Orig Data" in the above load-spreadsheet command, the event table loads into Pyret.
Why might we have created a separate worksheet with the corrected data, rather than just correct the original sheet?
When we create tables manually in Pyret, we have to provide a value for each cell – there’s no way to "skip" a cell. When we create tables in a spreadsheet program (such as Excel, Google Sheets, or something similar), it is possible to leave cells completely empty. What happens when we load a table with empty cells into Pyret?
event-data = load-table: name, email, tickcount, discount, delivery source: load-spreadsheet(ssid).sheet-by-name("Data", true) end
The raw event-data file has a blank in the discount column. If we load the table and look at how it reads in, we find something new in that column:
Note that those cells that had discount codes in them now have an odd-looking notation like some("student"), while the cells that were empty contain none, but none isn’t a string. What’s going on?
Pyret supports a special type of data called Option. As the name suggests, option is for data that may or may not be present. none is the value that stands for "the data is missing". If there is data present, it appears wrapped in some.
Look at the discount value for Ernie’s row: it reads some("none"). What does this mean? How is this different from none (as in Sam’s row)?
In Pyret, the right way to address this is to tell Pyret how to handle missing values for each column, so that the data is as you expect after you read it in. We do this with an additional aspect of load-table called sanitizers. Here’s how we modify the code:
include data-source # to get the sanitizers event-data = load-table: name, email, tickcount, discount, delivery source: load-spreadsheet(ssid).sheet-by-name("Data", true) sanitize name using string-sanitizer sanitize email using string-sanitizer sanitize tickcount using num-sanitizer sanitize discount using string-sanitizer sanitize delivery using string-sanitizer end
Each of the sanitize lines tells Pyret what to do in the case of missing data in the respective column. string-sanitizer says to load missing data as an empty string (""). num-sanitizer says to load missing data as zero (0). The sanitizers also handle simple data conversions. If the string-sanitizer were applied to a column with a number (like 3), the sanitizer would convert that number to a string (like "3").
Wait – wouldn’t putting types on the columns (like discount :: String) in the load-table solve this problem? No, because the type isn’t enough to know which value should be the default! In some situations, you might want the default value to be something other than an empty string or 0. Sanitizers actually let you tailor this for yourself (a sanitizer is just a Pyret function: see the Pyret documentation for details on sanitizer inputs).
Rule of thumb: when you load a table, use a sanitizer to guard against errors if your code encounters missing data.
Next, let’s look at the "Discount Code" column. Our goal is to be able to accurately answer the question "How many orders were placing under each discount code". We would like to have the answer summarized in a table, where one column names the discount code and another gives a count of the rows that used that code.
Examples first! What table do we want from this computation on the fragment of table that we gave you?
How do we get to this table? How do we figure this out if we aren’t sure?
Start by looking in the tables documentation for any library functions that might help with this task. In the case of Pyret, we find:
# count(tab :: Table, colname :: String) -> Table # Produces a table that summarizes how many rows have # each value in the named column.
This sounds useful, as long as every column has a value in the "Discount code" column, and that the only values in the column are those in our desired output table. What do we need to do to achieve this?
Get "none" to appear in every cell that currently lacks a value
Convert all the codes that aren’t "none" to upper case
We can capture these together in a function that takes in and produces a string:
fun cell-to-discount-code(str :: String) -> String: doc: ```uppercase all strings other than none, convert blank cells to contain none``` if (str == "") or (str == "none"): "none" else: string-to-upper(str) end where: cell-to-discount-code("") is "none" cell-to-discount-code("none") is "none" cell-to-discount-code("birthday") is "BIRTHDAY" cell-to-discount-code("Birthday") is "BIRTHDAY" end
Assess the examples included with cell-to-discount-code. Is this a good set of examples, or are any key ones missing?
cell-to-discount-code("NoNe") is "none"
Why did the "NoNe" case fail?
fun cell-to-discount-code(str :: String) -> String: doc: ```uppercase all strings other than none, convert blank cells to contain none``` if (str == "") or (string-to-lower(str) == "none"): "none" else: string-to-upper(str) end where: cell-to-discount-code("") is "none" cell-to-discount-code("none") is "none" cell-to-discount-code("NoNe") is "none" cell-to-discount-code("birthday") is "BIRTHDAY" cell-to-discount-code("Birthday") is "BIRTHDAY" end
Try it yourself: normalize the "delivery" column so that all "yes" values are converted to "email".
Now that we’ve cleaned up the codes, we can proceed to using the "count" function to extract our summary table:
This produces the following table:
What’s with that first row, with the discount code " "? Where might that have come from?
Maybe you didn’t notice this before (or wouldn’t have noticed it within a larger table), but there must have been a cell of the source data with a string of blanks, rather than missing content. How do we approach normalization to avoid missing cases like this?
As the previous example showed, we need a way to think through potential normalizations systematically. Our initial discussion of writing examples gives an idea of how to do this. One of the guidelines there says to think about the domain of the inputs, and ways that inputs might vary. If we apply that in the context of loaded datasets, we should think about how the original data was collected.
Based on what you know about websites, where might the event code contents come from? How might they have been entered? What do thses tell you about different plausible mistakes in the data?
via a drop-down menu
in a text-entry box
A text-entry box means that any sort of typical human typing error could show up in your data: swapped letters, missing letters, leading spaces, capitalization, etc. You could also get data where someone just typed the wrong thing (or something random, just to see what your form would do).
Which of swapped letters, missing errors, and random text do you think a program can correct for automatically?
But really, the moral of this is to just use drop-downs or other means to prevent incorrect data at the source whenever possible.
As you get more experience with programming, you will also learn to anticipate certain kinds of errors. Issues such as cells that appear empty will become second nature once you’ve processed enough tables that have them, for example. Needing to anticipate data errors is one reason why good data scientists have to understand the domain that they are working in.
The takeaway from this is how we talked through what to expect. We thought about where the data came from, and what errors would be plausible in that situation. Having a clear error model in mind will help you develop more robust programs. In fact, such adversarial thinking is a core skill of working in security, but now we’re getting ahead of ourselves.
In spreadsheets, cells that appear empty sometimes have actual content, in the form of strings made up of spaces: both "" and " " appear the same when we look at a spreadsheet, but they are actually different values computationally.
How would you modify cell-to-discount-code so that strings containing only spaces were also converted to "none"? (Hint: look for string-replace in the strings library.)
Sometimes, we also look for errors by writing functions to check whether a table contains unexpected values. Let’s consider the "email" column: that’s a place where we should be able to write a program to flag any rows with invalid email addresses. What makes for a valid email address? Let’s consider two rules:
Valid email addresses should contain an @ sign
Valid email addresses should end in one of ".com", ".edu" or ".org"
Write a function is-email that takes a string and returns a boolean indicating whether the string satisfies the above three rules for being valid email addresses.
Assuming we had such a function, a routine filter-with could then produce a table identifying all rows that need to have their email addresses corrected. The point here is that programs are often helpful for finding data that needs correcting, even if a program can’t be written to perform the fixing.
Before we move on, it’s worth stepping back to reflect on our process for producing the discount-summary table. We
Developed a concrete example of what output we wanted
Checked the documentation for built-in functions that seemed helpful
Articulated a series of steps that would prepare our data to work with the built-in function that we found
Wrote expressions and code for each step
Put all of the steps together to get our final function
Sometimes, the data we have is clean (in that we’ve normalized the data and dealt with errors), but it still isn’t in a format that we can use for the analysis that we want to run. For example, what if we want to look at the distribution of small, medium, and large ticket orders? In our current table, we have the number of tickets in an order, but not an explicit label on the scale of that order. If we wanted to produce some sort of chart showing our order scales, we will need to make those labels explicit.
The act of reducing one set of values (such as the ticketcounts) into a smaller set of categories (such as small/medium/large for orders, or morning/afternoon/etc for timestamps) is known as binning. The bins are the categories. To put rows into bins, we create a function to compute the bin for a raw data value, then create a column for the new bin labels.
Here’s an example of creating bins for the scale of the ticket orders:
fun order-scale-label(r :: Row) -> String: doc: "categorize the number of tickets as small, medium, large" numtickets = r["tickcount"] if numtickets >= 10: "large" else if numtickets >= 5: "medium" else: "small" end end order-bin-data = build-column(cleaned-event-data, "order-scale", order-scale-label)
At this point, we have worked with several versions of the events table:
The original dataset that we tried to load
The new sheet of the dataset with manual corrections
The version with the discount codes normalized
Another version that normalized the delivery mode
The version extended with the order-scale column
Usually, we keep both the original raw source datasheet, as well as the copy with our manual corrections. Why? In case we ever have to look at the original data again, either to identify kinds of errors that people were making or to apply different fixes.
For similar reasons, we want to keep the cleaned (normalized) data separate from the version that we initially loaded. Fortunately, Pyret helps with this since it creates new tables, rather than modify the prior ones. If we have to normalize multiple columns, however, do we really need a new name for every intermediate table?
As a general rule, we usually maintain separate names for the initially-loaded table, the cleaned table, and for significant variations for analysis purposes. In our code, this might mean having names:
event-data = ... # the loaded table cleaned-event-data = transform-column( transform-column(event-data, "discount", cell-to-discount-code), "delivery", yes-to-email) order-bin-data = build-column( cleaned-event-data, "order-scale", order-scale-label)
Now that our data is cleaned and prepared, we are ready to analyze it. What might we want to know? Perhaps we want to know which discount code has been used most often. Maybe we want to know whether the time when a purchase was made correlates with how many tickets people buy. There’s a host of different kinds of visualizations and plots that people use to summarize data.
Which plot type to use depends on both the question and the data at hand. The nature of variables in a dataset helps determine relevant plots or statistical operations. An attribute or variable in a dataset (i.e., a single column of a table) can be classified as one of several different kinds, including:
quantitative: a variable whose values are numeric and can be ordered with a consistent interval between values. They are meaningful to use in computations.
categorical: a variable with a fixed set of values. The values may have an order, but there are no meaningful computational operations between the values other than ordering. Such variables usually correspond to characteristics of your samples.
Which kind of variable are last names? Grades in courses? Zipcodes?
Common plots and the kinds of variables they require include:
Scatterplots show relationships between two quantitative variables, with one variable on each axis of a 2D chart.
Frequency Bar charts show the frequency of each categorical value within a column of a dataset.
Histograms segment quantitative data into equal-size intervals, showing the distribution of values across each interval.
Pie charts show the proportion of cells in a column across the categorical values in a dataset.
Map each of the following questions to a chart type, based on the kinds of variables involved in the question:
Which discount code has been used most often?
Is there a relationship between the number of tickets purchased in one order and the time of purchase?
How many orders have been made for each delivery option?
For example, we might use a frequency-bar-chart to answer the third question. Based on the Table documentation, we would generate this using the following code (with similar style for the other kinds of plots):
Which yields the following chart (assuming we had not actually normalized the contents of the "delivery" column):
Whoa – where did that extra "email" column come from? If you look closely, you’ll spot the error: in the row for "Alvina", there’s a typo ("emall" with an l instead of an i) in the discount column (drop-down menus, anyone?).
The lesson here is that plots and visualizations are valuable not only in the analysis phase, but also early on, when we are trying to sanity check that our data is clean and ready to use. Good data scientists never trust a dataset without first making sure that the values make sense. In larger datasets, manually inspecting all of the data is often infeasible. But creating some plots or other summaries of the data is also useful for identifying errors.
This chapter has given you a high-level overview of how to use coding for managing and processing data. When doing any data analysis, a good data practitioner undergoes several steps:
Think about the data in each column: what are plausible values in the column, and what kinds of errors might be in that column based on what you know about the data collection methods?
Check the data for errors, using a combination of manual inspection of the table, plots, and filter-with expressions that check for unexpected values. Normalize or correct the data, either at the source (if you control that) or via small programs.
Store the normalized/cleaned data table, either as a name in your program, or by saving it back out to a new file. Leave the raw data intact (in case you need to refer to the original later).
Prepare the data based on the questions you want to ask about it: compute new columns, bin existing columns, or combine data from across tables. You can either finish all preparations and name the final table, or you can make separate preparations for each question, naming the per-question tables.
At last, perform your analysis, using the statistical methods, visualizations, and interpretations that make sense for the question and kinds of variables involved. When you report out on the data, always store notes about the file that holds your analysis code, and which parts of the file were used to generate each graph or interpretation in your report.
There’s a lot more to managing data and performing analysis than this book can cover. There are entire books, degrees, and careers in each of the management of data and its analysis. And that’s before we get farther into our explorations of data and computing! Onward!