7 Sets Appeal
Your Web browser records which Web pages you’ve visited, and some Web sites use this information to color visited links differently than ones you haven’t seen. When a vote is taken, the vote recorder cares that everyone’s vote is counted (and nobody has voted twice) but usually ignores the order in which they voted. When you use a search engine, you implicitly get a guarantee that each entry will be distinct but there is no guarantee of order: indeed, two subsequent searches may yield the “same” answer (i.e., the same set of hits) but in a different order.
How do we represent these kinds of information? Until now, we’ve had
a clear answer: lists. But explicit in the definition of a list—
check:
[1, 2, 3] is [3, 2, 1]
end
A set is different from a list: it is unordered, and it ignores duplicates. Because a set has different purposes, it also presents a different interface than does a list. Like lists, sets can contain any kind of element, but all of its contents must be of the same kind. For simplicity, for now we will focus on just sets of numbers.
mt-set :: Set |
|
is-in :: (T, Set<T> -> Bool) |
|
insert :: (T, Set<T> -> Set<T>) |
insert-many :: (List<T>, Set<T> -> Set<T>) |
|
union :: (Set<T>, Set<T> -> Set<T>) |
|
size :: (Set<T> -> Number) |
|
to-list :: (Set<T> -> List<T>) |
What does it mean to “ignore” duplicates? Define this more precisely in terms of how the set operations behave in the face of duplicate values.
7.1 Representing Sets by Lists
As a starting point, let’s consider the implementation of sets using lists as the underlying representation. After all, a set appears to merely be a list wherein we ignore the order of elements.
7.1.1 Representation Choices
mt-set = empty
fun size(s): s.length();
- There is a subtle difference between lists and sets. The list
[1, 1]
is not the same as[1]
because the first list has length two whereas the second has length one. Treated as a set, however, the two are the same: they both have size one. Thus, our implementation of size above is incorrect if we don’t take into account duplicates (either during insertion or while computing the size). We might falsely make assumptions about the order in which elements are retrieved from the set due to the ordering guaranteed provided by the underlying list representation. This might hide bugs that we don’t discover until we change the representation.
We might have chosen a set representation because we didn’t need to care about order, and expected lots of duplicate items. A list representation might store all the duplicates, resulting in significantly more memory use (and slower programs) than we expected.
insert = link
7.1.2 Time Complexity
If we don’t store duplicates, then size is simply length, which takes time linear in \(k\). Similarly, check only needs to traverse the list once to determine whether or not an element is present, which also takes time linear in \(k\). But insert needs to check whether an element is already present, which takes time linear in \(k\), followed by at most a constant-time operation (link).
If we do store duplicates, then insert is constant time: it simply links on the new element without regard to whether it already is in the set representation. check traverses the list once, but the number of elements it needs to visit could be significantly greater than \(k\), depending on how many duplicates have been added. Finally, size needs to check whether or not each element is duplicated before counting it.
What is the time complexity of size if the list has duplicates?
fun size(s):
cases (List) s:
| empty => 0
| link(f, r) =>
if r.member(f):
size(r)
else:
size(r) + 1;
end
end
Let’s now compute the complexity of the body of the function, assuming the number of distinct elements in s is \(k\) but the actual number of elements in s is \(d\), where \(d \geq k\). To compute the time to run size on \(d\) elements, \(T(d)\), we should determine the number of operations in each question and answer. The first question has \(O(1)\) operations, and the first answer \(O(1)\). The second question also has \(O(1)\) operations. Its answer is a conditional, whose first question (r.member(f) needs to traverse the entire list, and hence has \(O(d)\) operations. If it succeeds, we recur on something of size \(T(d-1)\); else we do the same but perform \(O(1)\) more operations. Thus \(T(0)\) is a constant, while the recurrence (in big-Oh terms) is \[T(d) = d + T(d-1)\] Thus \(T \in O([d \rightarrow d^2])\). Note that this is quadratic in the number of elements in the list, which may be much bigger than the size of the set.
7.1.3 Choosing Between Representations
|
| With Duplicates |
|
| Without Duplicates |
| ||
|
| insert |
| is-in |
| insert |
| is-in |
Size of Set |
| constant |
| linear |
| linear |
| linear |
Size of List |
| constant |
| linear |
| linear |
| linear |
Which representation we choose is a matter of how much duplication we expect. If there won’t be many duplicates, then the version that stores duplicates pays a small extra price in return for some faster operations.
Which representation we choose is also a matter of how often we expect each operation to be performed. The representation with duplication is “in the middle”: everything is roughly equally expensive (in the worst case). With duplicates is “at the extremes”: very cheap insertion, potentially very expensive membership. But if we will mostly only insert without checking membership, and especially if we know membership checking will only occur in situations where we’re willing to wait, then permitting duplicates may in fact be the smart choice. (When might we ever be in such a situation? Suppose your set represents a backup data structure; then we add lots of data but very rarely—
indeed, only in case of some catastrophe— ever need to look for things in it.) Another way to cast these insights is that our form of analysis is too weak. In situations where the complexity depends so heavily on a particular sequence of operations, big-Oh is too loose and we should instead study the complexity of specific sequences of operations. We will address precisely this question later (Halloween Analysis).
Moreover, there is no reason a program should use only one representation. It could well begin with one representation, then switch to another as it better understands its workload. The only thing it would need to do to switch is to convert all existing data between the representations.
How might this play out above? Observe that data conversion is very
cheap in one direction: since every list without duplicates is
automatically also a list with (potential) duplicates, converting in
that direction is trivial (the representation stays unchanged, only
its interpretation changes). The other direction is harder: we have to
filter duplicates (which takes time quadratic in the number of
elements in the list). Thus, a program can make an initial guess about
its workload and pick a representation accordingly, but maintain
statistics as it runs and, when it finds its assumption is wrong,
switch representations—
7.1.4 Other Operations
Implement the remaining operations catalogued above (<set-operations>) under each list representation.
Implement the operation
remove :: (Set<T>, T -> Set<T>)
under each list representation. What difference do you see?
Suppose you’re asked to extend sets with these operations, as the set analog of first and rest:
one :: (Set<T> -> T)
others :: (Set<T> -> T)
You should refuse to do so! Do you see why?
With lists the “first” element is well-defined, whereas sets are defined to have no ordering. Indeed, just to make sure users of your sets don’t accidentally assume anything about your implementation (e.g., if you implement one using first, they may notice that one always returns the element most recently added to the list), you really ought to return a random element of the set on each invocation.
Unfortunately, returning a random element means the above interface is
unusable. Suppose s is bound to a set containing 1,
2, and 3. Say the first time one(s) is invoked
it returns 2, and the second time 1. (This already
means one is not a function—
Why is it unreasonable for one(s) to produce the same result as one(others(s))?
Suppose you wanted to extend sets with a subset operation that partitioned the set according to some condition. What would its type be? See [REF join lists] for a similar operation.
7.2 Making Sets Grow on Trees
Let’s start by noting that it seems better, if at all possible, to avoid storing duplicates. Duplicates are only problematic during insertion due to the need for a membership test. But if we can make membership testing cheap, then we would be better off using it to check for duplicates and storing only one instance of each value (which also saves us space). Thus, let’s try to improve the time complexity of membership testing (and, hopefully, of other operations too).
It seems clear that with a (duplicate-free) list representation of a
set, we cannot really beat linear time for membership checking. This
is because at each step, we can eliminate only one element from
contention which in the worst case requires a linear amount of work to
examine the whole set. Instead, we need to eliminate many more
elements with each comparison—
In our handy set of recurrences (Solving Recurrences), one stands out: \(T(k) = T(k/2) + c\). It says that if, with a constant amount of work we can eliminate half the input, we can perform membership checking in logarithmic time. This will be our goal.
Before we proceed, it’s worth putting logarithmic growth in
perspective. Asymptotically, logarithmic is obviously not as nice as
constant. However, logarithmic growth is very pleasant because it
grows so slowly. For instance, if an input doubles from size \(k\) to
\(2k\), its logarithm—
7.2.1 Converting Values to Ordered Values
We have actually just made an extremely subtle assumption. When we check one element for membership and eliminate it, we have eliminated only one element. To eliminate more than one element, we need one element to “speak for” several. That is, eliminating that one value needs to have safely eliminated several others as well without their having to be consulted. In particular, then, we can no longer compare for mere equality, which compares one set element against another element; we need a comparison that compares against an element against a set of elements.
To do this, we have to convert an arbitrary datum into a datatype that
permits such comparison. This is known as hashing.
A hash function consumes an arbitrary value and produces a comparable
representation of it (its hash)—
Let us now consider how one can compute hashes. If the input datatype is a number, it can serve as its own hash. Comparison simply uses numeric comparison (e.g., <). Then, transitivity of < ensures that if an element \(A\) is less than another element \(B\), then \(A\) is also less than all the other elements bigger than \(B\). The same principle applies if the datatype is a string, using string inequality comparison. But what if we are handed more complex datatypes?
- Consider a list of primes as long as the string. Multiply each position by the corresponding prime, and add up the sum: in other words, compute the inner-product or dot-product. For instance, if the string is represented by the character codes [6, 4, 5] (the first character has code 6, the second one 4, and the third 5), we get the hash
(6 * 2) + (4 * 3) + (5 * 5)
or 49. - Simply add together all the character codes. For the above example, this would correspond to the has
6 + 4 + 5
or 15.
Now let us consider more general datatypes. The principle of hashing will be similar. If we have a datatype with several variants, we can use a numeric tag to represent the variants: e.g., the primes will give us invertible tags. For each field of a record, we need an ordering of the fields (e.g., lexicographic, or “alphabetical” order), and must hash their contents recursively; having done so, we get in effect a string of numbers, which we have shown how to handle.
Now that we have understood how one can deterministically convert any arbitrary datum into a number, in what follows, we will assume that the trees representing sets are trees of numbers. However, it is worth considering what we really need out of a hash. In Set Membership Again, we will not need partial ordering. Invertibility is more tricky. In what follows below, we have assumed that finding a hash is tantamount to finding the set element itself, which is not true if multiple values can have the same hash. In that case, the easiest thing to do is to store alongside the hash all the values that hashed to it, and we must search through all of these values to find our desired element. Unfortunately, this does mean that in an especially perverse situation, the desired logarithmic complexity will actually be linear complexity after all!
In real systems, hashes of values are typically computed by the programming language implementation. This has the virtue that they can often be made unique. How does the system achieve this? Easy: it essentially uses the memory address of a value as its hash. (Well, not so fast! Sometimes the memory system can and does move values around (Automated Reclamation, or Garbage Collection). In these cases computing a hash value is more complicated.)
7.2.2 Using Binary Trees
Because logs come from trees.
data BST:
| leaf
| node(v :: Number, l :: BST, r :: BST)
end
fun is-in(e :: Number, s :: BST) -> Bool:
cases (BST) s:
| leaf => false
| node(v, l, r) =>
if v == e: true else: is-in(e, l) or is-in(e, r);
end
end
How can we improve on this? The comparison needs to help us eliminate not only the root but also one whole sub-tree. We can only do this if the comparison “speaks for” an entire sub-tree. It can do so if all elements in one sub-tree are less than or equal to the root value, and all elements in the other sub-tree are greater than or equal to it. Of course, we have to be consistent about which side contains which subset; it is conventional to put the smaller elements to the left and the bigger ones to the right. This refines our binary tree definition to give us a binary search tree (BST).
Here is a candiate predicate for recognizing when a binary tree is in fact a binary search tree:
fun is-a-bst-buggy(b :: BST) -> Bool:
cases (BST) b:
| leaf => true
| node(v, l, r) =>
(is-leaf(l) or (l.v <= v)) and
(is-leaf(r) or (v <= r.v)) and
is-a-bst-buggy(l) and
is-a-bst-buggy(r)
end
end
Is this definition correct?
check:
is-a-bst-buggy(node(5, node(3, leaf, node(6, leaf, leaf)), leaf)) is true # WRONG!
end
Fix the BST checker.
fun is-in(e :: Number, s :: BST) -> Bool:
cases (BST) s:
| leaf => ...
| node(v, l :: BST, r :: BST) => ...
... is-in(l) ...
... is-in(r) ...
end
end
fun is-in(e :: Number, s :: BST) -> Bool:
cases (BST) s:
| leaf => false
| node(v, l, r) =>
if e == v:
true
else if e < v:
is-in(e, l)
else if e > v:
is-in(e, r);
end
end
fun insert(e :: Number, s :: BST) -> BST:
cases (BST) s:
| leaf => node(e, leaf, leaf)
| node(v, l, r) =>
if e == v:
s
else if e < v:
node(v, insert(e, l), r)
else if e > v:
node(v, l, insert(e, r));
end
end
Suppose we frequently needed to compute the size of a set. We ought to be able to reduce the time complexity of size by having each tree ☛ cache its size, so that size could complete in constant time (note that the size of the tree clearly fits my criterion for a cache, since it can always be reconstructed). Update the data definition and all affected functions to keep track of this information correctly.
But wait a minute. Are we actually done? Our recurrence takes the form \(T(k) = T(k/2) + c\), but what in our data definition guaranteed that the size of the child traversed by check will be half the size?
Construct an example—
consisting of a sequence of inserts to the empty tree— such that the resulting tree is not balanced. Show that searching for certain elements in this tree will take linear, not logarithmic, time in its size.
check:
insert(4, insert(3, insert(2, insert(1, mt-set)))) is
node(1, leaf,
node(2, leaf,
node(3, leaf,
node(4, leaf, leaf))))
end
Therefore, using a binary tree, and even a BST, does not guarantee the complexity we want: it does only if our inputs have arrived in just the right order. However, we cannot assume any input ordering; instead, we would like an implementation that works in all cases. Thus, we must find a way to ensure that the tree is always balanced, so each recursive call in size and check really do throw away half the elements.
7.2.3 A Fine Balance: Tree Surgery
Let’s define a balanced binary search tree (BBST). It must obviously be a search tree, so let’s focus on the “balanced” part. We have to be careful about precisely what this means: we can’t simply expect both sides to be of equal size because this demands that the tree (and hence the set) have an even number of elements and, even more stringently, to have a size that is a power of two.
Therefore, we relax the notion of balance to one that is both accommodating and sufficient. We first use the term balance factor for a node to refer to the height of its left child minus the height of its right child (where the height is the depth, in edges, of the deepest node). We then allow every node of a BBST to have a balance factor of \(-1\), \(0\), or \(1\) (but nothing else): that is, either both have the same height, or the left or the right can be one taller. Note that this is a recursive property, but it applies at all levels, so the imbalance cannot accumulate making the whole tree arbitrarily imbalanced.
Given this definition of a BBST, show that the number of nodes is exponential in the height. Thus, always recurring on one branch will terminate after a logarithmic (in the number of nodes) number of steps.
Here is an obvious but useful observation: every BBST is also a BST (this was true by the very definition of a BBST). Why does this matter? It means that a function that operates on a BST can just as well be applied to a BBST without any loss of correctness.
So far, so easy. All that leaves is a means of creating a BBST, because it’s responsible for ensuring balance. It’s easy to see that the constant empty-set is a BBST value. So that leaves only insert.
Here is our situation with insert. Assuming we start with a BBST, we can determine in logarithmic time whether the element is already in the tree and, if so, ignore it.To implement a bag, we track the count of each element, which still does not affect the tree’s size. When inserting an element, given balanced trees, the insert for a BST takes only a logarithmic amount of time to perform the insertion. Thus, if performing the insertion does not affect the tree’s balance, we’re done. Therefore, we only need to consider cases where performing the insertion throws off the balance.
Observe that because \(<\) and \(>\) are in symmetry (likewise with \(<=\) and \(>=\)), we can consider insertions into one half of the tree and a symmetric argument handles insertions into the other half. Thus, suppose we have a tree that is currently balanced into which we are inserting the element \(e\). Let’s say \(e\) is going into the left sub-tree and, by virtue of being inserted, will cause the entire tree to become imbalanced.Some trees, like family trees [REF], represent real-world data. It makes no sense to “balance” a family tree: it must accurately model whatever reality it represents. These set-representing trees, in contrast, are chosen by us, not dictated by some external reality, so we are free to rearrange them.
There are two ways to proceed. One is to consider all the places where we might insert \(e\) in a way that causes an imbalance and determine what to do in each case.
Enumerate all the cases where insertion might be problematic, and dictate what to do in each case.
The number of cases is actually quite overwhelming (if you didn’t think so, you missed a few...). Therefore, we instead attack the problem after it has occurred: allow the existing BST insert to insert the element, assume that we have an imbalanced tree, and show how to restore its balance.The insight that a tree can be made “self-balancing” is quite remarkable, and there are now many solutions to this problem. This particular one, one of the oldest, is due to G.M. Adelson-Velskii and E.M. Landis and, in honor of their initials, called an AVL Tree, though the tree itself is quite evident; their genius is in defining re-balancing.
Thus, in what follows, we begin with a tree that is balanced; insert causes it to become imbalanced; we have assumed that the insertion happened in the left sub-tree. In particular, suppose a (sub-)tree has a balance factor of \(2\) (positive because we’re assuming the left is imbalanced by insertion). The procedure for restoring balance depends critically on the following property:
Show that if a tree is currently balanced, i.e., the balance factor at every node is \(-1\), \(0\), or \(1\), then insert can at worst make the balance factor \(\pm 2\).
The algorithm that follows is applied as insert returns from its recursion, i.e., on the path from the inserted value back to the root. Since this path is of logarithmic length in the set’s size (due to the balancing property), and (as we shall see) performs only a constant amount of work at each step, it ensures that insertion also takes only logarithmic time, thus completing our challenge.
p |
/ \ |
q C |
/ \ |
A B |
Let’s say that \(C\) is of height \(k\). Before insertion, the tree rooted at \(q\) must have had height \(k+1\) (or else one insertion cannot create imbalance). In turn, this means \(A\) must have had height \(k\) or \(k-1\), and likewise for \(B\).
Why can they both not have height \(k+1\) after insertion?
7.2.3.1 Left-Left Case
p |
/ \ |
q C |
/ \ |
r B |
/ \ |
A1 A2 |
\(A_1 < r\).
\(r < A_2 < q\).
\(q < B < p\).
\(p < C\).
The height of \(A_1\) or of \(A_2\) is \(k\) (the cause of imbalance).
The height of the other \(A_i\) is \(k-1\) (see exercise above [REF]).
The height of \(C\) is \(k\) (initial assumption; \(k\) is arbitrary).
The height of \(B\) must be \(k-1\) or \(k\) (argued above).
q |
/ \ |
r p |
/ \ / \ |
A1 A2 B C |
7.2.3.2 Left-Right Case
p |
/ \ |
q C |
/ \ |
A r |
/ \ |
B1 B2 |
\(A < q\).
\(q < B_1 < r\).
\(r < B_2 < p\).
\(p < C\).
Suppose the height of \(C\) is \(k\).
The height of \(A\) must be \(k-1\) or \(k\).
The height of \(B_1\) or \(B_2\) must be \(k\), but not both (see exercise above [REF]). The other must be \(k-1\).
p |
/ \ |
r C |
/ \ |
q B2 |
/ \ |
A B1 |
r |
/ \ |
q p |
/ \ / \ |
A B1 B2 C |
7.2.3.3 Any Other Cases?
Were we a little too glib before? In the left-right case we said that only one of \(B_1\) or \(B_2\) could be of height \(k\) (after insertion); the other had to be of height \(k-1\). Actually, all we can say for sure is that the other has to at most height \(k-2\).
Can the height of the other tree actually be \(k-2\) instead of \(k-1\)?
If so, does the solution above hold? Is there not still an imbalance of two in the resulting tree?
Is there actually a bug in the above algorithm?