10 Graphs

In Cyclic Data we introduced a special kind of sharing: when the data become cyclic, i.e., there exist values such that traversing other reachable values from them eventually gets you back to the value at which you began. Data that have this characteristic are called graphs.Technically, a cycle is not necessary to be a graph; a tree or a DAG is also regarded as a (degenerate) graph. In this section, however, we are interested in graphs that have the potential for cycles.

Lots of very important data are graphs. For instance, the people and connections in social media form a graph: the people are nodes or vertices and the connections (such as friendships) are links or edges. They form a graph because for many people, if you follow their friends and then the friends of their friends, you will eventually get back to the person you started with. (Most simply, this happens when two people are each others’ friends.) The Web, similarly is a graph: the nodes are pages and the edges are links between pages. The Internet is a graph: the nodes are machines and the edges are links between machines. A transportation network is a graph: e.g., cities are nodes and the edges are transportation links between them. And so on. Therefore, it is essential to understand graphs to represent and process a great deal of interesting real-world data.

Graphs are important and interesting for principled reasons, too. For one thing, the property that we can end up where we began means that traditional methods of processing will no longer work: if we blindly process every node we visit, we could end up in an infinite loop. Therefore, we need better structural recipes for our programs. In addition, graphs have a very rich structure, which lends itself to several interesting computations over them. We will study both these aspects of graphs below.

10.1 Representations

There are numerous ways to represent graphs, and the choice of representation depends on several factors:

The structure of the graph, and in particular, its density. We will discuss this further later (Measuring Complexity for Graphs).
The representation in which the data are provided by external sources. Sometimes it may be easier to simply adapt to their representation; in particular, in some cases there may not even be a choice.
The features provided by the programming language, which make some representations much harder to use than others.

We will use primarily one representation in this section, but for the sake of exploration, we’ll look at a few others as well.

10.1.1 Direct Links

In the representation we have already seen (Cyclic Data), we have assumed that each node is some structured datum that has references to the nodes that are adjacent to it. Because this was previously left implicit, however, let us define this explicitly:

data SharedNode:
| node(content, adj :: List<SharedNode>)
end

# A graph is a List<SharedNode>

For instance, here is a small network of airports and the cities to which each one has direct flights:

<flight-graph> ::=

graph:

EWR = node("Newark", [ORD, DEN, SFO, HOU])

ORD = node("Chicago", [EWR, SFO])

WOS = node("Worcester", [])

HOU = node("Houston", [EWR, SFO])

DEN = node("Denver", [EWR, SFO])

SFO = node("San Francisco", [EWR, DEN, HOU])

end

sg-cities = [EWR, ORD, WOS, HOU, DEN, SFO]

With this definition, we can write a function that gives us the names of cities to which we can fly directly from a given city:

fun sg-neighbors(city :: SharedNode,  graph :: List<SharedNode>)
    -> List<SharedNode>:
  map(fun(n): n.content;, city.adj)
end

For example:

check:
sg-neighbors(HOU, sg-cities)
is ["Newark", "San Francisco"]
end

10.1.2 Links by Name

Many languages make it difficult to construct graphs directly in the language’s syntax (though we will see how we can construct graphs with direct references—with more effort and less elegantly—later [REF]). In most languages, therefore, graphs must be described using indirect references between nodes. Here we see one of them: giving each node a “name” and then looking it up by name.In contrast, with direct references every node is its own “name”; putting it differently, because we can access an adjacent node directly, we don’t need a name with which to look it up.

First, the structure of data:

data KeyedNode:
| keyed-node(key :: String, content, adj :: List<String>)
end

Note that every node has a key, and its neighbors are identified by strings that we have to look up to get to the actual node. Here’s how we do it:

fun find-kn(key, graph :: List<KeyedNode>) -> KeyedNode:
  matches = for filter(n from graph):
    n.key == key
  end
  matches.first # there had better be exactly one!
end

With this representation, we can write the same datum again:

kn-cities = block:
  knEWR = keyed-node("nwk", "Newark", ["chi", "den", "saf", "hou"])
  knORD = keyed-node("chi", "Chicago", ["nwk", "saf"])
  knWOS = keyed-node("wor", "Worcester", [])
  knHOU = keyed-node("hou", "Houston", ["nwk", "saf"])
  knDEN = keyed-node("den", "Denver", ["nwk", "saf"])
  knSFO = keyed-node("saf", "San Francisco", ["nws", "den", "hou"])
  [knEWR, knORD, knWOS, knHOU, knDEN, knSFO]
end

Finally, we can look up neighbors:

fun kn-neighbors(city :: KeyedNode,  graph :: List<KeyedNode>)
    -> List<KeyedNode>:
  map(fun(n): n.content;, map(find-kn(_, graph), city.adj))
end

With this our test continues to work:

check:
kn-neighbors(find-kn("hou", kn-cities), kn-cities)
is ["Newark", "San Francisco"]
end

10.1.3 Links by Indices

In some languages, it is common to use numbers as names. This is especially useful when numbers can be used to get access to an element in a constant amount of time (in return for having a bound on the number of elements that can be accessed). Here, we use a list—which does not provide constant-time access to any element—to represent the collection to illustrate this concept. First, the datatype:

data IndexedNode:
| idxed-node(content, adj :: List<Number>)
end

To find a node:

fun find-in(idx, graph :: List<IndexedNode>) -> IndexedNode:
index(graph, idx)
end

Our graph now looks like this:

in-cities = block:
  inEWR = idxed-node("Newark", [1, 4, 5, 3])
  inORD = idxed-node("Chicago", [0, 5])
  inWOS = idxed-node("Worcester", [])
  inHOU = idxed-node("Houston", [0, 5])
  inDEN = idxed-node("Denver", [0, 5])
  inSFO = idxed-node("San Francisco", [0, 4, 3])
  [inEWR, inORD, inWOS, inHOU, inDEN, inSFO]
end

We can then find neighbors almost as before:

fun in-neighbors(city :: IndexedNode,  graph :: List<IndexedNode>)
    -> List<IndexedNode>:
  map(fun(n): n.content;, map(find-in(_, graph), city.adj))
end

and our test works as before:

check:
in-neighbors(find-in(3, in-cities), in-cities)
is ["Newark", "San Francisco"]
end

10.1.4 A List of Edges

The three representations we have seen until now have given priority to nodes, making edges simply a part of the information in a node. We could, instead, use a representation that makes edges primary, and nodes simply be the entities that lie at their ends:

data Edge:
| edge(src :: String, dst :: String)
end
# A graph is a list of Edges

Then, our flight network becomes:

en-cities =
  [ edge("Newark", "Chicago"),
    edge("Newark", "Denver"),
    edge("Newark", "San Francisco"),
    edge("Newark", "Houston"),
    edge("Chicago", "Newark"),
    edge("Chicago", "San Francisco"),
    edge("Houston", "Newark"),
    edge("Houston", "San Francisco"),
    edge("Denver", "Newark"),
    edge("Denver", "San Francisco"),
    edge("San Francisco", "Newark"),
    edge("San Francisco", "Denver"),
    edge("San Francisco", "Houston") ]

To obtain the set of neighbors:

fun en-neighbors(city :: String, graph :: List<Edge>) -> List<String>:
  neighbors = for filter(c from graph):
    city == c.src
  end
  names = for map(c from neighbors): c.dst;
  names
end

And to be sure:

check:
en-neighbors("Houston", en-cities) is ["Newark", "San Francisco"]
end

10.2 Measuring Complexity for Graphs

Before we begin to define algorithms over graphs, we should consider how to measure the size of a graph. A graph has two components: its nodes and its edges. Some algorithms are going to focus on nodes (e.g., visiting each of them), while others will focus on edges, and some will care about both. So which do we use as the basis for counting operations: nodes or edges?

It would help if we can reduce these two measures to one. To see whether that’s possible, suppose a graph has \(k\) nodes. Then its number of edges has a wide range:

No two nodes are connected. Then there are no edges at all.
Every two nodes is connected. Then there are essentially as many edges as the number of pairs of nodes, or \(k^2\).

In general, we call a graph with significantly fewer edges than nodes a sparse graph, while one with significantly more edges than nodes is called dense.

The number of nodes can thus be significantly less or even significantly more than the number of edges. Were this difference a matter of constants, we could have ignored it; but it’s not. For sparse graphs, the number of nodes dominates the number of edges by a factor of \(k\) (or even infinity, if there truly are zero edges, but such graphs are usually not very interesting or difficult to process); for extremely dense graphs, too, the ratio is one of \(k\), but in the other direction.

Therefore, when we want to speak of the complexity of algorithms over graphs, we have to consider the sizes of both the number of nodes and edges. In a connected graphA graph is connected if, from every node, we can traverse edges to get to every other node., however, there must be at least as many edges as nodes, which means the number of edges dominates the number of nodes. Since we are usually processing connected graphs, or connected parts of graphs one at a time, we can bound the number of nodes by the number of edges.

10.3 Reachability

Many uses of graphs need to address reachability: whether we can get, using edges in the graph, from one node to another. For instance, a social network might suggest as contacts all those who are reachable from existing contacts. On the Internet, traffic engineers care about whether packets can get from one machine to another. On the Web, we care about whether all public pages on a site are reachable from the home page.

In what follows, we will use the travel graph (<flight-graph>) as a running example, and direct links (Direct Links) as its representation. We will develop the implementation over multiple iterations.

10.3.1 Simple Recursion

At its simplest, reachability is easy. We want to know whether there exists a pathA path is a sequence of zero or more linked edges. between a pair of nodes, a source and a destination. (A more sophisticated version of reachability might compute the actual path, but we’ll ignore this for now.) There are two possibilities: the source and destintion nodes are the same, or they’re not.

If they are the same, then clearly reachability is trivially satisfied.
If they are not, we have to iterate through the neighbors of the source node and ask whether the destination is reachable from each of those neighbors.

This translates into the following function:

<graph-reach-1-main> ::=

fun reach-1(src :: Node, dst :: Node) -> Bool:

if identical(src, dst):

true

else:

<graph-reach-1-loop>

loop(src.adj)

end

where the loop through the neighbors of src is:

<graph-reach-1-loop> ::=

fun loop(neighbors):

cases (List) neighbors:

| empty => false

| link(f, r) =>

if reach-1(f, dst): true else: loop(r);

end

We can test this as follows (after binding reach as reach-1):

<graph-reach-tests> ::=

check:

reach(EWR, EWR) is true

reach(EWR, ORD) is true

reach(EWR, WOS) is false

reach(EWR, HOU) is true

reach(EWR, DEN) is true

reach(EWR, SFO) is true

end

Unfortunately, we don’t find out about how these tests fare, because some of them don’t complete at all. That’s because we have an infinite loop, due to the cyclic nature of graphs!

Exercise

Which of the above examples leads to a cycle? Why?

10.3.2 Cleaning up the Loop

Before we continue, let’s try to improve the expression of the loop. While the nested function above is a perfectly reasonable definition, we can use Pyret’s for to improve its readability.

The essence of the above loop is to iterate over a list of boolean values; if one of them is true, the entire loop evaluates to true; if they are all false, then we haven’t found a path to the destination node, so the loop evaluates to false. Thus:

fun ormap(fun-body, l):
  cases (List) l:
    | empty => false
    | link(f, r) =>
      if fun-body(f): true else: ormap(fun-body, r);
  end
end

With this, we can replace the loop definition and use with:

for ormap(n from src.adj):
reach-1(n, dst)
end

10.3.3 Traversal with Memory

Because we have cyclic data, we have to remember what nodes we’ve already visited and avoid traversing them again. Then, every time we begin traversing a new node, we add it to the set of nodes we’ve already started to visit so that. If we return to that node, because we can assume the graph has not changed in the meanwhile, we know that additional traversals from that node won’t make any difference to the outcome.This property is known as ☛ idempotence.

We therefore define a second attempt at reachability that take an extra argument: the set of nodes we have begun visiting (where the set is represented as a graph). The key difference from <graph-reach-1-main> is, before we begin to traverse edges, we should check whether we’ve begun processing the node or not. This results in the following definition:

<graph-reach-2> ::=

fun reach-2(src :: Node, dst :: Node, visited :: List<Node>)

-> Bool:

if is-in(src, visited):

false

else if identical(src, dst):

true

else:

new-visited = link(src, visited)

for ormap(n from src.adj):

reach-2(n, dst, new-visited)

end

Exercise

Define is-in.

Exercise

Does it matter if the first two conditions were swapped, i.e., the beginning of reach-2 began with
if identical(src, dst):
true
else if is-in(src, visited):
false
?

Exercise

We repeatedly talk about remembering the nodes that we have begun to visit, not the ones we’ve finished visiting. Does this distinction matter? How?

This now gives us a correct implementation, as we can confirm by trying the tests in <graph-reach-tests> after binding reach to reach-2 and passing the empty list as the initial set of visited nodes.

10.3.4 A Better Interface

As the process of testing reach-2 shows, we may have a correct implementation, but we’ve changed the function’s interface; now it has a needless extra argument, which is not only a nuisance but might also result in errors if we accidentally misuse it. Therefore, we should clean up our definition by moving the core code to an internal function:

fun reach-3(s :: Node, d :: Node) -> Bool:
  fun reacher(src :: Node,
              dst :: Node,
              visited :: List<Node>) -> Bool:
    if is-in(src, visited):
      false
    else if identical(src, dst):
      true
    else:
      new-visited = link(src, visited)
      for ormap(n from src.adj):
        reacher(n, dst, new-visited)
      end
    end
  end
  reacher(s, d, empty)
end

We have now restored the original interface while correctly implementing reachability.

10.4 Depth- and Breadth-First Traversals

It is conventional for computer science texts to call these depth- and breadth-first search. However, searching is just a specific purpose; traversal is a general task that can be used for many purposes.

The reachability algorithm we have seen above has a special property. At every node it visits, there is usually a set of adjacent nodes at which it can continue the traversal. It has at least two choices: it can either visit each immediate neighbor first, then visit all of the neighbors’ neighbors; or it can choose a neighbor, recur, and visit the next immediate neighbor only after that visit is done. The former is known as breadth-first traversal, while the latter is depth-first traversal.

The algorithm we have designed uses a depth-first strategy: inside <graph-reach-1-loop>, we recur on the first element of the list of neighbors before we visit the second neighbor, and so on. The alternative would be to have a data structure into which we insert all the neighbors, then pull out an element at a time such that we first visit all the neighbors before their neighbors, and so on. This naturally corresponds to a queue (An Example: Queues from Lists).

Exercise

Using a queue, implement breadth-first traversal.

If we correctly check to ensure we don’t re-visit nodes, then both breadth- and depth-first traversal will properly visit the entire reachable graph without repetition (and hence not get into an infinite loop). Each one traverses from a node only once, from which it considers every single edge. Thus, if a graph has \(N\) nodes and \(E\) edges, then a lower-bound on the complexity of traversal is \(O(N + E)\). We must also consider the cost of checking whether we have already visited a node before (which is a set membership problem, which we address elsewhere: Sets Appeal). Finally, we have to consider the cost of maintaining the data structure that keeps track of our traversal. In the case of depth-first traversal, recursion—which uses the machine’s stack—does it automatically at constant overhead. In the case of breadth-first traversal, the program must manage the queue, which can add more than constant overhead.In practice, too, the stack will usually perform much better than a queue, because it is supported by machine hardware.

This would suggest that depth-first traversal is always better than breadth-first traversal. However, breadth-first traversal has one very important and valuable property. Starting from a node \(N\), when it visits a node \(P\), count the number of edges taken to get to \(P\). Breadth-first traversal guarantees that there cannot have been a shorter path to \(P\): that is, it finds a shortest path to \(P\).

Exercise

Why “a” rather than “the” shortest path?

Do Now!

Prove that breadth-first traversal finds a shortest path.

10.5 Graphs With Weighted Edges

Consider a transportation graph: we are usually interested not only in whether we can get from one place to another, but also in what it “costs” (where we may have many different cost measures: money, distance, time, units of carbon dioxide, etc.). On the Internet, we might care about the ☛ latency or ☛ bandwidth over a link. Even in a social network, we might like to describe the degree of closeness of a friend. In short, in many graphs we are interested not only in the direction of an edge but also in some abstract numeric measure, which we call its weight.

In the rest of this study, we will assume that our graph edges have weights. This does not invalidate what we’ve studied so far: if a node is reachable in an unweighted graph, it remains reachable in a weighted one. But the operations we are going to study below only make sense in a weighted graph.We can, however, always treat an unweighted graph as a weighted one by giving every edge the same, constant, positive weight (say one).

Exercise

When treating an unweighted graph as a weighted one, why do we care that every edge be given a positive weight?

Let us first revise our data definition to include weights:

data Node:
| node(content,
adj :: List<{weight : Number, dst : Node}>)
end

# A graph is a List<Node>

10.6 Shortest (or Lightest) Paths

Imagine planning a trip: it’s natural that you might want to get to your destination in the least time, or for the least money, or some other criterion that involves minimizing the sum of edge weights. This is known as computing the shortest path.

We should immediately clarify an unfortunate terminological confusion. What we really want to compute is the lightest path—the one of least weight. Unfortunately, computer science terminology has settled on the terminology we use here; just be sure to not take it literally.

Exercise

Construct a graph and select a pair of nodes in it such that the shortest path from one to the other is not the lightest one, and vice versa.

We have already seen (Depth- and Breadth-First Traversals) that breadth-first search constructs shortest paths in unweighted graphs. These correspond to lightest paths when there are no weights (or, equivalently, all weights are identical and positive). Now we have to generalize this to the case where the edges have weights.

We will proceed inductively, gradually defining a function seemingly of this type

w :: Node -> Number

that reflects the weight of the lightest path from the source node to that one. But let’s think about this annotation: since we’re building this up node-by-node, initially most nodes have no weight to report; and even at the end, a node that is unreachable from the source will have no weight for a lightest (or indeed, any) path. Rather than make up a number that pretends to reflect this situation, we will instead use an option type:

w :: Node -> Option<Number>

When there is some value, it will be the weight, otherwise the weight will be none.

Now let’s think about this inductively. What do we know initially? Well, certainly that the source node is at a distance of zero from itself (that must be the lightest path, because we can’t get any lighter). This gives us a (trivial) set of nodes for which we already know the lightest weight. Our goal is to grow this set of nodes—modestly, by one, on each iteration—until we either find the destination, or we have no more nodes to add (in which case our desination is not reachable from the source).

Inductively, at each step we have the set of all nodes for which we know the lightest path (initially this is just the source node, but it does mean this set is never empty, which will matter in what we say next). Now consider all the edges adjacent to this set of nodes that lead to nodes for which we don’t already know the lightest path. Choose a node that minimizes the total weight of the path to it. We claim that this will in fact be the lightest path to that node.

If this claim is true, then we are done. That’s because we would now add \(q\) to the set of nodes whose lightest weights we now know, and repeat the process of finding lightest outgoing edges from there. This process has thus added one more node. At some point we will find that there are no edges that lead outside the known set, at which point we can terminate.

It stands to reason that terminating at this point is safe: it corresponds to having computed the reachable set. The only thing left is to demonstrate that this greedy algorithm yields a lightest path to each node.

We will prove this by contradiction. Suppose we have the path \(s \rightarrow d\) from source \(s\) to node \(d\), as found by the algorithm above, but assume also that we have a different path that is actually lighter. At every node, when we added a node along the \(s \rightarrow d\) path, the algorithm would have added a lighter path if it existed. The fact that it did not falsifies our claim that a lighter path exists (there could be a different path of the same weight; this would be permitted by the algorithm, but it also doesn’t contradict our claim). Therefore the algorithm does indeed find the lightest path.

What remains is to determine a data structure that enables this algorithm. At every node, we want to know the least weight from the set of nodes for which we know the least weight to all their neighbors. We could achieve this by sorting, but this is overkill: we don’t actually need a total ordering on all these weights, only the lightest one. A heap [REF] gives us this.

Exercise

What if we allowed edges of weight zero? What would change in the above algorithm?

Exercise

What if we allowed edges of negative weight? What would change in the above algorithm?

For your reference, this algorithm is known as Dijkstra’s Algorithm.

10.7 Moravian Spanning Trees

At the turn of the milennium, the US National Academy of Engineering surveyed its members to determine the “Greatest Engineering Achievements of the 20th Century”. The list contained the usual suspects: electronics, computers, the Internet, and so on. But a perhaps surprising idea topped the list: (rural) electrification.Read more about it on their site.

10.7.1 The Problem

To understand the history of national electrical grids, it helps to go back to Moravia in the 1920s. Like many parts of the world, it was beginning to realize the benefits of electricity and intended to spread it around the region. A Moravian academia named Otakar Borůvka heard about the problem, and in a remarkable effort, described the problem abstractly, so that it could be understood without reference to Moravia or electrical networks. He modeled it as a problem about graphs.

Borůvka observed that at least initially, any solution to the problem of creating a network must have the following characteristics:

The electrical network must reach all the towns intended to be covered by it. In graph terms, the solution must be spanning, meaning it must visit every node in the graph.
Redundancy is a valuable property in any network: that way, if one set of links goes down, there might be another way to get a payload to its destination. When starting out, however, redundancy may be too expensive, especially if it comes at the cost of not giving someone a payload at all. Thus, the initial solution was best set up without loops or even redundant paths. In graph terms, the solution had to be a tree.
Finally, the goal was to solve this problem for the least cost possible. In graph terms, the graph would be weighted, and the solution had to be a minimum.

Thus Borůvka defined the Moravian Spanning Tree (MST) problem.

10.7.2 A Greedy Solution

Borůvka had published his problem, and another Czech mathematician, Vojtěch Jarník, came across it. Jarník came up with a solution that should sound familiar:

Begin with a solution consisting of a single node, chosen arbitrarily. For the graph consisting of this one node, this solution is clearly a minimum, spanning, and a tree.
Of all the edges incident on nodes in the solution that connect to a node not already in the solution, pick the edge with the least weight.Note that we consider only the incident edges, not their weight added to the weight of the node to which they are incident.
Add this edge to the solution. The claim is that for the new solution will be a tree (by construction), spanning (also by construction), and a minimum. The minimality follows by an argument similar to that used for Dijkstra’s Algorithm.

Jarník had the misfortune of publishing this work in Czech in 1930, and it went largely ignored. It was rediscovered by others, most notably by R.C. Prim in 1957, and is now generally known as Prim’s Algorithm, though calling it Jarník’s Algorithm would attribute credit in the right place.

Implementing this algorithm is pretty easy. At each point, we need to know the lightest edge incident on the current solution tree. Finding the lightest edge takes time linear in the number of these edges, but the very lightest one may create a cycle. We therefore need to efficiently check for whether adding an edge would create a cycle, a problem we will return to multiple times (Checking Component Connectedness). Assuming we can do that effectively, we then want to add the lightest edge and iterate. Even given an efficient solution for checking cyclicity, this would seem to require an operation linear in the number of edges for each node. With better representations we can improve on this complexity, but let’s look at other ideas first.

10.7.3 Another Greedy Solution

Recall that Jarník presented his algorithm in 1930, when computers didn’t exist, and Prim his in 1957, when they were very much in their infancy. Programming computers to track heaps was a non-trivial problem, and many algorithms were implemented by hand, where keeping track of a complex data structure without making errors was harder still. There was need for a solution that was required less manual bookkeeping (literally speaking).

In 1956, Joseph Kruskal presented such a solution. His idea was elegantly simple. The Jarník algorithm suffers from the problem that each time the tree grows, we have to revise the content of the heap, which is already a messy structure to track. Kruskal noted the following.

To obtain a minimum solution, surely we want to include one of the edges of least weight in the graph. Because if not, we can take an otherwise minimal solution, add this edge, and remove one other edge; the graph would still be just as connected, but the overall weight would be no more and, if the removed edge were heavier, would be less.Note the careful wording: there may be many edges of the same least weight, so adding one of them may remove another, and therefore not produce a lighter tree; but the key point is that it certainly will not produce a heavier one. By the same argument we can add the next lightest edge, and the next lightest, and so on. The only time we cannot add the next lightest edge is when it would create a cycle (that problem again!).

Therefore, Kruskal’s algorithm is utterly straightforward. We first sort all the edges, ordered by ascending weight. We then take each edge in ascending weight order and add it to the solution provided it will not create a cycle. When we have thus processed all the edges, we will have a solution that is a tree (by construction), spanning (because every connected vertex must be the endpoint of some edge), and of minimum weight (by the argument above). The complexity is that of sorting (which is \([e \rightarrow e \log e]\) where \(e\) is the size of the edge set. We then iterate over each element in \(e\), which takes time linear in the size of that set—modulo the time to check for cycles. This algorithm is also easy to implement on paper, because we sort all the edges once, then keep checking them off in order, crossing out the ones that create cycles—with no dynamic updating of the list needed.

10.7.4 A Third Solution

Both the Jarník and Kruskal solutions have one flaw: they require a centralized data structure (the priority heap, or the sorted list) to incrementally build the solution. As parallel computers became available, and graph problems grew large, computer scientists looked for solutions that could be implemented more efficiently in parallel—which typically meant avoiding any centralized points of synchronization, such as these centralized data structures.

In 1965, M. Sollin constructed an algorithm that met these needs beautifully. In this algorithm, instead of constructing a single solution, we grow multiple solution components (potentially in parallel if we so wish). Each node starts out as a solution component (as it was at the first step of Jarník’s Algorithm). Each node considers the edges incident to it, and picks the lightest one that connects to a different component (that problem again!). If such an edge can be found, the edge becomes part of the solution, and the two components combine to become a single component. The entire process repeats.

Because every node begins as part of the solution, this algorithm naturally spans. Because it checks for cycles and avoids them, it naturally forms a tree.Note that avoiding cycles yields a DAG and is not automatically guaranteed to yield a tree. We have been a bit lax about this difference throughout this section. Finally, minimality follows through similar reasoning as we used in the case of Jarník’s Algorithm, which we have essentially run in parallel, once from each node, until the parallel solution components join up to produce a global solution.

Of course, maintaining the data for this algorithm by hand is a nightmare. Therefore, it would be no surprise that this algorithm was coined in the digital age. The real surprise, therefore, is that it was not: it was originally created by Otakar Borůvka himself.

Borůvka, you see, had figured it all out. He’d not only understood the problem, he had:

pinpointed the real problem lying underneath the electrification problem so it could be viewed in a context-independent way,
created a descriptive language of graph theory to define it precisely, and
even solved the problem in addition to defining it.

He’d just come up with a solution so complex to implement by hand that Jarník had in essence de-parallelized it so it could be done sequentially. And thus this algorithm lay unnoticed until it was reinvented (several times, actually) by Sollin in time for parallel computing folks to notice a need for it. But now we can just call this Borůvka’s Algorithm, which is only fitting.

As you might have guessed by now, this problem is indeed called the MST in other textbooks, but “M” stands not for Moravia but for “Minimum”. But given Borůvka’s forgotten place in history, I prefer the more whimsical name.

10.7.5 Checking Component Connectedness

As we’ve seen, we need to be able to efficiently tell whether two nodes are in the same component. One way to do this is to conduct a depth-first traversal (or breadth-first traversal) starting from the first node and checking whether we ever visit the second one. (Using one of these traversal strategies ensures that we terminate in the presence of loops.) Unfortunately, this takes a linear amount of time (in the size of the graph) for every pair of nodes—and depending on the graph and choice of node, we might do this for every node in the graph on every edge addition! So we’d clearly like to do this better.

It is helpful to reduce this problem from graph connectivity to a more general one: of disjoint-set structure (colloquially known as union-find for reasons that will soon be clear). If we think of each connected component as a set, then we’re asking whether two nodes are in the same set. But casting it as a set membership problem makes it applicable in several other applications as well.

The setup is as follows. For arbitrary values, we want the ability to think of them as elements in a set. Thus we would like a representation for set elements:

data Element<t>:
| elt (val :: T)
end

We are interested in two operations. One is obviously union, which merges two sets into one. The other would seem to be something like is-in-same-set that takes two elements and determines whether they’re in the same set. Over time, however, it has proven useful to instead define the operator find that, given an element, “names” the set (more on this in a moment) that the element belongs to. To check whether two elements are in the same set, we then have to get the “set name” for each element, and check whether these names are the same. This certainly sounds more roundabout, but this means we have a primitive that may be useful in other contexts, and from which we can easily implement is-in-same-set.

Now the question is, how do we name sets? The real question we should ask is, what operations do we care to perform on these names? All we care about is, given two names, they represent the same set precisely when the names are the same. Therefore, we could construct a new string, or number, or something else, but we have another option: simply pick some element of the set to represent it, i.e., to serve as its name. Then, identical compares these for equality, and all we need to ensure is this: for a given set, we always return the same representative element. (Otherwise, equality will fail even though we have the same set.) Thus:We’ve used the name fynd because find is already defined to mean something else in Pyret.

fun is-in-same-set(e1 :: Element, e2 :: Element, s :: List<Element>)
    -> Bool:
  s1 = fynd(e1, s)
  s2 = fynd(e2, s)
  identical(s1, s2)
end

So here’s the idea. We will represent a set of elements as a set (or list) of pairs of elements. The paired element will be an indicator of the “set name” for that element, if it is some other element other than this one. Thus:

data Element:
| elt (val, parent :: Option<Element>)
end

where a disjoint set is a List<Element>. We will assume we have some equality predicate for checking when two elements are the same, which we do by comparing their value parts, ignoring their parent values:

fun is-same-element(e1, e2): e1.val == e2.val end

Do Now!

Why do we check the value parts and not just use identical? What flexibility do we gain from this?

How do we find the representative element for a set? We first find it in the set using is-same-element; when we do, we check the element’s parent field. If it is none, that means this very element names its set; this can happen either because the element is a singleton set (we’ll initialize all elements with none), or it’s the name for some larger set. Either way, we’re done. Otherwise, we have to recursively find the parent:

fun fynd(e :: Element, s :: List<Element>) -> Element:
  cases (List) s:
    | empty => raise("fynd: shouldn't have gotten here")
    | link(f, r) =>
      if is-same-element(f, e):
        cases (Option) f.parent:
          | none => f
          | some(p) => fynd(p, s);
      else:
        fynd(e, r)
      end
  end
end

Exercise

Why does this have to be recursive?

what’s left is to implement union. For this, we find the representative elements of the two sets we’re trying to union; if they are the same, then the two sets are already in a union; otherwise, we have to update the data structure:

fun union(e1 :: Element, e2 :: Element, s :: List<Element>)
    -> List<Element>:
  s1 = fynd(e1, s)
  s2 = fynd(e2, s)
  if identical(s1, s2):
    s
  else:
    update-set-with(s, s1, s2);
end

To update, we arbitrarily choose one of the set names to be the name of the new compound set. We then have to update the parent of the other set’s name element to be this one:

fun update-set-with(s :: List<Element>, child :: Element, parent :: Element)
    -> List<Element>:
  cases (List) s:
    | empty => raise("update: shouldn't have gotten here")
    | link(f, r) =>
      if is-same-element(f, child):
        link(elt(f.val, some(parent)), r)
      else:
        link(f, update-set-with(r, child, parent));
  end
end

Here are some tests to illustrate this working:

check:
  s0 = map(elt(_, none), [0, 1, 2, 3, 4, 5, 6, 7])
  s1 = union(index(s0, 0), index(s0, 2), s0)
  s2 = union(index(s1, 0), index(s1, 3), s1)
  s3 = union(index(s2, 3), index(s2, 5), s2)
  is-same-element(fynd(index(s0, 0), s3), fynd(index(s0, 5), s3)) is true
  is-same-element(fynd(index(s0, 2), s3), fynd(index(s0, 5), s3)) is true
  is-same-element(fynd(index(s0, 3), s3), fynd(index(s0, 5), s3)) is true
  is-same-element(fynd(index(s0, 5), s3), fynd(index(s0, 5), s3)) is true
  is-same-element(fynd(index(s0, 7), s3), fynd(index(s0, 7), s3)) is true
end

Unfortunately, this implementation suffers from two major problems:

First, because we are performing functional updates, the value of the parent reference keeps “changing”, but these changes are not visible to older copies of the “same” value. This is why we can’t use identical: an element from different stages of unioning has differnt parent references, even though it is arguably the same element throughout. This is a place where functional programming hurts.
Relatedly, the performance of this implementation is quite bad. fynd recursively traverses parents to find the set’s name, but the elements traversed are not updated to record this new name. We certainly could update them by reconstructing the set afresh each time, but that complicates the implementation and, as we will soon see, we can do much better.

The bottom line is that pure functional programming is not a great fit with this problem. We need a better implementation strategy: Implementing Disjoint Sets with Mutation.

← prev up next →

1	Introduction
2	The Value(s) of Computing
3	Functions
4	Data Structures
5	Fun with Functions
6	Predicting Growth
7	Sets Appeal
8	Halloween Analysis
9	Sharing and Equality
10	Graphs
11	State and Change
12	Shrinking the Language [EMPTY]
13	Everything (We Will Say) About Parsing
14	A First Look at Interpretation
15	Evaluating Functions
16	Mutation: Structures and Variables
17	Recursion and Cycles
18	Objects
19	Control Operations
20	Checking Program Invariants Statically: Types
21	Dynamic Reliability: Contracts [EMPTY]
22	Alternate Application Semantics [EMPTY]
23	Automated Memory Management
24	Representation Decisions
25	Desugaring as a Language Feature [EMPTY]
26	Glossary

10.1	Representations
10.2	Measuring Complexity for Graphs
10.3	Reachability
10.4	Depth- and Breadth-First Traversals
10.5	Graphs With Weighted Edges
10.6	Shortest (or Lightest) Paths
10.7	Moravian Spanning Trees