Data Structures and Algorithms

Minimum Spanning Trees

Given a weighted, undirected graph $G = (V, E)$, find a subgraph $T = (V, E')$ with $E' \subseteq E$ such that all the vertices are connected and the sum of edge weights in $T$ is minimal.

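See a cycle in $T$? Remove an edge from it! The vertices stay connected, and (assuming non-negative weights) the total weight does not go up. Therefore, $T$ must be a tree (no cycles). The root of the tree doesn't matter.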

In a planar MST, the vertices are points on a flat plane. Every pair of points is joined by an edge, and each edge's weight is the distance between its two endpoints.
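A minimal sketch of how such a graph could be built from a list of points. The function name `complete_graph` and the `(distance, i, j)` edge layout are choices made here for illustration, not something fixed by the notes.

```python
import math
from itertools import combinations

def complete_graph(points):
    """Return every pairwise edge as (distance, i, j) for a list of (x, y) points."""
    return [(math.dist(points[i], points[j]), i, j)
            for i, j in combinations(range(len(points)), 2)]

# Three points forming a 3-4-5 right triangle.
edges = complete_graph([(0, 0), (3, 0), (3, 4)])
print(edges)  # [(3.0, 0, 1), (5.0, 0, 2), (4.0, 1, 2)]
```

With $n$ points this produces $n(n-1)/2$ edges, so a planar MST instance is always a dense graph.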

Presence of the Shortest Edge in MST (Proof)

How could we prove that the shortest edge must be in every possible MST (assuming it is strictly shorter than every other edge)? We could use proof by contradiction. Let's say some MST leaves the lowest-cost edge out. Add that edge to the tree. The tree already connects its two endpoints, so this creates a cycle. Now remove any other edge on that cycle: the result is still a spanning tree, and the edge you removed is heavier than the edge you added, since the one you added is the shortest. The new tree weighs less than the supposed MST, and this is a contradiction. QED. (If several edges tie for shortest, the same swap shows that any given shortest edge is in at least one MST.)
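The swap at the heart of this proof can be written out as a short calculation. The symbols are notation introduced here for illustration: $e$ is the shortest edge, $T$ is a spanning tree that leaves $e$ out, $f$ is any other edge on the cycle that $e$ closes in $T$, and $w(\cdot)$ denotes weight.

$$
T' = (T \setminus \{f\}) \cup \{e\}, \qquad w(T') = w(T) - w(f) + w(e) \le w(T) \quad \text{because } w(e) \le w(f).
$$

When $w(e) < w(f)$ the inequality is strict, so $T$ could not have been minimal.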

A similar argument covers the second shortest edge, because two edges alone cannot form a cycle (assuming no parallel edges). The third shortest edge, however, does not have to be in the MST! It might or might not be in, depending on whether it forms a cycle with the first two.

It's also possible to have more than one MST for a given graph when some edge weights are equal.
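Both claims can be checked by brute force on tiny graphs. The sketch below is only meant for small examples, since it enumerates every set of $|V| - 1$ edges. The helper names `spanning_trees` and `all_msts` and the `(weight, u, v)` edge layout are assumptions made here.

```python
from itertools import combinations

def spanning_trees(num_vertices, edges):
    """Yield every set of num_vertices - 1 edges that connects all vertices."""
    for subset in combinations(edges, num_vertices - 1):
        reached = {0}
        changed = True
        while changed:  # grow the reached set until nothing new is added
            changed = False
            for _, u, v in subset:
                if (u in reached) != (v in reached):
                    reached.update((u, v))
                    changed = True
        if len(reached) == num_vertices:
            yield subset

def all_msts(num_vertices, edges):
    """Return every spanning tree whose total weight is minimal."""
    trees = list(spanning_trees(num_vertices, edges))
    best = min(sum(w for w, _, _ in tree) for tree in trees)
    return [tree for tree in trees if sum(w for w, _, _ in tree) == best]

# Triangle 0-1-2 with weights 1, 2, 3 plus a pendant edge 2-3 of weight 4.
# The third shortest edge (weight 3) closes the triangle, so it is left out.
print(all_msts(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]))

# Triangle with all weights equal: every pair of edges is an MST.
print(len(all_msts(3, [(1, 0, 1), (1, 1, 2), (1, 0, 2)])))  # 3
```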

Finding MST: Prim's and Kruskal's Algorithms

These algorithms find MSTs on edge-weighted, connected, undirected graphs. Both are greedy: they select edges one by one and add them to a growing sub-graph. Despite being greedy, they actually find a globally optimal solution.

Prim grows a real tree
Kruskal grows a forest of trees that eventually merges into a single tree.

Prim's Algorithm

Take the vertices, and separate them into the in-set and the out-set. Initially, the in-set is empty and the out-set contains every vertex. Select the first innie arbitrarily (this is the root of the MST). Then, iteratively, choose the outie with the smallest distance from any innie, and move this vertex from the out-set to the in-set.

Implementation issue: how do we find that shortest edge? Use linear search or a priority queue (pq)?

Data Structures for Prim's

Three arrays.

For each vertex, record:

Has it been visited?
What is the minimal edge weight to it?
What vertex is its parent?

Initially, set the root's distance to 0 and every other vertex's distance to infinity. Then repeat until every node is visited:

1. Find the unvisited vertex v with the smallest tentative distance.
2. Set v as visited.
3. For each unvisited adjacent vertex w, test whether the recorded minimal distance of w is greater than the weight of the edge (v, w). If it is, set the distance of w to be the weight of that edge, and set the parent of w to be v.
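The three arrays and the three steps above can be sketched as follows. This is a minimal illustration, assuming the graph is connected and is given as an adjacency matrix in which `math.inf` marks a missing edge. The function name `prim_dense` and the choice of vertex 0 as the root are assumptions made here.

```python
import math

def prim_dense(weights):
    """Two-loop Prim's on an adjacency matrix; returns (parent, total_weight)."""
    n = len(weights)
    visited = [False] * n          # has the vertex joined the tree?
    min_weight = [math.inf] * n    # cheapest known edge from the tree to it
    parent = [None] * n            # tree endpoint of that cheapest edge
    min_weight[0] = 0              # vertex 0 is the arbitrary root

    for _ in range(n):
        # Step 1: linear search for the closest unvisited vertex.
        v = min((u for u in range(n) if not visited[u]),
                key=lambda u: min_weight[u])
        # Step 2: mark it visited.
        visited[v] = True
        # Step 3: update unvisited neighbours that are cheaper to reach via v.
        for w in range(n):
            if not visited[w] and weights[v][w] < min_weight[w]:
                min_weight[w] = weights[v][w]
                parent[w] = v

    return parent, sum(min_weight)

INF = math.inf
graph = [
    [INF, 1,   3,   INF],
    [1,   INF, 2,   5],
    [3,   2,   INF, 4],
    [INF, 5,   4,   INF],
]
print(prim_dense(graph))  # ([None, 0, 1, 2], 7)
```

Note that the comparison in step 3 uses the weight of the single edge (v, w), not the distance accumulated along a path. That is the difference from Dijkstra's shortest-path algorithm, which otherwise has the same shape.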

Make sure to watch the example in lecture 19 at around 25:00 to understand this. It is much easier to follow when seen than when read.

Prim's Complexity

The outer loop runs $|V|$ times, once per vertex. Finding the unvisited vertex with the lowest distance takes $O(|V|)$ with linear search, or $O(\log |V|)$ with a heap. When updating minimum distances and parents, you examine every edge a constant number of times over the whole run ($O(|E|)$ in total), and each update costs $O(1)$ with plain arrays or $O(\log |V|)$ with a heap.

With two loops and linear search, this would be $O(|V|^2)$ : $|V| \cdot |V| + |E| \cdot 1 \approx |V|^2$ , since $|E| \le |V|^2$ .

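With heaps, this would be $O(|E| \log |V|)$ : $|V| \log |V| + |E| \log |V| \approx |E| \log |V|$ , since a connected graph has $|E| \ge |V| - 1$ . This is a good bound for a sparse graph, where $|E| \ll |V|^2$ . In a dense graph, where $|E| \approx |V|^2$ , it becomes $O(|V|^2 \log |V|)$ .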

With Prim's, if someone gives you a dense graph, you want to use the simple two-loop approach for $O(|V|^2)$ . Otherwise, for a sparse graph, use heaps for $O(|E| \log |V|)$ .
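For the sparse case, a heap-based variant might look like the sketch below. It uses Python's `heapq`, which has no decrease-key operation, so this version pushes a new entry on every improvement and skips stale entries when they are popped (often called lazy deletion). That is a substitution made here, not necessarily the heap variant from the lecture. The heap may then hold up to $O(|E|)$ entries, which still gives $O(|E| \log |V|)$ because $\log |E| \le 2 \log |V|$. The name `prim_heap` and the adjacency-list layout are assumptions.

```python
import heapq

def prim_heap(adj, root=0):
    """Heap-based Prim's. adj maps each vertex to a list of (weight, neighbor)."""
    visited = set()
    mst_edges = []
    total = 0
    heap = [(0, root, root)]  # (edge weight, vertex, parent); root is its own parent
    while heap and len(visited) < len(adj):
        weight, v, parent = heapq.heappop(heap)
        if v in visited:
            continue  # stale entry: v was already reached by a cheaper edge
        visited.add(v)
        total += weight
        if parent != v:
            mst_edges.append((parent, v, weight))
        for edge_weight, w in adj[v]:
            if w not in visited:
                heapq.heappush(heap, (edge_weight, w, v))
    return mst_edges, total

adj = {
    0: [(1, 1), (3, 2)],
    1: [(1, 0), (2, 2), (5, 3)],
    2: [(3, 0), (2, 1), (4, 3)],
    3: [(5, 1), (4, 2)],
}
print(prim_heap(adj))  # ([(0, 1, 1), (1, 2, 2), (2, 3, 4)], 7)
```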

Kruskal's Algorithm

Kruskal's has the same conditions (edge-weighted, undirected, etc.), but you sort all the edges first. Sorting has complexity $O(|E| \log |E|) = O(|E| \log |V|)$ , since $|E| \le |V|^2$ means $\log |E| \le 2 \log |V|$ . This is only good for sparse graphs – don't use Kruskal's algorithm for a dense graph. Let me repeat that.

Don't use Kruskal's algorithm for a dense graph.

After sorting, try inserting the edges in order of increasing weight. An edge is discarded if it would create a cycle.

The first two edges added may be disjoint – remember that we are growing a forest, not a tree!

Sorting takes $O(|E| \log |V|)$ time. This is the bottleneck of the entire algorithm. The remaining work is a loop over the $|E|$ edges. Discarding an edge is a trivial $O(1)$ . Adding one is amortized $O(1)$ . Within the loop, most of the time is spent checking whether an edge would create a cycle. Good news: each check takes less than $\log |E| \approx \log |V|$ time!

The key idea here is that if two vertices are already connected, then a new edge between them would create a cycle. We only need to maintain disjoint sets of connected vertices!
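A minimal sketch of the whole loop, under the assumption that vertices are numbered from 0 and edges are `(weight, u, v)` tuples. To keep the disjoint-set idea visible, it stores one component label per vertex and relabels on every merge. That merge is deliberately naive ($O(|V|)$ each), so this version does not reach the complexity quoted above. The function name `kruskal` is a choice made here.

```python
def kruskal(num_vertices, edges):
    """Return (mst_edges, total_weight) for edges given as (weight, u, v)."""
    component = list(range(num_vertices))  # component[v] = label of v's tree
    mst_edges = []
    total = 0
    for weight, u, v in sorted(edges):     # increasing weight
        if component[u] == component[v]:
            continue  # u and v are already connected: this edge closes a cycle
        mst_edges.append((u, v, weight))
        total += weight
        old, new = component[v], component[u]
        # Naive merge: relabel every vertex in v's tree.
        component = [new if label == old else label for label in component]
    return mst_edges, total

edges = [(3, 0, 2), (1, 0, 1), (5, 1, 3), (2, 1, 2), (4, 2, 3)]
print(kruskal(4, edges))  # ([(0, 1, 1), (1, 2, 2), (2, 3, 4)], 7)
```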

Finding if adding an edge would create a cycle

You can actually use union-find to keep track of this! Every disjoint set has a unique representative. To tell if two vertices are in the same set, compare their representatives! This makes the cycle check fast.

Eventually, all of the $N$ disjoint sets (initially one per vertex) merge into one set (the MST).

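The way that this works is that each vertex stores a link to another vertex in its set, and following these links (like a linked list) ultimately leads to the representative of the group. This makes union() really fast once the two representatives are known: one of the representatives just needs to link to the other one. find(), however, becomes a little bit slower, because it has to walk the chain. union() cannot be faster than find(), since it has to call find() on both of its arguments. During these find() calls, you actually go over the chain TWICE: once to find the representative at the end, and a second time to point the rest of the nodes on the way directly at that representative. This makes later find() calls, and therefore later union() calls, faster.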

This makes the asymptotic complexity a bit complicated. It is:

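$O(N \alpha(N))$ for a sequence of $N$ operations, where $\alpha(N)$ (the inverse Ackermann function) grows very slowly. It is basically a constant, so in practice you get almost linear time performance. This bound assumes the two-pass find() is combined with union by rank or size. Details are taught in more advanced courses.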
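The representative scheme described in this section can be sketched as a small class. The class name `DisjointSet` and its method names are assumptions made here. For brevity, `union()` always links the second representative under the first and does not track rank or size.

```python
class DisjointSet:
    def __init__(self, n):
        # Every vertex starts as the representative of its own set.
        self.parent = list(range(n))

    def find(self, x):
        # First pass: walk up the chain to the representative.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the chain straight at it.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        """Merge the sets of a and b; return False if they were already one set."""
        rep_a, rep_b = self.find(a), self.find(b)
        if rep_a == rep_b:
            return False  # same representative: an edge a-b would close a cycle
        self.parent[rep_b] = rep_a
        return True

ds = DisjointSet(4)
print(ds.union(0, 1))            # True
print(ds.union(1, 2))            # True
print(ds.union(0, 2))            # False, 0 and 2 are already connected
print(ds.find(3) == ds.find(0))  # False
```

In Kruskal's loop, an edge (u, v) would be kept exactly when `union(u, v)` returns `True`.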

Summary

For dense graphs, use two-loop Prim's. For sparse graphs, use Kruskal's algorithm.