6.S078 Lecture 12 (3/21): Linear Decision Trees, Computing APSP
========================================================

In this lecture, we consider functions of the form f : R^m -> {0,1}^{m'}, where m, m' are positive integers.

APSP: Function with m = n^2, where n = number of nodes. Think of the output as providing, for all i,j in [n], some k minimizing A[i,k] + B[k,j]. This is O(n^2 log n) bits.

k-SUM: Function with m = n, m' = O(k log n), where n = size of the number list. The output is a k-tuple of the input that sums to 0.

So far, the only computational model we have seen for solving such problems is the Real RAM, which lets us do additions, subtractions, and comparisons with contents of real-valued registers, and word tricks in "normal" word registers.

Today we will study a *non-uniform* model that captures what the Real RAM can do, and show there are surprisingly fast algorithms for 3-SUM and APSP in this model. In fact, these fast non-uniform algorithms are what led to the first speed-ups in the Real RAM for both of these problems. Studying these algorithms more closely may help us understand the extent to which the APSP and 3-SUM conjectures are actually true.

Model: LINEAR DECISION TREES (LDT)

Definition: An LDT for a function f : R^m -> {0,1}^{m'} of width L is a binary tree where each inner node is associated with some inequality

  [alpha_{i_1} x_{i_1} + ... + alpha_{i_L} x_{i_L} >= t]

where the x_{i_j}'s are variables from the set {x_1,...,x_m}, each alpha_{i_j} is in {-1,0,1}, and t is in R. Each leaf of the LDT contains a string from {0,1}^{m'}.

We think of L as a constant. That way, in the Real RAM, each node inequality could be "computed" in O(1) time, provided the x_i's and t are sitting in real-valued registers.

[Note: If the alpha_{i_j} were arbitrary reals, then the functions at the nodes would be linear threshold functions (LTFs), which are also studied in neural networks and circuit complexity.
It's not clear whether extending the domain of the coefficients gives any new power.]

Computation of an input \vec{x} = (x_1,...,x_m) on an LDT works as follows:
- Start at the root of the LDT.
- If the inequality at the current node is true of \vec{x}, move to the left child; else move to the right child.
- Once a leaf is reached, output its string.

This is a non-uniform model of computation!
- For every integer m, the LDT for R^m may be very different.
- There is *no* computational cost for the *construction* of the LDT. It could be that the leaves/LTFs take a long time to determine computationally from the previous inequalities that have been checked.

We measure the "running time" of an LDT by its *depth*, the maximum length of a path from the root to a leaf, because that's the worst-case number of inequalities that need to be computed, and each inequality would take O(L) time in the Real RAM (if we knew the parameters).

Sanity Check: Sorting real numbers is a function SORT : R^n -> {0,1}^{O(n log n)}, where the output is a bit string encoding a permutation of [n].

Fact: Sorting has an LDT of width 2 and depth O~(n). In fact, all inequalities of this LDT have the form x_i - x_j <= 0, comparing two inputs x_i and x_j.

We'll critically use this Fact in our APSP and 3-SUM algorithms.

***

APSP

The first surprising algorithm we'll cover is for APSP:

Theorem [Fredman'75]: APSP has an LDT of width 4 and depth O~(n^{2.5}).

Recall we showed: (min,+) MM on n x d x n in T(n,d) time ==> (min,+) MM on n x n x n in O((n/d)(T(n,d) + n^2)) time, by simply splitting the n x n x n MM into n/d separate n x d x n MMs, computing each MM to get matrices of size n^2, then taking the component-wise minimum of the n/d matrices.

This reduction also works when we replace "time" with "depth in LDTs": beyond the n x d x n MMs, all our reduction does is compute mins of numbers by comparisons, which we can do with an LDT.

Claim: (min,+) MM on n x d x n has an LDT of width 4 and depth O~(n d^2).
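As an aside, the blocking reduction above can be sketched uniformly in Python. Here min_plus_rect is just the naive O(n^2 d) loop, standing in for whatever n x d x n routine we plug in; all names are ours, not from the lecture.

```python
INF = float("inf")

def min_plus_rect(A, B):
    """Naive (min,+) product of an n x d matrix A with a d x n matrix B."""
    n, d = len(A), len(A[0])
    return [[min(A[i][k] + B[k][j] for k in range(d)) for j in range(n)]
            for i in range(n)]

def min_plus_square(A, B, d):
    """(min,+) product of two n x n matrices, split into n/d products of
    shape n x d x n, followed by an entrywise minimum of the partial results."""
    n = len(A)
    C = [[INF] * n for _ in range(n)]
    for s in range(0, n, d):
        # Columns s..s+d-1 of A paired with rows s..s+d-1 of B.
        A_blk = [row[s:s + d] for row in A]
        B_blk = B[s:s + d]
        P = min_plus_rect(A_blk, B_blk)
        for i in range(n):
            for j in range(n):
                C[i][j] = min(C[i][j], P[i][j])
    return C
```

The Claim above says that, in the LDT model, min_plus_rect can be replaced by something of depth only O~(n d^2) rather than O(n^2 d).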
When d << n, this is very surprising: we can infer an output of n^2 real numbers with only O~(n d^2) additions and comparisons!

Assuming the Claim, we then get an LDT of depth O~((n/d)(n d^2 + n^2)) <= O~(n^2 d + n^3/d). Set d = sqrt{n} to get the theorem.

Let's now prove the Claim. We are given A in R^{n x d} and B in R^{d x n} and want the n x n matrix

  C(i,j) = min_{k=1,...,d} (A[i,k] + B[k,j]).

Let's use the notation [n] := {1,...,n}. Our task is therefore:

(*) For all i,j in [n], find k* in [d] such that for all k in [d],
      A[i,k*] + B[k*,j] <= A[i,k] + B[k,j].

These k* will tell us the minimum A[i,k] + B[k,j]. We will use the astounding fact that the above is equivalent to finding k* such that for all k,

      A[i,k*] - A[i,k] <= B[k,j] - B[k*,j].

Here's a high-level description of the LDT:

0. L := empty list.
1. For all i in [n], (k,k') in [d]^2:
     Append a_{i,k,k'} = A[i,k'] - A[i,k] to L.
     Append b_{i,k,k'} = B[k,i] - B[k',i] to L.
2. Sort L in O~(n d^2) depth with a width-4 LDT. (All nodes have inequalities of the form "difference of two inputs <= difference of two inputs".)
3. For all (i,j) in [n]^2, pick out the sublist of just the a_{i,k,k'}'s and b_{j,k,k'}'s. From the sorted order we can determine the smallest k* such that for all k, a_{i,k,k*} <= b_{j,k,k*}. This determines the entry C[i,j]. The sorted order DETERMINES the output!

Note: no additions/comparisons are needed for step 3. This completes the LDT.

How do we get a Real RAM algorithm? The "only" step we don't know how to do in the Real RAM (uniform) model is step 3: we don't know how to *quickly* determine the output matrix C from the sorted L of size O(n d^2). However, we can give a Real RAM algorithm that "shaves" some log factors.

Thm [Fredman'75]: APSP is in O(n^3/log^{1/6} n) time in the Real RAM.

(What I'm about to say is not at all optimized; it's just meant to be simple and illustrative.)

Proof: Build a lookup table for a small LDT.
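(Before the details, here is a small uniform Python sketch of the Fredman trick that both the Claim and this lookup-table proof rely on. It is illustrative only: every comparison it makes on the real inputs has the width-4 "difference <= difference" form, but it still makes O(n^2 d) of them, whereas the LDT answers all of them at once by sorting the O(n d^2) differences.)

```python
def min_plus_via_differences(A, B):
    """(min,+) product of an n x d matrix A with a d x n matrix B, where every
    comparison on the inputs has the width-4 form
        A[i][k'] - A[i][k]  vs.  B[k][j] - B[k'][j]
    rather than comparing sums directly."""
    n, d = len(A), len(A[0])
    C = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            k_star = 0
            for k in range(1, d):
                # k beats k_star iff A[i][k] + B[k][j] < A[i][k_star] + B[k_star][j],
                # equivalently A[i][k_star] - A[i][k] > B[k][j] - B[k_star][j].
                if A[i][k_star] - A[i][k] > B[k][j] - B[k_star][j]:
                    k_star = k
            C[i][j] = A[i][k_star] + B[k_star][j]
    return C
```
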
In time 2^{O~(a^{2.5})} * poly(a), we can build a width-4 LDT for (min,+) MM on a x a matrices, by mimicking the above algorithm at every step. (Given e.g. O(a^4) time, we could easily infer the output from the sorted order...) Note that we can store this LDT in "normal" words: we do not need reals to store it, as the coefficients are in {-1,0,1} and the threshold value t is 0. We can store each node of the lookup table in a separate word.

Then, given two a x a matrices with real-valued entries, we can determine their (min,+) MM in O~(a^{2.5}) time, by following the LDT down a path and computing the 4 additions/comparisons at each node.

Set a = (log n)^{1/3}. Then it takes n^{o(1)} time to build the LDT.

Next, given n x a and a x n matrices, we can "block" them into n/a matrices of size a x a each. Using our LDT on pairs of blocks, we get O(n^2/a^2 * a^{2.5}) <= O(n^2 * a^{1/2}) time.

Finally, to compute an n x n (min,+) MM, we can partition it into n/a MMs of size n x a x n. Applying our n x a x n algorithm, we get time O(n/a * n^2 * a^{1/2}) <= O(n^3/log^{1/6} n). QED

***

3-SUM

Theorem [GP'14]: 3-SUM has an LDT of width 4 and depth O~(n^{1.5}).

The most amazing thing about this theorem is that it took almost 40 years to prove after Fredman... once you see it, you'll know what I mean. I believe the only reason it stayed open was that people believed it was not possible, thanks to:

Theorem [Erickson'99]: Every LDT of width 3 for 3-SUM has depth Omega(n^2).

Since 3-SUM is about comparing triples of numbers anyway, people didn't suspect that width 4 would help...

Our starting point is the 3-SUM algorithm we gave that runs in O(n^2) time. We'll discuss this next time.
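For reference, here is a sketch of the standard quadratic 3-SUM routine (sort once, then scan with two pointers for each anchor element) — presumably the O(n^2) algorithm referred to above; the exact variant from the earlier lecture may differ.

```python
def three_sum(nums):
    """Return a triple of values (at distinct positions) summing to 0, or None."""
    a = sorted(nums)
    n = len(a)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            s = a[i] + a[lo] + a[hi]
            if s == 0:
                return (a[i], a[lo], a[hi])
            elif s < 0:
                lo += 1   # sum too small: need a larger middle element
            else:
                hi -= 1   # sum too large: need a smaller top element
    return None
```

The sort costs O(n log n) and each of the n anchors triggers one O(n) two-pointer scan, for O(n^2) total.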