CS294-152 Lower Bounds 10/8/18
=== Restricted Circuit Lower Bounds

P/poly: Decision problems with a poly-sized circuit family {C_n} over AND, OR, NOT.

The class AC0 ("Alternating Circuits") captures problems with extremely low-depth circuits:

AC0: Decision problems with a poly-sized circuit family over AND, OR, NOT, where every circuit has depth <= constant d (independent of n) and every gate has unbounded fan-in, i.e. each AND/OR gate can take in arbitrarily many inputs. Note: we do not count NOT gates in the depth or the size.

AC0 can do interesting things! Addition of two numbers (even the addition of O(log n) n-bit numbers, see Vollmer's "Introduction to Circuit Complexity"; but not O(n) n-bit numbers, as that would compute PARITY!), comparison of two numbers, maximum of a set of numbers, ...

Thm [Ajtai, Furst, Saxe, Sipser 80s]: PARITY is not in AC0. So counting the number of 1s modulo 2 is not in AC0.

Thm [Hastad 86]: For all d there is an eps > 0 such that PARITY does not have AC0 circuits of depth d and size 2^{eps*n^{1/(d-1)}}. This is optimal up to the leading constant in the exponent!

First, let's prove the theorem for d=2.

Thm: The depth-2 AC0 circuit complexity of PARITY (and its negation) is 2^{n-1}.

Proof: For d=2, AC0 circuits are DNFs and CNFs. Note that PARITY outputs 1 on exactly 2^{n-1} inputs. (There are exactly 2^{n-1} n-bit strings with odd Hamming weight, and 2^{n-1} with even Hamming weight.)

[[Could prove this by induction: it's true for n=1 and n=2. Inductively, there are 2^{n-2} (n-1)-bit strings with odd parity, and 2^{n-2} with even parity. The n-bit strings of odd parity either start with a 1 and have even parity on the remaining (n-1) bits, or start with a 0 and have odd parity on the remaining (n-1) bits. This makes 2^{n-2} + 2^{n-2} = 2^{n-1} strings of odd parity.]]

There is a DNF and a CNF of size 2^{n-1} for computing PARITY on n variables. E.g. for the DNF: PARITY outputs 1 on exactly 2^{n-1} inputs.
Include an AND on n literals for each of those inputs. Take their OR.

Now for the lower bound. It follows from:

Claim: For every DNF for PARITY_n, every AND in the DNF contains *all* n input variables, and the DNF has >= 2^{n-1} ANDs. In other words, the "bottom fan-in" of the DNF must be n, and the size must be >= 2^{n-1}.

Pf: Suppose some AND gate g in the DNF leaves out a variable x_i. Then there are two inputs, one with x_i = 0 and one with x_i = 1 (but all other bits the same!) such that g (and therefore the whole DNF) outputs 1 on both inputs. This DNF cannot compute PARITY!

Since every AND in the DNF must contain all n variables, each AND gate can make only *one* n-bit input output 1. There are 2^{n-1} total inputs which make PARITY output 1. So there must be >= 2^{n-1} AND gates. An analogous argument shows every CNF for PARITY must have size >= 2^{n-1}, where each OR contains all n input variables. QED

Now we show an upper bound for all depths d:

Thm: For all d, PARITY (and its negation) has AC0 circuits of depth d and size 2^{O(n^{1/(d-1)})}.

Pf: We already covered d=2. Now we handle larger d. PARITY on n variables can be written as a complete n^{1/(d-1)}-ary tree of depth d-1, where each node is a PARITY on <= n^{1/(d-1)} variables. (We are just writing a sum modulo 2 as a sum of sums of sums of ... of sums modulo 2.)

Replace the root of the tree with a DNF of size 2^{n^{1/(d-1)}-1} computing PARITY on n^{1/(d-1)} variables. Each input to this DNF is some PARITY subtree, which is either negated or not in the DNF. On odd-numbered levels below the root, replace the PARITY on n^{1/(d-1)} inputs with CNFs of PARITY or NOT-PARITY (depending on whether it's non-negated or negated in the circuit above it). On even-numbered levels, replace the PARITY on n^{1/(d-1)} inputs with DNFs of PARITY or NOT-PARITY.
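Before the layers get merged, the tree-of-parities decomposition can be sanity-checked with a toy evaluator. This is only a sketch: `tree_parity` is a name invented here, and it checks the function being computed, not the circuit sizes.

```python
from functools import reduce
from itertools import product

def parity(bits):
    # XOR of all bits (sum modulo 2)
    return reduce(lambda a, b: a ^ b, bits, 0)

def tree_parity(bits, arity):
    """Evaluate PARITY as a balanced tree whose nodes are PARITYs of
    at most `arity` inputs (arity ~ n^(1/(d-1)) in the construction)."""
    if len(bits) <= arity:
        return parity(bits)
    # group the inputs into blocks, take the PARITY of each block, recurse
    blocks = [bits[j:j + arity] for j in range(0, len(bits), arity)]
    return tree_parity([parity(b) for b in blocks], arity)

# n = 8 and arity b = 2 corresponds to d = 4 (a tree of PARITYs of depth d-1 = 3)
for x in product([0, 1], repeat=8):
    assert tree_parity(list(x), 2) == parity(list(x))
```

The assertion confirms that replacing one big PARITY by a depth-(d-1) tree of small PARITYs computes the same function, which is all the construction above needs before the DNF/CNF substitutions.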
Then, you can merge the ORs at the bottom of the CNFs with the ORs at the top of the DNFs, and the ANDs at the bottom of the DNFs with the ANDs at the top of the CNFs. When you are done, you have an AC0 circuit of 2^{O(n^{1/(d-1)})} size and depth only d. (One more level is added to the depth, because of the bottom layer.) QED

*** Random Restrictions.

The aforementioned lower bounds all use the method of *Random* Restrictions. Imagine the following experiment that we perform on the PARITY function, and separately on an AC0 circuit: Pick an input variable x_i at random, and set all other inputs to independent random 0/1 values.

- What happens to the PARITY function? PARITY simplifies to either f(x_i) = x_i with probability 1/2, or f(x_i) = NOT(x_i) with probability 1/2; it depends on the parity of the other random values.

- What happens to a "small" AC0 circuit C? A/FSS/H show: C simplifies to a CONSTANT (f(x_i) = 0 or f(x_i) = 1) with high probability! (Their results differ in how "small" the circuit needs to be, in order to simplify.)

- Conclusion: Small AC0 circuits cannot compute the PARITY function.

*** AC0 Normal Form.

A very useful property of AC0 is that every AC0 circuit has a convenient and structured normal form. In lower bound arguments, we'll assume our circuits have this form.

Normal Form Thm: For every AC0 circuit C of depth d >= 2 and size s, there is an equivalent C' of depth d and size at most 2ds + n + 2s such that:
(1) C' has d+1 "layers": layer 1 := the n inputs along with their negations, and for i = 2,...,d+1, all gates in layer i only receive input from layer i-1.
(2) *All* NOT gates are in layer 1.
(3) Above layer 1, all gates in the same layer have the same gate type (AND or OR), and the gate type alternates between consecutive layers.

Proof: For every gate g of C, we will make another gate g' which computes the negation of g.

Layer 1: for each input x_i, make a gate x'_i which computes NOT(x_i); this is n gates.
For other layers: Suppose the gates are topologically ordered g_1,...,g_s. For i = 1,...,s-1, we do the following (we do not do it to the output, g_s).

Suppose inductively that for gate g_i, its inputs are the gates g_{i_1},...,g_{i_k} with i_j < i, and we already introduced gates g'_{i_1},...,g'_{i_k} computing their negations, respectively. If g_i = AND(g_{i_1},...,g_{i_k}), make the gate g'_i = OR(g'_{i_1},...,g'_{i_k}). If g_i = OR(g_{i_1},...,g_{i_k}), make the gate g'_i = AND(g'_{i_1},...,g'_{i_k}). We have g_i = NOT(g'_i), by DeMorgan's laws.

Now, there are only n NOT gates in the circuit, in layer 1. Thus (2) is enforced. We have now added s new gates, so we have 2s gates. Furthermore, note that (not counting the NOT gates) the longest path in C from any input to g_s (the output) is *still* d. So at this point the depth of C is still d.

To enforce (1) and (3), we use associativity of AND and OR to "group" together ANDs of ANDs and ORs of ORs. This ensures that
- the inputs of each AND gate are either OR gates or inputs/NOTs from the first layer, and
- the inputs of each OR gate are either AND gates or inputs/NOTs from the first layer.
Note that this transformation cannot increase the depth.

Finally, for wires in C that "skip over" k layers, we can replace them with a string of k AND (or OR) gates, each of fan-in 1. Since k <= d, and there are 2s gates, this adds at most 2sd gates to C. QED

**** Sketch of PARITY lower bound theorem. (Later we will see a complete proof.)

(1) We know the PARITY lower bound for d = 2: PARITY does not have a DNF or CNF of size < 2^{n-1}.

(2) For d >= 2: Suppose PARITY has a depth-d circuit C of size s.
(a) Randomly assign a huge fraction of inputs to C (but leave a decent number unassigned).
(b) Show that whp, the remaining circuit is equivalent to a circuit C' of depth d-1. That is, we show the depth can be "shrunk" by one, after a random restriction.
(c) Repeat (a)-(b) for d-2 times, so you get a depth-2 circuit D where some k variables still remain. If s is "too small", then for some k > 0, we obtain a D for PARITY on k vars of size < 2^{k-1}. This contradicts (1).

**** How to Shrink Depth.

The key tool is Hastad's Switching Lemma.

Def. A random restriction rho to v variables of f is the following random process: Given a function f, pick a subset of v variables at random, and set them to independent random 0/1 values. Let f|_{rho} be the remaining function on n-v variables.

Recall a k-CNF is an AND of ORs of literals, where each OR has fan-in <= k. A k-DNF is defined analogously, with an OR of ANDs.

Switching Lemma: Let p in (0,1/5). Suppose F is a k-CNF or k-DNF in n variables. Pick a r.r. rho to (1-p)*n variables of F. For all t,

Pr_{rho}[F|_{rho} is equivalent to a t-DNF *and* a t-CNF] > 1 - (10pk)^t.

That is, after a random restriction to a random subset of the variables, we can "switch" a k-CNF to a t-DNF, and a k-DNF to a t-CNF, with high probability.

Example: Think of p = 1/(20k). Then the Lemma says: if you randomly set all but n/(20k) of the variables in a k-CNF formula, the remaining formula on n/(20k) variables is equivalent to a t-DNF formula, with probability at least 1 - 1/2^t. So, the remaining formula has very high probability of being equivalent to a 100-DNF (for example).

Why would this be useful? Suppose I have an AND of ORs of ANDs of fan-in k, a depth-3 AC0 circuit. We can think of this as an AND of k-DNFs. Suppose I apply the Switching Lemma to the k-DNFs, and can convert all of them to t-CNFs. This means my depth-3 AC0 circuit is equivalent to an AND of ANDs of ORs of fan-in t... which is just a t-CNF itself! I have "shrunk" the depth by 1.

*** How Switching Implies the PARITY Lower Bound.

Given an AC0 circuit C of size s in "normal form":

1. Make its bottom ANDs have small fan-in.
--- Set k = 1, p = 1/20, and t = log(s). Consider the lowest layer of gates above the inputs and the NOTs.
WLOG suppose they are AND gates. We can think of each AND gate as a "1-CNF". Switching Lemma => if we set all but p*n = n/20 of the inputs to random 0/1 values, then each AND gate on the lowest layer is equivalent to a t-DNF, with probability > 1 - (10pk)^t >= 1 - (10/20)^{log(s)} = 1 - 1/s.

By the union bound, there is some assignment to all but n/20 inputs of C such that each AND gate on the lowest layer can be switched to a DNF of bottom fan-in t <= log(s). Since the new OR gates can merge into the OR layer above, this is still a depth-d circuit! Call this circuit C_0. We went from C with n inputs, depth d, size s, to C_0 with n/20 inputs, depth d, bottom ANDs of fan-in <= log(s), with <= s gates above the ANDs.

2. Switch each OR of ANDs to an AND of ORs, and vice-versa. Do this d-2 times (until you get a DNF or CNF).
--- Now set k = log(s), p = 1/(20 log(s)), and t = log(s). Switching Lemma => if we set all but n/(20^2 log(s)) of the n/20 inputs to C_0 randomly, then each OR of ANDs of fan-in k (a k-DNF) in C_0 is equivalent to a t-CNF, with probability > 1 - (10pk)^t >= 1 - 1/s.

Again by the union bound, there is an assignment to all but n/(20^2 log(s)) inputs of C_0 such that each k-DNF of C_0 can be switched into an AND of ORs: a t-CNF. This switching reduces the depth by 1, by merging the new AND gates with the AND layer above them. We can do an analogous thing for k-CNFs at the bottom, switching them into t-DNFs.

Repeat step 2 for d-2 times. Each time, the number of remaining inputs is multiplied by the factor p = 1/(20 log(s)). Note that for each gate g, the switching lemma is applied to it at most once. Then, the resulting "switched" circuit g' is merged in with the gates above it in the circuit.

=> We get a depth-2 circuit for PARITY on n/(20 * (20 log(s))^{d-2}) inputs, with "bottom fan-in" log(s). By our theorem on depth-2 PARITY circuits, the bottom fan-in of any depth-2 PARITY circuit must be at least the number of inputs. So we must have log(s) >= n/(20 * (20 log(s))^{d-2}), i.e., (20 log(s))^{d-1} >= n. By simple manipulation, s >= 2^{n^{1/(d-1)}/20}.
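The "simple manipulation" can be double-checked numerically. A throwaway sketch (`lower_bound_ok` and `closed_form` are names invented here; this just confirms that the bottom fan-in inequality and the closed form for s agree):

```python
import math

def lower_bound_ok(s, n, d):
    """Check the fan-in inequality log2(s) >= n / (20 * (20*log2(s))**(d-2))."""
    L = math.log2(s)
    return L >= n / (20 * (20 * L) ** (d - 2))

def closed_form(n, d):
    """The claimed closed form: s >= 2^(n^(1/(d-1)) / 20)."""
    return 2 ** (n ** (1 / (d - 1)) / 20)

# s = closed_form(n, d) satisfies the inequality (with equality),
# while anything noticeably smaller does not
n, d = 10**6, 3
s = closed_form(n, d)
assert lower_bound_ok(s, n, d)
assert not lower_bound_ok(s / 2, n, d)
```

For n = 10^6 and d = 3 this gives s = 2^50: raising both sides of log(s) >= n/(20*(20 log(s))^{d-2}) to clear denominators yields (20 log(s))^{d-1} >= n, which is exactly log(s) >= n^{1/(d-1)}/20.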
This completes the lower bound for PARITY. (Note: The "20" factor can definitely be made smaller, using other Switching Lemmas. I do not know if it is known to be "(1+eps)" for all eps. The factor is 1+o(1) for d=2. For d=3, it was shown that the factor is 1+o(1), in Paturi-Pudlak-Zane "Satisfiability Coding Lemma".)

======= Proof of the Switching Lemma

So far we have shown that PARITY is not in AC0, except we left the Switching Lemma unproved. The proof we will present is a really cool/funky/weird proof by Razborov. It's a very clever counting argument. In general, the switching lemma and its proof remain very important in the modern study of circuit complexity. There are still many questions being answered about AC0, using new switching lemmas! See the work of Ben Rossman.

We'll show a version with similar parameters:

Switching Lemma': For k >= 1 and p in (0,1/2), for every k-DNF F on n variables and for all t: pick a r.r. rho to (1-p)*n variables of F. Then

Pr[F|_{rho} = some t-CNF] > 1 - (8pk/(1-p))^t.

Note that this implies our original switching lemma statement, which said that for p in (0,1/5), the probability is > 1 - (10pk)^t. In particular, when p < 1/5 we have 1/(1-p) < 5/4, and thus 8/(1-p) < 10, so the bound here is even stronger.

For simplicity of notation, let v = (1-p)*n. Then, the total number of restrictions to v variables is {n choose v}*2^v. (Why? There are {n choose v} ways to pick a subset of v variables, and 2^v ways to write down a bit string corresponding to an assignment to them.) Note that since p < 1/2, we have v > n/2; and once v is large enough, the *larger* v gets, the *smaller* the number of restrictions will be: the ratio {n choose (v+1)}*2^{v+1} / ({n choose v}*2^v) = 2(n-v)/(v+1) drops below 1 once v > 2n/3 or so. In other words, {n choose (v+t)}*2^{v+t} < {n choose v}*2^v for sufficiently large v = alpha*n.

We will show:

(# rho s.t. F|_{rho} != some t-CNF) <= (# restrictions to (v+t) variables) * (4k)^t    (*)

Note the RHS of (*) is {n choose (v+t)} * 2^{v+t} * (4k)^t. Assuming (*) is true, the proof of the lemma follows from some calculations:

Pr[F|_{rho} != some t-CNF]
  = (# rho s.t. F|_{rho} != some t-CNF) / (# restrictions on v vars)
 <= [{n choose (v+t)} * 2^{v+t} * (4k)^t] / [{n choose v} * 2^v]     by (*)
  = [{n choose (v+t)} / {n choose v}] * (8k)^t                       since 2^{v+t}/2^v = 2^t, and 2^t * (4k)^t = (8k)^t
  = [{n choose ((1-p)n + t)} / {n choose (1-p)n}] * (8k)^t           substituting v = (1-p)n
  = [n! / ((n-pn+t)!(pn-t)!)] * [(n-pn)!(pn)! / n!] * (8k)^t         def of binomial coefficients
  = [(n-pn)!(pn)!] / [(n-pn+t)!(pn-t)!] * (8k)^t
  = [(pn)*(pn-1)*...*(pn-t+1)] / [(n-pn+t)*...*(n-pn+1)] * (8k)^t
 <= ((pn)/(n-pn))^t * (8k)^t
  = (p/(1-p))^t * (8k)^t
  = (8pk/(1-p))^t.

Now we turn to proving the inequality (*).

*** Proof Strategy for (*):

Given a fixed k-DNF F, find a one-to-one mapping from the set A of things counted on the LHS of (*), into the set B of things counted on the RHS. In particular, we show how to uniquely encode a "bad" restriction rho from the LHS as a restriction on (v+t) variables + an "advice" string of t*log(k)+2t bits. The number of the latter is {n choose (v+t)} * 2^{v+t} * (4k)^t, which is the RHS of (*). So then we'll have |A| <= |B|.

We need to study the set A of restrictions rho on v vars which fail to turn the k-DNF F into an equivalent t-CNF, and show how to encode them as restrictions on v+t vars + advice. First, we need to deduce some structural properties of functions which are *not* t-CNFs.

Def. For F: {0,1}^n -> {0,1}, a max-term of size s is a restriction rho on s vars s.t. F|_{rho} = 0 and s is *minimal*. Here "minimal" means that, if we "unassign" any of the s variables in rho, the restricted F is no longer identically 0.

Here is the crucial property about "not being a t-CNF" that we will use.

Claim: (F != some t-CNF) => there is a max-term of F of size >= t+1.

Proof: By contrapositive. If all max-terms for F are of size <= t, then F is equivalent to a t-CNF: it's equivalent to an AND over all max-terms of size <= t, where each OR on <= t vars encodes the (<= t)-variable partial assignment of some max-term.
(OR outputs 1 <=> the input does not agree with the max-term.) That is, our formula is checking whether, for all max-terms rho, the input assignment does not agree with rho.

- If for some rho, the input assignment agrees with rho, then F should output zero (because F|_{rho} = 0).
- For the other direction, suppose F outputs zero on an input assignment a. Then there is some partial assignment agreeing with a which is a max-term for F (take a minimal partial assignment of a that already forces F to 0). The corresponding OR for that max-term will output 0, so the whole formula will output 0. QED

Restating the claim in terms of the definition of max-term:

Claim (Restated): If F != some t-CNF, there is a partial assignment pi on >= t+1 variables of F such that:
- F|_{pi} = 0, i.e., for every AND-term of F, pi sets at least one literal in the AND-term to false.
- F|_{pi'} != 0, for every "sub-restriction" pi' on <= t of the variables set in pi.

Let F be a k-DNF on the AND-terms C_1,...,C_m, which we call "clauses" in the below. Let rho be a "bad" restriction to v variables such that F|_{rho} is *not* equivalent to a t-CNF. By the Claim, F|_{rho} has a max-term pi on >= t+1 vars. Let rho*pi be the combined restriction on >= v+t+1 variables. Then:
* F|_{rho*pi} = 0, i.e., for every clause C_i, at least one literal is set false by rho*pi.
* F|_{rho*pi'} != 0, for all "sub-restrictions" pi' which set <= t of the variables set in pi.

Using F, we will define a 1-1 mapping that maps the bad restriction rho to (rho*sigma, c), where sigma is an assignment to some t other variables that are set in pi, and c is O(t log k) bits of "advice" that let us "decode" the original restriction rho from rho*sigma. (Note: if we are only given rho*sigma, we just see a subset of v+t variables, followed by an assignment to them under the lex ordering on the variables; we can't tell which variables were set by rho, and which were set by sigma. That's precisely what the advice c will tell us!)
Then, rho*sigma will assign v+t variables, and we'll have a one-to-one mapping from A to B, because for every (rho*sigma, c) we can uniquely identify the original rho. Our sigma and advice c will use the k-DNF F in a crucial way.

*** Construction of sigma:

Let pi_0 be an "empty" restriction that sets no variables. For i = 1,2,...
  Find the first clause C_{j(i)} not set to 0 by pi' = rho*pi_0*...*pi_{i-1}, where pi_0,...,pi_{i-1} set <= t variables in total. (Since no sub-restriction pi' of rho*pi on <= t of pi's variables sets *all* clauses to 0, such a C_{j(i)} must exist!)
  Let X_i be the set of vars in C_{j(i)} set by rho*pi, but not by pi'. (Since rho*pi sets every clause to zero, X_i is not empty!)
  Let pi_i : X_i -> {0,1} be the map that sets all vars in X_i according to rho*pi.
  Let sigma_i : X_i -> {0,1} set all vars in X_i in such a way that C_{j(i)} is *not* zero: sigma_i satisfies all literals in C_{j(i)} whose variables appear in X_i.
Repeat until X_1,...,X_i contain exactly t variables. (If the last clause would push the count past t, "undo" some of those assignments so it's exactly t.)

Example of sigma_i: Suppose our clause is C_{j(i)} = (x1 AND x2 AND NOT(x3) AND x4) and X_i = {x3,x4} are the vars set in rho*pi but not in pi'. Then pi_i sets x3 = 1, or x4 = 0, or both (because rho*pi zeroes out the clause), but sigma_i maps x3 to 0 and x4 to 1.

Our mapping will send rho to (rho*sigma_1*...*sigma_i, c), for some side information c. (Note that by our construction, sigma_1*...*sigma_i sets exactly t variables.)

*** Construction of advice c:

To figure out a good c, let's see what is needed in order to determine rho from rho*sigma_1*...*sigma_i. If we can determine the sets X_1,...,X_i, then we can determine rho: rho is the assignment to those vars set in rho*sigma_1*...*sigma_i that do *not* appear in any X_j.

Suppose we plug rho*sigma_1*...*sigma_i into F. Recall C_{j(1)} is the first clause not set to 0 by rho*pi_0 = rho. A key observation is:

Claim: C_{j(1)} is also the first clause not set to 0 by rho*sigma_1*...*sigma_i.
Why?
* Restriction rho does not set C_{j(1)} to 0 (by def of C_{j(1)}),
* sigma_1 is *defined* not to set C_{j(1)} to 0, and
* no other sigma_ell assigns variables in C_{j(1)} (for ell = 2,...,i).

Why is the last bullet true? Well, every sigma_ell for ell = 1,...,i assigns vars from some set X_ell, which are assigned in rho*pi. But X_1 (the variables assigned by sigma_1) is the set of *all* variables in C_{j(1)} that are assigned in rho*pi but not by rho. So no other sigma_ell will assign variables in C_{j(1)}. (This is practically by definition.)

Thus we can determine C_{j(1)} from rho*sigma_1*...*sigma_i, by plugging in the partial assignment and finding the first clause not set to 0. Our side information c will tell us how to encode pi_1, given that we know C_{j(1)}.

Key observation: We need only O(|X_1|*log(k)) bits to encode pi_1, because C_{j(1)} has <= k variables! We can encode pi_1 by encoding:
- The indices of the vars in C_{j(1)} set by pi_1, using log_2(k) + 1 bits per variable in X_1. As there are <= k variables in C_{j(1)}, we can use log_2(k) bits to encode a variable's index within the clause. Since the decoder doesn't know |X_1| (but does know k), we can use one extra bit per index to encode a "stop" character, to signal that our encoding of X_1 has reached the end.
- The values assigned to these vars by pi_1, using |X_1| extra bits.

Hence we can use |X_1|*(log_2(k) + 1) + |X_1| = |X_1|*(log_2(k) + 2) bits of advice to determine pi_1 from rho*sigma_1*...*sigma_i.

Suppose we use this advice to determine pi_1 and then plug rho*pi_1*sigma_2*...*sigma_i into F. This will zero out the clause C_{j(1)}. Now:

Claim: C_{j(2)} = the first clause not set to zero by rho*pi_1*sigma_2*...*sigma_i.

As in the previous case, C_{j(2)} is the first clause not set to zero by rho*pi_1. Using O(|X_2|*log(k)) bits of side information, where X_2 is the set of variables assigned by pi_2, we can reconstruct the assignment pi_2 from rho*pi_1*sigma_2*...*sigma_i.
Continuing like this for all j = 1,...,i, we can compute the partial assignment rho*pi_1*...*pi_i, and thus determine exactly which variables X_j are set by the pi_j's, for all j = 1,...,i. Once we know that, we know the remaining variables are set by the original restriction rho. For each pi_j, we use |X_j|*(log(k)+2) bits of side info. The total amount of side information we need is at most

sum_{j=1}^i |X_j|*(log_2(k) + 2) <= t*log_2(k) + 2t,

since X_1,...,X_i together contain exactly t variables. This gives us the k^t * 4^t = (4k)^t term. Thus, using this c, we get a one-to-one mapping from rho's to partial assignments on t+v variables with O(t*log(k)) bits of extra advice. This completes the description of the side advice, and the proof!

==== Note: A small improvement on the constant in the switching lemma.

We need to keep the ordering on the X_1,X_2,...,X_i in order to decode correctly. Suppose we only encode indices in log_2(k) bits (with no per-index bit marking where each X_j ends), but we write the indices in either (a) increasing order for the set X_j, or (b) decreasing order for the set X_j. The idea is that when the order changes, it means the set has changed.

We want to claim: Regardless of the X_j's, there is a way to impose an increasing or decreasing order on each X_j, such that we will always be able to tell when we move from X_j to X_{j+1}...

Certainly for every *pair* X_j, X_{j+1}, we can do this. Suppose the smallest index in X_{j+1} is smaller than the largest index in X_j. Then we can write the X_j indices in increasing order, and X_{j+1} in increasing order: there is an increase until an abrupt decrease, then an increase again. Suppose the smallest index in X_{j+1} is larger than (or equal to) the largest index in X_j. Then we can write X_j in decreasing order, and then X_{j+1} in increasing order, and still tell when we move from X_j to X_{j+1}. Failing that, we could add one bit to the end of each index in X_{j+1}, to signal the end of that set.
We can either add these bits to the end of the even j's or to the end of the odd j's, whichever is smaller. This means we use at most t*log_2(k) + t/2 + t bits, which gives, instead of a 2^3 = 8 factor, a 2^{1+3/2} ~ 5.66 factor. Could we encode triples? Probably the best we could hope for is t*log_2(k) + t bits, giving a factor of 4.

==== The main case of the switching lemma that we care about (in the application to AC0 lower bounds) is when t = k. Then, we have to encode sets X_1,...,X_i where sum_j |X_j| = k, along with a k-bit assignment to the vars in the sets. We can learn each set X_j as a subset of variables in some clause C_j of size <= k. And we can only learn X_j after learning X_1,...,X_{j-1}, it seems...
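To close, the random-restriction experiment that drives all of the above (PARITY survives a restriction, small circuits die) is easy to simulate. This is only an illustrative sketch; `restrict_to_one_var` and `small_dnf` are invented for this demo, and the extreme case of leaving a single variable alive is used for simplicity:

```python
import random
from functools import reduce

def parity(bits):
    # XOR of all bits (sum modulo 2)
    return reduce(lambda a, b: a ^ b, bits, 0)

def restrict_to_one_var(f, n, rng):
    """Random restriction leaving one variable alive: pick a variable i at
    random, set all others to random bits, return the function b -> f(...)."""
    i = rng.randrange(n)
    fixed = [rng.randint(0, 1) for _ in range(n)]
    def g(b):
        x = fixed[:]
        x[i] = b
        return f(x)
    return g

rng = random.Random(0)  # fixed seed for reproducibility

# PARITY never collapses to a constant: the surviving variable always matters.
for _ in range(100):
    g = restrict_to_one_var(parity, 20, rng)
    assert g(0) != g(1)

# A small DNF, by contrast, almost always collapses to a constant.
def small_dnf(x):  # a toy 3-DNF on 20 variables
    return (x[0] & x[1] & x[2]) | (x[3] & x[4] & x[5]) | (x[6] & x[7] & x[8])

constant = 0
for _ in range(1000):
    g = restrict_to_one_var(small_dnf, 20, rng)
    if g(0) == g(1):
        constant += 1
# `constant` counts the trials in which the restricted DNF became a constant.
```

For this toy DNF, the restricted function is constant in the large majority of trials (the free variable must land in a term whose other literals were set true, while all other terms were killed), while PARITY is never constant: exactly the dichotomy the A/FSS/H lower bounds exploit.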