\documentclass[10pt]{article}
\newtheorem{define}{Definition}
%\newcommand{\Z}{{\mathbb{Z}}}
%\usepackage{psfig}

\oddsidemargin=0.15in
\evensidemargin=0.15in
\topmargin=-.5in
\textheight=9in
\textwidth=6.25in

\begin{document}
\input{preamble.tex}

%{lecture number}{lecture date}{Ronitt Rubinfeld}{Your name}
\lecture{6}{February 27, 2006}{Ronitt Rubinfeld}{Benjamin Snyder}

%%%% body goes in here %%%%
\noindent
Our general goal over the course of the next few lectures will be the following:  given
some $f : \{\pm1\} \rightarrow \{\pm1\}$, output $h : \{\pm1\} \rightarrow \Re$ s.t.
$\Pr_x[ f(x) \ne sgn(h(x)) ] \le \varepsilon$.
\\ \\
In particular, we will be working in the framework known as ``Learning with
Queries under the Uniform distribution.''

\begin{enumerate}
\item We are allowed queries of the form, ``what is $f(x)$?'' (some learning
frameworks allow more general questions).  In fact, we will see that in our
first example we will only require query access for randomly  
chosen values of $x$, instead of for arbitrary values of our own choosing.

\item Error is computed under a uniform distribution.  i.e. no ``special inputs''
\item We won't know which inputs are likely to produce errors -- we just have a bound
on the total error.

\item If $f$ is random (in the sense that a bit is chosen at random to be the output
for a particular input), then there is no way to find such  an $h$ in less than exponential
time, as we would have to sample each possible input.  Instead we will look at learning
$f$ when it comes from certain special families of functions such as (i) constant depth
polysize circuits, and (ii) the set of ``small'' decision trees.

\item Finally, it should be noted that the resulting function $h$ need not be from the
      same special family as $f$.
\end{enumerate}
\noindent
Our algorithms will find a function that approximates $f$ by approximating certain
subsets of the Fourier coefficients.  In the case of small depth circuits, we will
approximate all Fourier coefficients corresponding to small sets $S$, and in the case
of decision trees, we will approximate all Fourier coefficients that have large enough
absolute value.  Thus we will have two potential sources of error -- the first is that we
are completely ignoring most Fourier coefficients and the second is that even for those
that we are considering, we only get an approximation. 
\\ \\
Note however, that we can approximate any Fourier coefficient with arbitrary precision
using sampling in the following way:
for any given set $S \subseteq [n]$, with
$O(\frac{k}{\delta^2})$ samples we can approximate $\hat{f}(S) = 1 - 2\Pr_x[f(x) \ne
\chi_S(x)]$ (an equality proved in a previous lecture) with some value $\tilde{f}$, where 
$\Pr[ | \tilde{f}(S) - \hat{f}(S) | > \delta ] \le e^{-k}$.
This is done by sampling values of $x$ and estimating $\Pr_x[f(x) \ne \chi_S(x)]$.  The
proof proceeds with a straighforward application of Chernoff bounds.
\\ \\
Of course, we will still need to analyze the effect these approximations have on the overall 
error of the outputted function $h$ for each of the individual algorithms.

\section{Learning Circuits}
First we take up the issue of learning $f$ when it is in the family of constant depth
polysize circuits. \\
\\
{\bf input}: $X_1, X_2, ... , X_n \in \{0,1\}^n$
\\
{\bf gates}: $\vee, \wedge, \neg$
\\ \\
Our $\vee$ and $\wedge$ will have unbounded fan-in (i.e. they can take any subset of
the variables as inputs).
\\ \\
Also, by DeMorgan's law we can always push negation into conjunctive and disjunctive expressions 
(e.g. $\neg (X_1 \wedge X_2) = \neg X_1 \vee \neg X_2$). We will assume that that this has been and that the
negation gates will always occur (if at all) at the first level of the circuit, and we 
will therefore not include negation gates in its size.
\\ \\
Now we will switch back to the bit notation of $\{\pm1\}$ and assume that $\wedge$, $\vee$ and
$\neg$ have been mapped to operations with equivalent truth-tables (i.e. where $+1$ takes the
place of $0$, and $-1$ takes the place of $1$).
\\

\begin{definition}
Let ${\cal F}(s,d)$ be the family of functions computable via circuits of size $\le s$ and depth $\le d$
\end{definition}
\noindent
For example, ${\cal F}(2^n, 2)$ is the set of all n-bit functions (since the entire truth-table
can be directly encoded using $2^n$ gates).
\\ \\
{\bf Question}: does ${\cal F}(\poly(n), d)$ (known as $AC^0$) contain the parity function?
\\ \\
This problem was answered in the negative by a series of works due to several authors.  In
fact, what this line of work was able to show was the following:

\begin{theorem}[Hastad, Linial Mansour Nisan]\label{thm1}
For any function f in ${\cal F}(s,d)$, and for $t = 28 \times (14 \times \log(\frac{2s}{\alpha}))^{d-1}$, 
we have $ \sum_{|S| > t} \hat{f}^2 (S) \le \alpha$.
\end{theorem}
\noindent
To make sense of the value of $t$ as we will use it, consider the case where $s = \poly(n)$,
$d$ is a constant, and $\alpha$ is an error constant.  Then we have 
$t = O(\log^d(\frac{n}{\alpha}))$, and we can understand the theorem intuitively as
saying that Fourier coefficients of large sets $S$ (i.e. of linear functions with large
multipliers) can't have too much energy.
\\ \\
Now note that the parity function places all its weight on the single Fourier coefficient
$\hat{f}([n])$ and therefore by Parseval's identity, for any $t < n, \sum_{|S|>t} \hat{f}^2(S)
= 1$. Then, by Theorem \ref{thm1}, $AC^0$ does \emph{not} contain parity.
\\ \\
This is the bad news.  However, some good news can also be found in Theorem \ref{thm1} as the 
following results shows.

\begin{theorem}[LMN]\label{thm2}
For any function $f \in {\cal F}(s,d), \exists$ an algorithm that in time only 
$n^{O(\log(\frac{s}{\varepsilon}))^{d-1}}$ returns a function $g$ s.t. 
$\Pr_x[f(x) \ne sgn(g(x))] \le \varepsilon$.
\end{theorem}

\begin{corollary}
For any function $f \in {\cal F}(\poly(n),d), \exists$ an algorithm that in time only
$n^{O(\log(\frac{n}{\varepsilon}))^{d-1}}$ returns a function $g$ s.t.
$\Pr_x[f(x) \ne sgn(g(x))] \le \varepsilon$.
\end{corollary}

\medskip

\noindent
\textbf{Algorithm} \\
\indent {\bf Input:} function $f$, computable by some circuit of size $s = poly(n)$, constant depth $d$. \\
\indent Let $\alpha = \varepsilon/2$, $t = 28 \times (14 \times \log(\frac{2s}{\alpha}))^{d-1}$, and 
$\delta = \sqrt{\frac{\varepsilon}{2n^t}}$ \\
\indent Use sampling to compute $\tilde{f}(S)$, for $S$ s.t. $|S| < t$, to within error $\delta$ of $\hat{f}(S)$.\\
\indent Return $g(x) = \sum_{|S| < t} \tilde{f}(S) \chi_S(x).$ \\ \\
\medskip
We need $O(\frac{k}{\delta^2}) = O(\frac{n^t}{\varepsilon})$ samples to estimate each $\tilde{f}(S)$ within 
error $\delta$ of $\hat{f}(S)$ with confidence $1-e^{-k}.$ \\ \\
Now, 
\begin{eqnarray*}
\Pr_x[g(x) \ne f(x)] &\le& \E[(f(x) - g(x))^2] \\
&=& \sum_{S}(\hat{f}(S) - \hat{g}(S))^2 \\
&=& \sum_{|S| < t}(\hat{f}(S) - \tilde{f}(S))^2 + \sum_{|S| > t}\hat{f}(S)^2 \\
&\le& \sum_{|S| < t}\delta^2 + \alpha\\
&=& \sum_{i=1}^t {n \choose i} \delta^2 + \alpha \\
&\le& n^t \delta^2 + \alpha \\
&=& \varepsilon/2 + \varepsilon/2 \\
&=& \varepsilon \\
\end{eqnarray*}

\section{Learning Decision Trees}

Recall from last time the Goldreich-Levin (GL) algorithm, whose existence can be
stated as the following theorem:

\begin{theorem}
$\exists$ an algorithm that given parameters $\Theta$ and oracle access to $f : \{\pm1\}^n 
\rightarrow \{\pm1\}$, outputs in time $\poly(n,\frac{1}{\Theta})$ a list $L$ of subsets
of $[n]$ that with high probability finds all $S \subseteq [n]$ with $|\hat{f}(S)| \ge 
\Theta$ (and contains no $S$ s.t. $|\hat{f}(S)| \le \frac{\Theta}{2}$)
\end{theorem}


\begin{definition}
An \emph{m-node decision tree} is any tree of size $m$
whose internal nodes are labeled by values $i \in [n]$ and whose leaf nodes are
labeled by values $j \in \{\pm1\}$.  
\end{definition}

\begin{definition}
A function $f : \{\pm1\}^n \rightarrow \{\pm1\}$ is \emph{computable by an m-node decision tree} if for $\forall x \in \{\pm1\}^n, \exists$ an m-node decision tree s.t. the leaf reached from the traversal of $x$ down the tree has value $f(x)$.
A traversal is governed by the rule that
at a node labeled with $i$, if $x_i = -1$, the traversal continues down the left
child, and if $x_i = +1$, the traversal continues down the right child.
\end{definition}

\begin{claim}\label{claim1}
For any function $f$ computable by an m-node decision tree, the $L_1$ norm of $f$
(defined to be $\sum_Z |\hat{f}(Z)|$) is small (i.e. $\le m$).
\end{claim}

\begin{claim}\label{claim2}
When $L_1(f)$ is small, the large Fourier coefficients are enough to approximate
$f$.  i.e., $g(x) = \sum_{S s.t. |\hat{f}(S)| \ge \varepsilon/2 L_1(f)}
\hat{f}(S) \chi_S(x)$ is s.t. $\E[(f-g)^2] \le \varepsilon/2$
\end{claim}
\noindent
As a result of these claims (which we will prove), we can run the GL algorithm with threshhold parameter 
$\varepsilon/2 L_1(f)$
to obtain a list $L$ of large coefficients.  However, there are two apparent difficulties
with this approach.  (i) GL returns only the \emph{identity} of the large Fourier coefficients,
but not their actual values, and (ii) we do not know the value of $L_1(f)$.  
\\ \\
To solve the first problem, we will again use sampling to estimate the Fourier coefficients (of the
sets $S \subseteq [n]$ which GL returns).
\\ \\
To circumvent the latter problem, we can run GL with a  
sequence of decreasing parameters: $\left\{\varepsilon/2 \tilde{L_1^i}(f)\right\}_{i=0}$,
where $\tilde{L_1^i} = 2^i$, stopping after obtaining a function $h$ s.t. $\Pr_x[ f(x) \ne sgn(h(x)) ] \le \varepsilon$ (to be determined by sampling $x$).  By Claim \ref{claim1}, after at most $\log(m)$ tries, we will have $\tilde{L_1^i} \ge L_1(f)$. In our
exposition of the algorithm, we will assume for the sake of simplicity that $\tilde{L_1^i} = L_1(f)$\\ 

\noindent
\textbf{Algorithm} \\
\indent Let $\tau = \varepsilon/2 L_1(f)$, $\delta = \sqrt{\frac{\varepsilon}{2|L|}}$.  
(Note:  $|L| = \poly(\frac{n}{\tau})$)
\\ \\
\indent Use GL to find $L$, the list of all $S$ s.t. $|\hat{f}(S)| \ge \tau$.
\\
\indent Use sampling to compute $\tilde{f}(S)$ within error $\delta$ of $\hat{f}(S)$.
\\
\indent Return $h(x) = \sum_{S \in L} \tilde{f}(S) \chi_S(x)$.
\\ \\
We need $O(\frac{k}{\delta^2}) = O(\frac{2 \poly(\frac{n}{\tau})}{\varepsilon}) = O(\poly(\frac{nm}{\varepsilon}))$ samples to estimate each $\tilde{f}(S)$ within $\delta$ of $\hat{f}(S)$ with confidence $1-e^{-k}$.
\\ \\
\medskip
\noindent
Now, let $g(x) = \sum_{S s.t. |\hat{f}(S)| \ge \tau} \hat{f}(S) \chi_S(x)$. \\
Then,
\begin{eqnarray*}
\Pr_x[h(x) \ne f(x)] &\le& \Pr_x[h(x) \ne g(x)] + \Pr_x[g(x) \ne f(x)] \\
&\le& \E[(h(x) - g(x))^2] + \E[(g(x) - f(x))^2]\\
&\le& \sum_{S \in L}(\hat{h}(S) - \hat{g}(S))^2 + \varepsilon/2\\
&=& \sum_{S \in L}(\tilde{f}(S) - \hat{f}(S))^2 + \varepsilon/2\\
&\le& \sum_{S \in L} \delta^2 + \varepsilon/2\\
&=& |L|\delta^2 + \varepsilon/2\\
&=& \varepsilon/2 + \varepsilon/2\\
&=& \varepsilon \\
\end{eqnarray*}

\noindent
We restate Claim \ref{claim2} as a lemma and prove the lemma.
\begin{lemma}\label{lemma1}
For $\varepsilon > 0$, Let $S = \{Z s.t. |\hat{f}(Z)| \ge \frac{\varepsilon}{L_1(f)}\}$,
and let $g(x) = \sum_{Z \in S} \hat{f}(Z) \chi_Z(x)$.  Then, 
$\E[(f-g)^2(x)] \le \varepsilon$
\end{lemma}

\begin{proofof}{Lemma \ref{lemma1}}
\begin{eqnarray*}
(f-g)(x) &=& \sum_{S \notin S} \hat{f}(Z)\chi_Z(x),
\end{eqnarray*}
so,
\begin{eqnarray*} 
\E[(f-g)^2] &=& \sum_{S \notin S} \hat{f}^2(Z)\\
&\le& \max_Z |\hat{f}(Z)| \sum_{Z \notin S} |\hat{f}(Z)| \\ 
&=& \frac{\varepsilon}{L_1(f)} L_1(f) \\
&=& \varepsilon 
\end{eqnarray*}
\end{proofof}

\begin{proofof}{Claim \ref{claim1}}

For a fixed leaf $\ell$, define $g_\ell : \{\pm1\}^n \rightarrow \{0,1\}$ to be
$$g(x) = \cases{1 & \hbox{if $x$ reaches $\ell$} \cr 0 & \hbox{o.w.}}.$$
First we will calculate $L_1(g_\ell)$. 
\\ \\
Let $V_\ell$ be the set of indices of variables viewed on the path to $\ell$.  
So if $\ell$
is of depth $k$ in the tree, then $|V_\ell| = k$ (assuming of course that no variable
is seen multiple times on a path down a tree -- any function computable by a decision
tree is computable by a decision tree that has this property).
\\ \\
To get a sense of the form of $g$, consider how we can represent $g$ for the leftmost and
rightmost leaves: \\

$$ g_{(\ell-left)}(x) = \frac{1-x_{\ell_1}}{2}\frac{1-x_{\ell_2}}{2} \cdots 
\frac{1-x_{\ell_k}}{2}$$\\
$$g_{(\ell-right)}(x) = \frac{1+x_{\ell_1}}{2}\frac{1+x_{\ell_2}}{2} \cdots 
\frac{1+x_{\ell_n}}{2}$$

where $\ell_i$ is the index of the $i^{th}$ variable seen on the path to $\ell$.
In general, $g_\ell(x)$ can be represented in this product form, where a $+$ or $-$ is 
chosen for each factor depending on whether the path to $\ell$ makes a right
or left turn after viewing the corresponding variable.  i.e.,

\begin{eqnarray*}
g_\ell(x) &=& \frac{1}{2^{|V_\ell|}} \prod_{i \in V_\ell} (1 \pm x_i) \cr
              &=& \sum_{S \subseteq V_\ell} (\pm1) \prod_{i \in S} x_i 
\hbox{(multivariate polynomial expansion)} \cr
              &=& \frac{1}{2^{|V_\ell|}} \sum_{S \subseteq V_\ell} \chi_S(x)
\end{eqnarray*}

So, $$L_1(g_\ell) = \frac{1}{2^{|V_\ell|}} \sum_{S \subseteq V_\ell}
= \frac{2^{|V_\ell|}}{2^{|V_\ell|}} = 1$$ \\
\\
Now notice that since $f$ is computable by the decision tree in question, and each
$x$ follows exactly one path, we can represent it as follows:

$$f(x) = \sum_\ell g_{\cal}(x) \hbox{val}(\ell)$$ 
(where $\hbox{val}(\ell)$ is the value of the label of $l$).
\\
Then, by linearity of Fourier coefficients (i.e. $\widehat{f+g} = \hat{f} + \hat{g}$), the 
triangle inequality, and the fact that $\hbox{val}(l) \in \{\pm1\}$, we have for any $T \subseteq [n]$\\

$$ |\hat{f}(T)| = |\sum_\ell \hat{g}_\ell(T) \hbox{val}(\ell) | 
\le \sum_\ell |\hat{g}_\ell (T)|$$ \\

Then finally,
\begin{eqnarray*}
L_1(f) = \sum_T |\hat{f}(T)| &\le& \sum_T \sum_\ell |\hat{g}_\ell (T)| \\
                              &=& \sum_\ell \sum_T |\hat{g}_\ell (T)| \\
                              &=& \sum_\ell L_1(g)  \\
                              &=&  \sum_\ell 1 \\
                              &=& m \\
\end{eqnarray*} 
\end{proofof}

\end{document}




