
Problem Formulation

We use $x$ and $y$ to denote a sentence and a dependency tree respectively. We consider score functions of the form $S(x,y)=\theta\cdot\phi(x,y)$, in which $\phi(x,y)$ may include any local and global features.

Given the parameters $\theta$, decoding is a structured prediction task that aims to solve the maximization problem:

$$
\tilde{y}\ =\  \text{argmax}_{\,y\in\mathcal{T}(x)}\ \ S(x,y)
$$

For an arbitrary score function $S(x,y)$, this maximization problem can be NP-hard, even for second-order non-projective parsing. However, we will show that a greedy hill-climbing algorithm works well empirically and, under a certain condition, provably.
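
For concreteness, here is a minimal Python sketch of one way to instantiate the linear score $S(x,y)=\theta\cdot\phi(x,y)$ over head-index arrays; the feature templates and function names are illustrative assumptions, not the feature set of this draft.

<code python>
# A minimal sketch of the linear score S(x, y) = theta . phi(x, y).
# The feature templates below are illustrative assumptions; any local or
# global features can be plugged into phi without changing the interface.
from collections import Counter

def phi(words, heads):
    """Sparse feature vector of a dependency tree.

    words: list of tokens, where words[0] is the artificial root.
    heads: heads[m] is the head index of word m (heads[0] is unused).
    """
    feats = Counter()
    for m in range(1, len(words)):
        h = heads[m]
        feats[("arc", words[h], words[m])] += 1            # local (arc-factored) feature
        feats[("dir", "right" if h < m else "left")] += 1  # attachment direction
    root_degree = sum(1 for m in range(1, len(words)) if heads[m] == 0)
    feats[("root-degree", root_degree)] += 1               # a simple global feature
    return feats

def score(theta, words, heads):
    """S(x, y) = theta . phi(x, y), with theta stored as a sparse dict."""
    return sum(theta.get(f, 0.0) * v for f, v in phi(words, heads).items())
</code>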


Hill Climbing Algorithm

The hill-climbing algorithm operates similarly to Gibbs sampling, changing the head of one modifier word at a time.

Let $\mathcal{T}(y,m)$ be the set of valid trees that differ from tree $y$ only in the head of word $m$. The algorithm is as follows:


Input: $\theta,\quad x$
Output: $\tilde{y}\ \approx\ \text{argmax}_{\, y\in\mathcal{T}(x)}\ \ S(x,y)$
Algorithm:


\begin{enumerate}
\item Randomly initialize tree $y^{(0)}\sim Q$
\item $t = 0$
\item Repeat for $i=1,\ 2,\ \cdots$ do:
\item $\quad$ list = reverse depth-first traversal of $y^{(t)}$
\item $\quad$ For each word $m$ in list do:
\item $\quad\quad$ $y^{(t+1)} = \text{argmax}_{\,y'\in \mathcal{T}(y^{(t)},m)}\ S(x,y')$
\item $\quad\quad$ $t = t+1$
\item Until no change has been made in the $i$-th iteration
\item Return $\tilde{y} = y^{(t)}$
\end{enumerate}


Note that (1) the algorithm initializes $y^{(0)}$ from a distribution $Q$; (2) it repeatedly sweeps over the words in reverse DFS order (so a word that is far from the root comes first); (3) when changing the head of $m$, it must keep $y^{(t)}$ a valid tree at every step $t$; and (4) it iterates until no change can increase the score. A minimal sketch of this procedure is given below.
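
The sketch below reuses the `score` interface from the earlier sketch; trees are head-index arrays with word 0 as the root. The initial distribution $Q$ is simplified here to "attach each word to a random earlier position", which always yields a valid tree; this choice, like the helper names, is an assumption for illustration.

<code python>
# A minimal sketch of the hill-climbing decoder. `score(theta, words, heads)`
# follows the earlier sketch; trees are head-index arrays with word 0 as root.
import random

def is_tree(heads):
    """Every word must reach the root (index 0) without hitting a cycle."""
    for m in range(1, len(heads)):
        seen, h = {m}, heads[m]
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h]
    return True

def reverse_dfs_order(heads):
    """Reverse depth-first traversal: every word is visited before its ancestors."""
    children = {h: [] for h in range(len(heads))}
    for m in range(1, len(heads)):
        children[heads[m]].append(m)
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])
    return [m for m in reversed(order) if m != 0]

def hill_climb(theta, words, score):
    n = len(words)
    # y^(0) ~ Q: attach each word to a random earlier position (always a tree)
    heads = [0] + [random.randrange(0, m) for m in range(1, n)]
    changed = True
    while changed:                                  # repeat until a sweep makes no change
        changed = False
        for m in reverse_dfs_order(heads):          # words far from the root come first
            best_h, best_s = heads[m], score(theta, words, heads)
            for h in range(n):                      # candidates in T(y^(t), m)
                if h in (m, heads[m]):
                    continue
                cand = heads[:]
                cand[m] = h
                if is_tree(cand):                   # keep y^(t) a valid tree
                    s = score(theta, words, cand)
                    if s > best_s:
                        best_h, best_s = h, s
            if best_h != heads[m]:
                heads[m] = best_h
                changed = True
    return heads
</code>

Decoding a sentence then amounts to `y_tilde = hill_climb(theta, words, score)`.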


Justification of the Algorithm

Let us first denote a given sentence as $\hat{x}$ and the corresponding correct dependency tree as $\hat{y}$, which is provided during training. At test time, we simply assume that such a correct tree $\hat{y}$ exists but is unknown.

To justify the algorithm, two questions need to be answered:

  1. Does the algorithm find the correct tree $\hat{y}$, and under what condition?
  2. What is the learning problem, and how hard is it?


Q1: Sufficient Conditions

We first consider how things work in a traditional structured prediction model. The traditional model aims to learn a score function that scores $\hat{y}$ higher than all other trees:

\begin{eqnarray*}
S(\hat{x}, \hat{y}) \;\geq\; S(\hat{x}, y) + \text{err}(y) \quad\quad \forall\, y\in \mathcal{T}(\hat{x}) \quad\quad\quad\quad\quad\quad\quad (1)
\end{eqnarray*}

where $\text{err}(y)$ is the number of wrong dependency arcs in tree $y$ (i.e., the $\ell_1$ structural loss). When condition (1) holds, solving the maximization problem gives us $\tilde{y}=\hat{y}$, but the problem is NP-hard in general.


Alternatively, we can force the model to learn a similar but stronger score function, which makes a simpler maximization algorithm possible:

\begin{eqnarray*}
S(\hat{x}, y) \;\geq\; S(\hat{x}, y') + 1 \quad\quad \forall\, y \in \mathcal{T}(y',m)\ \ \text{s.t.}\ \ \text{err}(y)=\text{err}(y')-1 \quad\quad\quad\ \ (2)
\end{eqnarray*}

Condition (2) states that the score increases by at least 1 whenever we correct the head of a modifier. Note that condition (2) implies condition (1): summing the inequality in (2) along a path from $y$ to $\hat{y}$ that corrects one wrong head at a time telescopes to (1). If condition (2) holds, we expect the hill-climbing algorithm to successfully climb to the global optimum $\hat{y}$.

There is a non-trivial point left: we need to show that when the algorithm corrects each head in a suitable order, the structure remains a valid tree at every step $t$. The following claim establishes the correctness of the hill-climbing algorithm.

Claim:

Suppose condition (2) holds. Then the hill-climbing algorithm terminates after $t=O(n)$ updates and returns the correct tree $\hat{y}$, where $n$ is the number of words in $x$. Specifically, for $t=0,\ 1,\ \cdots,\ n$:

$$
\text{err}(y^{(0)}) \ge \text{err}(y^{(1)}) \ge \cdots \ge \text{err}(y^{(n)}) = 0
$$

and therefore $y^{(n)}=\hat{y}$.

Proof:

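Independently of the formal argument, condition (2) can be verified by brute force on very short sentences. The sketch below reuses `score` and `is_tree` from the earlier sketches and enumerates every tree $y'$ together with every single-head change that removes exactly one error; it is an illustrative sanity check, not part of the proof.

<code python>
# Brute-force check of condition (2) on short sentences. `score` and `is_tree`
# follow the earlier sketches; `gold` is the correct head array for the sentence.
from itertools import product

def errors(heads, gold):
    """err(y): number of wrong dependency arcs."""
    return sum(1 for m in range(1, len(gold)) if heads[m] != gold[m])

def all_trees(n):
    """All valid head arrays over n tokens (token 0 is the root). Exponential,
    so only usable as a sanity check on very short sentences."""
    for choice in product(range(n), repeat=n - 1):
        heads = [0] + list(choice)
        if is_tree(heads):
            yield heads

def condition2_holds(theta, words, gold, score):
    n = len(words)
    for y_prime in all_trees(n):
        for m in range(1, n):
            for h in range(n):                      # y in T(y', m)
                if h in (m, y_prime[m]):
                    continue
                y = list(y_prime)
                y[m] = h
                if not is_tree(y):
                    continue
                if errors(y, gold) == errors(y_prime, gold) - 1:
                    # condition (2): S(x, y) >= S(x, y') + 1
                    if score(theta, words, y) < score(theta, words, y_prime) + 1:
                        return False
    return True
</code>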

Q2: Learning Problem

To learn the score function from training data $D=\left\lbrace(\hat{x_i},\hat{y_i})\right\rbrace_{i=1}^{N}$, the standard approach is to minimize the regularized (hinge) loss on the training data:

$$
\sum_{i=1}^{N} \mathcal{L}(\hat{x_i},\hat{y_i}) + \|\theta\|_2^2
$$
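
The sketch below shows a generic stochastic (sub)gradient loop for this objective; `loss_and_subgradient` is a placeholder for a per-example loss such as (1o) below, and the step size and regularization weight are illustrative assumptions rather than values taken from this draft.

<code python>
# A generic (sub)gradient-descent sketch of the regularized training objective.
# `loss_and_subgradient(theta, words, gold)` must return the per-example loss
# and a sparse subgradient w.r.t. theta; eta and reg are illustrative choices.
def train(data, loss_and_subgradient, epochs=10, eta=0.1, reg=1e-4):
    theta = {}                                      # sparse parameter vector
    for _ in range(epochs):
        for words, gold in data:
            _, grad = loss_and_subgradient(theta, words, gold)
            for f, g in grad.items():               # hinge-loss subgradient step
                theta[f] = theta.get(f, 0.0) - eta * g
            for f in list(theta):                   # l2 regularization step
                theta[f] -= eta * 2.0 * reg * theta[f]
    return theta
</code>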

If we are to minimize the loss that corresponds to condition (2), it would be:

$$
\mathcal{L}(\hat{x_i},\hat{y_i}) = \max_{\substack{y,\,y',\,m:\ y\in \mathcal{T}(y',m) \\ \text{err}(y)=\text{err}(y')-1}} \left\lbrace S(\hat{x_i}, y') + 1 - S(\hat{x_i}, y) \right\rbrace \quad\quad\quad (2o)
$$

Unfortunately, it is not clear whether there is an efficient way to compute this loss, i.e., to find the maximizing pair $(y, y')$. Instead, we can use the standard structured learning loss function, which corresponds to condition (1):

$$
\mathcal{L}(\hat{x_i},\hat{y_i}) = \max_{\tilde{y}}\left\lbrace S(\hat{x_i}, \tilde{y}) + \text{err}(\tilde{y})\right\rbrace - S(\hat{x_i}, \hat{y_i}) \quad\quad (1o)
$$

where the maximizer $\tilde{y}$ is obtained (approximately) by running the decoding algorithm on the cost-augmented score $S(\hat{x_i}, y) + \text{err}(y)$.
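
A sketch of how (1o) and its subgradient could be computed with the hill-climbing decoder is given below; it reuses `phi`, `score`, `hill_climb`, and `errors` from the earlier sketches and runs the decoder on the cost-augmented score.

<code python>
# Loss (1o) and its subgradient, approximated with cost-augmented hill climbing.
# phi, score, hill_climb, and errors follow the earlier sketches; `gold` is the
# correct head array \hat{y}_i.
def loss_1o(theta, words, gold):
    # cost-augmented decoding: (approximately) argmax_y S(x, y) + err(y)
    def aug_score(theta, words, heads):
        return score(theta, words, heads) + errors(heads, gold)

    y_tilde = hill_climb(theta, words, aug_score)
    loss = aug_score(theta, words, y_tilde) - score(theta, words, gold)
    if loss <= 0.0:                 # with approximate decoding the hinge can be inactive
        return 0.0, {}
    grad = phi(words, y_tilde)      # subgradient: phi(x, y~) - phi(x, y^)
    grad.subtract(phi(words, gold))
    return loss, dict(grad)
</code>

With these pieces, training amounts to `theta = train(data, loss_1o)` using the earlier loop.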

The connection between the two loss functions is the following: condition (1) is a necessary condition of condition (2), and (1o) is a lower bound on the sum of the per-pair losses in (2o) over consecutive pairs $(y, y')$ on any “climbing” path from $\tilde{y}$ to $\hat{y}$. Thus, by minimizing (1o) we push this lower bound toward zero. If the bound is tight (because $\text{err}(\tilde{y})$ is small and the path from $\tilde{y}$ to $\hat{y}$ is short), then minimizing (1o) also approximately minimizes the violations of condition (2) along that path, so the learned model approximately satisfies condition (2).
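
To make the lower-bound relation concrete, the following telescoping sketch assumes a climbing path $\tilde{y}=y_0,\ y_1,\ \cdots,\ y_K=\hat{y}$ with $K=\text{err}(\tilde{y})$ steps, each of which corrects exactly one wrong head:

$$
\sum_{k=0}^{K-1} \max\!\left(0,\ S(\hat{x},y_k)+1-S(\hat{x},y_{k+1})\right)
\ \ge\ \sum_{k=0}^{K-1} \left( S(\hat{x},y_k)+1-S(\hat{x},y_{k+1}) \right)
\ =\ S(\hat{x},\tilde{y})+\text{err}(\tilde{y})-S(\hat{x},\hat{y})
$$

The left-hand side is the total condition-(2) violation along the path, and the right-hand side equals (1o) when $\tilde{y}$ is the maximizer in (1o); hence (1o) lower-bounds that sum.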
